| --- |
| title: "Installing the Arrow Package on Linux" |
| output: rmarkdown::html_vignette |
| vignette: > |
| %\VignetteIndexEntry{Installing the Arrow Package on Linux} |
| %\VignetteEngine{knitr::rmarkdown} |
| %\VignetteEncoding{UTF-8} |
| --- |
| |
| On macOS and Windows, when you `install.packages("arrow")`, |
| you get a binary package that contains Arrow’s C++ dependencies along with it. |
| On Linux, `install.packages()` retrieves a source package that has to be compiled locally, |
| and C++ dependencies need to be resolved as well. |
| Generally for R packages with C++ dependencies, |
| this requires either installing system packages, which you may not have privileges to do, |
| or building the C++ dependencies separately, |
| which introduces all sorts of additional ways for things to go wrong. |
| |
| Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions, |
| versions, and configurations as possible. |
| This document describes how it works and the options for fine-tuning Linux installation. |
| The intended audience for this document is `arrow` R package users on Linux, not developers. |
| If you're contributing to the Arrow project, |
| you'll probably want to manage your C++ installation more directly. |
| Note also that if you use `conda` to manage your R environment, this document does not apply. |
| You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official |
| release of the R package along with any C++ dependencies. |
| |
| # Installation basics |
| |
| Install the latest release of `arrow` from CRAN with |
| |
| ```r |
| install.packages("arrow") |
| ``` |
| |
| Daily development builds, which are not official releases, |
| can be installed from the Ursa Labs repository: |
| |
| ```r |
| install.packages("arrow", repos = "https://dl.bintray.com/ursalabs/arrow-r") |
| ``` |
| |
| There currently are no daily `conda` builds. |
| |
| You can also install the R package from a git checkout: |
| |
| ```shell |
| git clone https://github.com/apache/arrow |
| cd arrow/r |
| R CMD INSTALL . |
| ``` |
| |
| If you don't already have the Arrow C++ libraries on your system, |
| when installing the R package from source, it will also download and build |
| the Arrow C++ libraries for you. To speed installation up, you can set |
| |
| ```shell |
| export LIBARROW_BINARY=true |
| ``` |
| |
| to look for C++ binaries prebuilt for your Linux distribution/version. |
| Alternatively, you can set |
| |
| ```shell |
| export LIBARROW_MINIMAL=false |
| ``` |
| |
| to build the Arrow libraries with optional features such as compression libraries |
| enabled. This will increase the build time but provides many useful features. |
| Prebuilt binaries are built with this flag enabled, so you get the full |
| functionality by using them as well. |
| |
| Both of these variables are also set this way if you have the `NOT_CRAN=true` |
| environment variable set. |
| |
| If you already have `arrow` installed and want to upgrade to a different version, |
| install a development build, or try to reinstall and fix issues with Linux |
| C++ binaries, you can call `install_arrow()`. |
| This function is part of the `arrow` package, |
| and it is also available as a standalone script, so you can |
| access it for convenience without first installing the package: |
| |
| ```r |
| source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R") |
| ``` |
| |
| `install_arrow()` will install from CRAN, |
| while `install_arrow(nightly = TRUE)` will give you a development build. |
| `install_arrow()` does not require environment variables to be set in order to |
| satisfy C++ dependencies. |
| |
| <!-- TODO: does remotes::install_github("apache/arrow/r") work now? --> |
| |
| # How dependencies are resolved |
| |
| In order for the `arrow` R package to work, it needs the Arrow C++ library. |
| There are a number of ways you can get it: a system package; a library you've |
| built yourself outside of the context of installing the R package; |
| or, if you don't already have it, the R package will attempt to resolve it |
| automatically when it installs. |
| |
| If you are authorized to install system packages and you're installing a CRAN release, |
| you may want to use the official Apache Arrow release packages corresponding to the R package version. |
| See the [Arrow project installation page](https://arrow.apache.org/install/) |
| to find pre-compiled binary packages for some common Linux distributions, |
| including Debian, Ubuntu, and CentOS. |
| You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. |
| This will also automatically install the Arrow C++ library as a dependency. |
| |
| When you install the `arrow` R package on Linux, |
| it will first attempt to find the Arrow C++ libraries on your system using |
| the `pkg-config` command. |
| This will find either installed system packages or libraries you've built yourself. |
| In order for `install.packages("arrow")` to work with these system packages, |
| you'll need to install them before installing the R package. |
| |
| If no Arrow C++ libraries are found on the system, |
| the R package installation script will next attempt to download |
| prebuilt static Arrow C++ libraries |
| that match your both your local operating system and `arrow` R package version. |
| C++ libraries (source or binary) will only be retrieved if you have set the environment variable |
| `LIBARROW_BINARY` or `NOT_CRAN`. |
| If found, they will be downloaded and bundled when your R package compiles. |
| For a list of supported distributions and versions, |
| see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project. |
| |
| If no binary is found, it will download the Arrow C++ source that matches the R package version |
| (CRAN release or nightly build) and attempt to build it locally. |
| If no matching source bundle is found, it will also look to see if you are in |
| a checkout of the `apache/arrow` git repository and thus have the C++ source there. |
| Depending on your system, building Arrow C++ from source likely will be slow; |
| consequently, it is designed to happen only when you |
| run `install.packages("arrow")` or `R CMD INSTALL` but not when running `R CMD check`, |
| unless you've set the `NOT_CRAN=true` environment variable. |
| |
| For the mechanics of how all this works, see the R package `configure` script, |
| which calls `tools/linuxlibs.R`. |
| If the C++ library is built from source, `inst/build_arrow_static.sh` is executed. |
| This build script is also what is used to generate the prebuilt binaries. |
| |
| # Troubleshooting and additional options |
| |
| The intent is that `install.packages("arrow")` will just work and handle all C++ |
| dependencies, but depending on your system, you may have better results if you |
| tune one of several parameters. Here are some known complications and ways to address them. |
| |
| ## Package installed without C++ dependencies |
| |
| If you get an error like |
| |
| ``` |
| Cannot call io___MemoryMappedFile__Open(). Please use arrow::install_arrow() to install required runtime libraries. |
| ``` |
| |
| for every `arrow` function you call, |
| that means that installing the package failed to retrieve or build C++ libraries |
| compatible with the current version of the R package. `install_arrow()` provides |
| some convenience wrappers around the various environment variables described below |
| and has defaults that should isolate the package installation and make it |
| more likely to succeed. See its help page for details. |
| |
| As of version 0.17, it is expected that C++ dependencies should be built successfully |
| on all Linux distributions, so if you see this message because the C++ libraries |
| errored while building, |
| please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). |
| |
| ## Using system libraries |
| |
| If a system library or other installed Arrow is found but it doesn't match the R package version |
| (for example, you have libarrow 0.14 on your system and are installing R package 0.15.1), |
| it is likely that the R bindings will fail to compile. |
| Because the Apache Arrow project is under active development, |
| is it essential that versions of the C++ and R libraries match. |
| When `install.packages("arrow")` has to download the C++ libraries, |
| the install script ensures that you fetch the C++ libraries that correspond to your R package version. |
| However, if you are using Arrow libraries already on your system, version match isn't guaranteed. |
| |
| To fix version mismatch, you can either update your system packages to match the R package version, |
| or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE` |
| to tell the configure script not to look for system Arrow packages. |
| (The latter is the default of `install_arrow()`.) |
| System packages are available corresponding to all CRAN releases |
| but not for nightly or dev versions, so depending on the R package version you're installing, |
| system packages may not be an option. |
| |
| Note also that once you have a working R package installation based on system (shared) libraries, |
| if you update your system Arrow, you'll need to reinstall the R package to match its version. |
| Similarly, if you're using Arrow system libraries, running `update.packages()` |
| after a new release of the `arrow` package will likely fail unless you first |
| update the system packages. |
| |
| ## Using a local Arrow C++ build |
| |
| If you've built the Arrow C++ libraries locally from source |
| but haven't installed them where `pkg-config` will find them, |
| there are a few options for telling the R package how to locate them. |
| You can set `PKG_CONFIG_PATH` to `/path/to/your/installation/pkgconfig` |
| (that is, `PKG_CONFIG_PATH=${CMAKE_INSTALL_PREFIX}/${CMAKE_INSTALL_LIBDIR}/pkgconfig`, |
| if you've set those variables). |
| Alternatively, you can set the `INCLUDE_DIR` and `LIB_DIR` environment variables |
| to point to their location. |
| |
| If the package fails to install/load with an error like this: |
| |
| ``` |
| ** testing if installed package can be loaded from temporary location |
| Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...): |
| unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so': |
| dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib |
| ``` |
| |
| try setting the environment variable `R_LD_LIBRARY_PATH` to wherever Arrow C++ |
| was put in `make install`, e.g. `export R_LD_LIBRARY_PATH=/usr/local/lib`, and |
| retry installing the R package. |
| |
| ## Using prebuilt binaries |
| |
| If the R package finds and downloads a prebuilt binary of the C++ library, |
| but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors, |
| please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). |
| This is likely a compiler mismatch and may be resolvable by setting some |
| environment variables to instruct R to compile the packages to match the C++ library. |
| |
| A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE` |
| and retry installation: this value instructs the package to build the C++ library from source |
| instead of downloading the prebuilt binary. |
| That should guarantee that the compiler settings match. |
| |
| If a prebuilt binary wasn't found for your operating system but you think it should have been, |
| check the logs for a message that says `*** Unable to identify current OS/version`, |
| or a message that says `*** No C++ binaries found for` an invalid OS. |
| If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). |
| You may also set the environment variable `ARROW_R_DEV=TRUE` for additional |
| debug messages. |
| |
| A workaround would be to set the environment variable `LIBARROW_BINARY` |
| to a `distribution-version` that exists in the Ursa Labs repository. |
| Setting `LIBARROW_BINARY` is also an option when there's not an exact match |
| for your OS but a similar version would work, |
| such as if you're on `ubuntu-18.10` and there's only a binary for `ubuntu-18.04`. |
| |
| If that workaround works for you, and you believe that it should work for everyone else too, |
| you may propose [adding an entry to this lookup table](https://github.com/ursa-labs/arrow-r-nightly/edit/master/linux/distro-map.csv). |
| This table is checked during the installation process |
| and tells the script to use binaries built on a different operating system/version |
| because they're known to work. |
| |
| ## Building C++ from source |
| |
| If building the C++ library from source fails, check the error message. |
| The install script attempts to install any necessary build dependencies, |
| but it's possible that some operating systems may require additional ones. |
| You may be able to install them and retry. |
| Regardless, if the C++ library fails to compile, |
| please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) |
| so that we can attempt to improve the script. |
| |
| ## Known installation issues |
| |
| * On CentOS, if you are using a more modern `devtoolset`, you may need to set |
| the environment variables `CC=/usr/bin/gcc CXX=/usr/bin/g++`. |
| See discussion [here](https://issues.apache.org/jira/browse/ARROW-8586). |
| |
| * If you have multiple versions of `zstd` installed on your system, |
| installation by building the C++ from source may fail with an undefined symbols |
| error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2) |
| setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling |
| the conflicting `zstd`. |
| See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556). |
| |
| ## Summary of build environment variables |
| |
| By default, these are all unset. All boolean variables are case-insensitive. |
| |
| * `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script |
| won't look for Arrow libraries on your system and instead will look to download/build them. |
| Use this if you have a version mismatch between installed system libraries |
| and the version of the R package you're installing. |
| * `LIBARROW_DOWNLOAD`: Unless set to `false`, the build script |
| will attempt to download C++ binary or source bundles. |
| If you're in a checkout of the `apache/arrow` git repository |
| and want to build the C++ library from the local source, make this `false`. |
| * `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary |
| C++ library built for your operating system. |
| You may also set it to some other string, |
| a related "distro-version" that has binaries built that work for your OS. |
| If no binary is found, installation will fall back to building C++ |
| dependencies from source. |
| * `LIBARROW_BUILD`: If set to `false`, the build script |
| will not attempt to build the C++ from source. This means you will only get |
| a working `arrow` R package if a prebuilt binary is found. |
| Use this if you want to avoid compiling the C++ library, which may be slow |
| and resource-intensive, and ensure that you only use a prebuilt binary. |
| * `LIBARROW_MINIMAL`: If set to `false`, the build script |
| will enable some optional features, including compression libraries and the |
| `jemalloc` memory allocator. This will increase the source build time but |
| results in a more fully functional library. |
| * `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does, |
| the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` |
| unless those environment variables are already set. This provides for a more |
| complete and fast installation experience for users who already have |
| `NOT_CRAN=true` as part of their workflow, without requiring additional |
| environment variables to be set. |
| * `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed |
| in the build script. This variable also is needed if you're modifying `Rcpp` |
| code in the package: see "Editing Rcpp code" in the README. |
| * `LIBARROW_DEBUG_DIR`: If the C++ library building from source fails (`cmake`), |
| there may be messages telling you to check some log file in the build directory. |
| However, when the library is built during R package installation, |
| that location is in a temp directory that is already deleted. |
| To capture those logs, set this variable to an absolute (not relative) path |
| and the log files will be copied there. |
| The directory will be created if it does not exist. |
| * `CMAKE`: When building the C++ library from source, you can specify a |
| `/path/to/cmake` to use a different version than whatever is found on the `$PATH` |
| |
| # Contributing |
| |
| As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) |
| if you encounter ways to improve this. If you find that your Linux distribution |
| or version is not supported, we welcome the contribution of Docker images |
| (hosted on Docker Hub) that we can use in our continuous integration. These |
| Docker images should be minimal, containing only R and the dependencies it |
| requires. (For reference, see the images that |
| [R-hub](https://github.com/r-hub/rhub-linux-builders) uses.) |
| |
| You can test the `arrow` R package installation using the `docker-compose` |
| setup included in the `apache/arrow` git repository. For example, |
| |
| ``` |
| R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r |
| R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r |
| ``` |
| |
| installs the `arrow` R package, including the C++ source build, on the |
| [rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release) |
| image. |