| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. _building-arrow-cpp: |
| |
| ================== |
| Building Arrow C++ |
| ================== |
| |
| System setup |
| ============ |
| |
| Arrow uses CMake as a build configuration system. We recommend building |
| out-of-source. If you are not familiar with this terminology: |
| |
| * **In-source build**: ``cmake`` is invoked directly from the ``cpp`` |
| directory. This can be inflexible when you wish to maintain multiple build |
| environments (e.g. one for debug builds and another for release builds) |
| * **Out-of-source build**: ``cmake`` is invoked from another directory, |
| creating an isolated build environment that does not interact with any other |
| build environment. For example, you could create ``cpp/build-debug`` and |
| invoke ``cmake $CMAKE_ARGS ..`` from this directory |
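| |
| As a rough illustration of the out-of-source layout, two independent build |
| environments could be set up side by side. The directory names ``build-debug`` |
| and ``build-release`` are arbitrary choices, not names required by the build |
| system: |
| |
| .. code-block:: shell |
| |
| # Run from the root of an Arrow checkout; each directory keeps its own CMake cache |
| mkdir -p cpp/build-debug cpp/build-release |
| (cd cpp/build-debug && cmake -DCMAKE_BUILD_TYPE=Debug ..) |
| (cd cpp/build-release && cmake -DCMAKE_BUILD_TYPE=Release ..) |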
| |
| Building requires: |
| |
| * A C++11-enabled compiler. On Linux, gcc 4.8 and higher should be |
| sufficient. For Windows, at least Visual Studio 2015 is required. |
| * CMake 3.5 or higher |
| * On Linux and macOS, either ``make`` or ``ninja`` build utilities |
| |
| On Ubuntu/Debian you can install the requirements with: |
| |
| .. code-block:: shell |
| |
| sudo apt-get install \ |
| build-essential \ |
| cmake |
| |
| On Alpine Linux: |
| |
| .. code-block:: shell |
| |
| apk add autoconf \ |
| bash \ |
| cmake \ |
| g++ \ |
| gcc \ |
| make |
| |
| On macOS, you can use `Homebrew <https://brew.sh/>`_: |
| |
| .. code-block:: shell |
| |
| git clone https://github.com/apache/arrow.git |
| cd arrow |
| brew update && brew bundle --file=cpp/Brewfile |
| |
| On MSYS2: |
| |
| .. code-block:: shell |
| |
| pacman --sync --refresh --noconfirm \ |
| ccache \ |
| git \ |
| mingw-w64-${MSYSTEM_CARCH}-boost \ |
| mingw-w64-${MSYSTEM_CARCH}-brotli \ |
| mingw-w64-${MSYSTEM_CARCH}-cmake \ |
| mingw-w64-${MSYSTEM_CARCH}-gcc \ |
| mingw-w64-${MSYSTEM_CARCH}-gflags \ |
| mingw-w64-${MSYSTEM_CARCH}-glog \ |
| mingw-w64-${MSYSTEM_CARCH}-gtest \ |
| mingw-w64-${MSYSTEM_CARCH}-lz4 \ |
| mingw-w64-${MSYSTEM_CARCH}-protobuf \ |
| mingw-w64-${MSYSTEM_CARCH}-python3-numpy \ |
| mingw-w64-${MSYSTEM_CARCH}-rapidjson \ |
| mingw-w64-${MSYSTEM_CARCH}-snappy \ |
| mingw-w64-${MSYSTEM_CARCH}-thrift \ |
| mingw-w64-${MSYSTEM_CARCH}-zlib \ |
| mingw-w64-${MSYSTEM_CARCH}-zstd |
| |
| Building |
| ======== |
| |
| The build system uses ``CMAKE_BUILD_TYPE=release`` by default, so if this |
| argument is omitted then a release build will be produced. |
| |
| .. note:: |
| |
| You need more options to build on Windows. See |
| :ref:`developers-cpp-windows` for details. |
| |
| Minimal release build: |
| |
| .. code-block:: shell |
| |
| git clone https://github.com/apache/arrow.git |
| cd arrow/cpp |
| mkdir release |
| cd release |
| cmake .. |
| make |
| |
| Minimal debug build with unit tests: |
| |
| .. code-block:: shell |
| |
| git clone https://github.com/apache/arrow.git |
| cd arrow/cpp |
| mkdir debug |
| cd debug |
| cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON .. |
| make unittest |
| |
| The unit tests are not built by default. After building, one can also invoke |
| the unit tests using the ``ctest`` tool provided by CMake (note that ``test`` |
| depends on ``python`` being available). |
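| |
| A minimal sketch of running the tests with ``ctest`` from the build directory |
| follows. The job count and the ``-R`` pattern are illustrative assumptions; |
| adjust them to your machine and to the test names present in your build: |
| |
| .. code-block:: shell |
| |
| # Run all registered tests, 4 at a time, printing output only for failures |
| ctest --output-on-failure -j4 |
| # Run only the tests whose names match a regular expression (example pattern) |
| ctest --output-on-failure -R "arrow-" |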
| |
| On some Linux distributions, running the test suite might require setting an |
| explicit locale. If you see any locale-related errors, try setting the ``LC_ALL`` |
| environment variable (which requires the ``locales`` package or equivalent): |
| |
| .. code-block:: shell |
| |
| export LC_ALL="en_US.UTF-8" |
| |
| Faster builds with Ninja |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Many contributors use the `Ninja build system <https://ninja-build.org/>`_ to |
| get faster builds. It especially speeds up incremental builds. To use |
| ``ninja``, pass ``-GNinja`` when calling ``cmake`` and then use the ``ninja`` |
| command instead of ``make``. |
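| |
| For instance, a release build with Ninja might look like this, assuming |
| ``ninja`` is installed and on your ``PATH`` (the directory name |
| ``ninja-release`` is an arbitrary choice): |
| |
| .. code-block:: shell |
| |
| cd arrow/cpp |
| mkdir ninja-release |
| cd ninja-release |
| # Generate Ninja build files instead of Makefiles |
| cmake -GNinja .. |
| ninja |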
| |
| Optional Components |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| By default, the C++ build system creates a fairly minimal build. We have |
| several optional system components which you can opt into building by passing |
| boolean flags to ``cmake``. |
| |
| * ``-DARROW_COMPUTE=ON``: Computational kernel functions and other support |
| * ``-DARROW_CSV=ON``: CSV reader module |
| * ``-DARROW_CUDA=ON``: CUDA integration for GPU development. Depends on NVIDIA |
| CUDA toolkit. The CUDA toolchain used to build the library can be customized |
| by using the ``$CUDA_HOME`` environment variable. |
| * ``-DARROW_DATASET=ON``: Dataset API, implies the Filesystem API |
| * ``-DARROW_FILESYSTEM=ON``: Filesystem API for accessing local and remote |
| filesystems |
| * ``-DARROW_FLIGHT=ON``: Arrow Flight RPC system, which depends at least on |
| gRPC |
| * ``-DARROW_GANDIVA=ON``: Gandiva expression compiler, depends on LLVM, |
| Protocol Buffers, and re2 |
| * ``-DARROW_GANDIVA_JAVA=ON``: Gandiva JNI bindings for Java |
| * ``-DARROW_HDFS=ON``: Arrow integration with libhdfs for accessing the Hadoop |
| Filesystem |
| * ``-DARROW_HIVESERVER2=ON``: Client library for HiveServer2 database protocol |
| * ``-DARROW_JSON=ON``: JSON reader module |
| * ``-DARROW_ORC=ON``: Arrow integration with Apache ORC |
| * ``-DARROW_PARQUET=ON``: Apache Parquet libraries and Arrow integration |
| * ``-DARROW_PLASMA=ON``: Plasma Shared Memory Object Store |
| * ``-DARROW_PLASMA_JAVA_CLIENT=ON``: Build Java client for Plasma |
| * ``-DARROW_PYTHON=ON``: Arrow Python C++ integration library (required for |
| building pyarrow). This library must be built against the same Python version |
| for which you are building pyarrow. NumPy must also be installed. Enabling |
| this option also enables ``ARROW_COMPUTE``, ``ARROW_CSV``, ``ARROW_DATASET``, |
| ``ARROW_FILESYSTEM``, ``ARROW_HDFS``, and ``ARROW_JSON``. |
| * ``-DARROW_S3=ON``: Support for Amazon S3-compatible filesystems |
| * ``-DARROW_WITH_BZ2=ON``: Build support for BZ2 compression |
| * ``-DARROW_WITH_ZLIB=ON``: Build support for zlib (gzip) compression |
| * ``-DARROW_WITH_LZ4=ON``: Build support for lz4 compression |
| * ``-DARROW_WITH_SNAPPY=ON``: Build support for Snappy compression |
| * ``-DARROW_WITH_ZSTD=ON``: Build support for ZSTD compression |
| * ``-DARROW_WITH_BROTLI=ON``: Build support for Brotli compression |
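| |
| As a sketch, several of these flags can be combined in a single ``cmake`` |
| invocation. The selection below (Parquet, the CSV reader, and Snappy and ZSTD |
| compression) is only an example, not a recommended configuration: |
| |
| .. code-block:: shell |
| |
| # Run from an out-of-source build directory under cpp/ |
| cmake .. -DARROW_PARQUET=ON \ |
| -DARROW_CSV=ON \ |
| -DARROW_WITH_SNAPPY=ON \ |
| -DARROW_WITH_ZSTD=ON |
| make |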
| |
| Some features of the core Arrow shared library can be switched off for improved |
| build times if they are not required for your application: |
| |
| * ``-DARROW_IPC=ON``: build the IPC extensions |
| |
| Optional Targets |
| ~~~~~~~~~~~~~~~~ |
| |
| For development builds, you will often want to enable additional targets in |
| order to exercise your changes, using the following ``cmake`` options. |
| |
| * ``-DARROW_BUILD_BENCHMARKS=ON``: Build executable benchmarks. |
| * ``-DARROW_BUILD_EXAMPLES=ON``: Build examples of using the Arrow C++ API. |
| * ``-DARROW_BUILD_INTEGRATION=ON``: Build additional executables that are |
| used to exercise protocol interoperability between the different Arrow |
| implementations. |
| * ``-DARROW_BUILD_UTILITIES=ON``: Build executable utilities. |
| * ``-DARROW_BUILD_TESTS=ON``: Build executable unit tests. |
| * ``-DARROW_ENABLE_TIMING_TESTS=ON``: If building unit tests, enable those |
| unit tests that rely on wall-clock timing (this flag is disabled on CI |
| because it can make test results flaky). |
| * ``-DARROW_FUZZING=ON``: Build fuzz targets and related executables. |
| |
| Optional Checks |
| ~~~~~~~~~~~~~~~ |
| |
| The following special checks are available as well. They instrument the |
| generated code in various ways so as to detect select classes of problems |
| at runtime (for example when executing unit tests). |
| |
| * ``-DARROW_USE_ASAN=ON``: Enable Address Sanitizer to check for memory leaks, |
| buffer overflows or other kinds of memory management issues. |
| * ``-DARROW_USE_TSAN=ON``: Enable Thread Sanitizer to check for races in |
| multi-threaded code. |
| * ``-DARROW_USE_UBSAN=ON``: Enable Undefined Behavior Sanitizer to check for |
| situations which trigger C++ undefined behavior. |
| |
| Some of those options are mutually incompatible, so you may have to build |
| several times with different options if you want to exercise all of them. |
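| |
| One way to organize this, sketched under the assumption that your compiler |
| supports the sanitizers in question, is to keep one build directory per check. |
| The directory names are arbitrary: |
| |
| .. code-block:: shell |
| |
| # Run from the cpp directory; each sanitizer gets its own isolated build |
| mkdir asan-build tsan-build |
| (cd asan-build && \ |
| cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_USE_ASAN=ON .. && \ |
| make unittest) |
| (cd tsan-build && \ |
| cmake -DCMAKE_BUILD_TYPE=Debug -DARROW_BUILD_TESTS=ON -DARROW_USE_TSAN=ON .. && \ |
| make unittest) |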
| |
| CMake version requirements |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| While we support CMake 3.5 and higher, some features require a newer version of |
| CMake: |
| |
| * Building the benchmarks requires 3.6 or higher |
| * Building zstd from source requires 3.7 or higher |
| * Building Gandiva JNI bindings requires 3.11 or higher |
| |
| LLVM and Clang Tools |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| We are currently using LLVM 8 for library builds and for other developer tools |
| such as code formatting with ``clang-format``. LLVM can be installed via most |
| modern package managers (apt, yum, conda, Homebrew, chocolatey). |
| |
| .. _cpp-build-dependency-management: |
| |
| Build Dependency Management |
| =========================== |
| |
| The build system supports a number of third-party dependencies: |
| |
| * ``AWSSDK``: for S3 support, requires system cURL even if we use the |
| ``BUNDLED`` method described below |
| * ``benchmark``: Google benchmark, for testing |
| * ``Boost``: for cross-platform support |
| * ``Brotli``: for data compression |
| * ``BZip2``: for data compression |
| * ``c-ares``: a dependency of gRPC |
| * ``gflags``: for command line utilities (formerly Googleflags) |
| * ``GLOG``: for logging |
| * ``gRPC``: for remote procedure calls |
| * ``GTest``: Googletest, for testing |
| * ``LLVM``: a dependency of Gandiva |
| * ``Lz4``: for data compression |
| * ``ORC``: for Apache ORC format support |
| * ``re2``: for compute kernels and Gandiva, a dependency of gRPC |
| * ``Protobuf``: Google Protocol Buffers, for data serialization |
| * ``RapidJSON``: for data serialization |
| * ``Snappy``: for data compression |
| * ``Thrift``: Apache Thrift, for data serialization |
| * ``utf8proc``: for compute kernels |
| * ``ZLIB``: for data compression |
| * ``zstd``: for data compression |
| |
| The CMake option ``ARROW_DEPENDENCY_SOURCE`` is a global option that instructs |
| the build system how to resolve each dependency. There are a few options: |
| |
| * ``AUTO``: Try to find the package in the system default locations and build |
| it from source if not found |
| * ``BUNDLED``: Build the dependency automatically from source |
| * ``SYSTEM``: Find the dependency in system paths using CMake's built-in |
| ``find_package`` function, or using ``pkg-config`` for packages that do not |
| have this feature |
| * ``BREW``: Use Homebrew default paths as an alternative ``SYSTEM`` path |
| * ``CONDA``: Use ``$CONDA_PREFIX`` as an alternative ``SYSTEM`` path |
| |
| The default method is ``AUTO`` unless you are developing within an active conda |
| environment (detected by presence of the ``$CONDA_PREFIX`` environment |
| variable), in which case it is ``CONDA``. |
| |
| Individual Dependency Resolution |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| While ``-DARROW_DEPENDENCY_SOURCE=$SOURCE`` sets a global default for all |
| packages, the resolution strategy can be overridden for individual packages by |
| setting ``-D$PACKAGE_NAME_SOURCE=..``. For example, to build Protocol Buffers |
| from source, set |
| |
| .. code-block:: shell |
| |
| -DProtobuf_SOURCE=BUNDLED |
| |
| This variable is unfortunately case-sensitive; the name used for each package |
| is listed above, but the most up-to-date listing can be found in |
| `cpp/cmake_modules/ThirdpartyToolchain.cmake <https://github.com/apache/arrow/blob/master/cpp/cmake_modules/ThirdpartyToolchain.cmake>`_. |
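| |
| As an illustration, the global default and a per-package override can be |
| combined in one invocation. The choice of ``SYSTEM`` as the global default here |
| is only an example: |
| |
| .. code-block:: shell |
| |
| # Resolve dependencies from system paths, but build Protocol Buffers from source |
| cmake .. -DARROW_DEPENDENCY_SOURCE=SYSTEM \ |
| -DProtobuf_SOURCE=BUNDLED |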
| |
| Bundled Dependency Versions |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| When using the ``BUNDLED`` method to build a dependency from source, the |
| version number from ``cpp/thirdparty/versions.txt`` is used. There is also a |
| dependency source downloader script (see below), which can be used to set up |
| offline builds. |
| |
| When using ``BUNDLED`` for dependency resolution (and if you use either the |
| jemalloc or mimalloc allocators, which are recommended), statically linking the |
| Arrow libraries in a third party project is more complex. See below for |
| instructions about how to configure your build system in this case. |
| |
| Boost-related Options |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| We depend on some Boost C++ libraries for cross-platform support. In most cases, |
| the Boost version available in your package manager is likely new enough, and the |
| build system will find it automatically. If you have Boost installed in a |
| non-standard location, you can specify it by passing |
| ``-DBOOST_ROOT=$MY_BOOST_ROOT`` or setting the ``BOOST_ROOT`` environment |
| variable. |
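| |
| Either approach could look like the following sketch, where ``/opt/boost`` is a |
| hypothetical installation prefix standing in for your own: |
| |
| .. code-block:: shell |
| |
| # Option 1: pass the location as a CMake variable |
| cmake -DBOOST_ROOT=/opt/boost .. |
| |
| # Option 2: set the environment variable before configuring |
| export BOOST_ROOT=/opt/boost |
| cmake .. |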
| |
| Offline Builds |
| ~~~~~~~~~~~~~~ |
| |
| If you do not use the above variables to direct the Arrow build system to |
| preinstalled dependencies, they will be built automatically by the Arrow build |
| system. The source archive for each dependency will be downloaded via the |
| internet, which can cause issues in environments with limited access to the |
| internet. |
| |
| To enable offline builds, you can download the source artifacts yourself and |
| use environment variables of the form ``ARROW_$LIBRARY_URL`` to direct the |
| build system to read from a local file rather than accessing the internet. |
| |
| To make this easier for you, we have prepared a script |
| ``thirdparty/download_dependencies.sh`` which will download the correct version |
| of each dependency to a directory of your choosing. It will print a list of |
| bash-style environment variable statements at the end to use for your build |
| script. |
| |
| .. code-block:: shell |
| |
| # Download tarballs into $HOME/arrow-thirdparty |
| $ ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty |
| |
| You can then invoke CMake to create the build directory, and it will use the |
| declared environment variables pointing to the downloaded archives instead of |
| downloading them again for each build directory. |
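| |
| Put together, an offline workflow might look roughly like the following. The |
| variable name ``ARROW_ZLIB_URL`` is assumed from the ``ARROW_$LIBRARY_URL`` |
| pattern, and the archive file name is a placeholder; use the exact statements |
| printed by the script instead: |
| |
| .. code-block:: shell |
| |
| # From the cpp directory, on a machine with internet access: fetch the archives |
| ./thirdparty/download_dependencies.sh $HOME/arrow-thirdparty |
| |
| # Later, in the build environment: declare the variables printed by the script |
| export ARROW_ZLIB_URL=$HOME/arrow-thirdparty/zlib.tar.gz # placeholder file name |
| |
| # Configure as usual; declared archives are read from disk |
| mkdir offline-build |
| cd offline-build |
| cmake .. |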
| |
| Statically Linking |
| ~~~~~~~~~~~~~~~~~~ |
| |
| When ``-DARROW_BUILD_STATIC=ON`` is set, all build dependencies built as static |
| libraries by the Arrow build system will be merged together to create a static |
| library ``arrow_bundled_dependencies``. In UNIX-like environments (Linux, macOS, |
| MinGW), this is called ``libarrow_bundled_dependencies.a``, and on Windows with |
| Visual Studio it is called ``arrow_bundled_dependencies.lib``. This "dependency bundle" |
| library is installed in the same place as the other Arrow static libraries. |
| |
| If you are using CMake, the bundled dependencies will automatically be included |
| when linking if you use the ``arrow_static`` CMake target. In other build |
| systems, you may need to explicitly link to the dependency bundle. We created |
| an `example CMake-based build configuration |
| <https://github.com/apache/arrow/tree/master/cpp/examples/minimal_build>`_ to |
| show you a working example. |
| |
| On Linux and macOS, if your application does not link to the ``pthread`` |
| library already, you must include ``-pthread`` in your linker setup. In CMake |
| this can be accomplished with the ``Threads`` built-in package: |
| |
| .. code-block:: cmake |
| |
| set(THREADS_PREFER_PTHREAD_FLAG ON) |
| find_package(Threads REQUIRED) |
| target_link_libraries(my_target PRIVATE Threads::Threads) |
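| |
| For a build system other than CMake, explicit linking against the dependency |
| bundle could look roughly like this on Linux. This is a sketch under |
| assumptions: ``$ARROW_HOME`` stands for your Arrow installation prefix, |
| ``my_app.cc`` is a hypothetical source file, the library directory may be |
| ``lib64`` on some systems, and further system libraries may be needed depending |
| on the components you enabled: |
| |
| .. code-block:: shell |
| |
| # Link the Arrow static library first, then the dependency bundle, then pthread |
| c++ -std=c++11 my_app.cc -o my_app \ |
| -I"$ARROW_HOME/include" \ |
| "$ARROW_HOME/lib/libarrow.a" \ |
| "$ARROW_HOME/lib/libarrow_bundled_dependencies.a" \ |
| -pthread |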
| |
| Extra debugging help |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| If you use the CMake option ``-DARROW_EXTRA_ERROR_CONTEXT=ON`` it will compile |
| the libraries with extra debugging information on error checks inside the |
| ``RETURN_NOT_OK`` macro. In unit tests with ``ASSERT_OK``, this will yield error |
| outputs like: |
| |
| .. code-block:: shell |
| |
| ../src/arrow/ipc/ipc-read-write-test.cc:609: Failure |
| Failed |
| ../src/arrow/ipc/metadata-internal.cc:508 code: TypeToFlatbuffer(fbb, *field.type(), &children, &layout, &type_enum, dictionary_memo, &type_offset) |
| ../src/arrow/ipc/metadata-internal.cc:598 code: FieldToFlatbuffer(fbb, *schema.field(i), dictionary_memo, &offset) |
| ../src/arrow/ipc/metadata-internal.cc:651 code: SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema) |
| ../src/arrow/ipc/writer.cc:697 code: WriteSchemaMessage(schema_, dictionary_memo_, &schema_fb) |
| ../src/arrow/ipc/writer.cc:730 code: WriteSchema() |
| ../src/arrow/ipc/writer.cc:755 code: schema_writer.Write(&dictionaries_) |
| ../src/arrow/ipc/writer.cc:778 code: CheckStarted() |
| ../src/arrow/ipc/ipc-read-write-test.cc:574 code: writer->WriteRecordBatch(batch) |
| NotImplemented: Unable to convert type: decimal(19, 4) |
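| |
| To obtain output of this kind, a debug build with unit tests could be |
| configured along these lines (a sketch that combines the option with flags |
| described earlier in this document): |
| |
| .. code-block:: shell |
| |
| cmake -DCMAKE_BUILD_TYPE=Debug \ |
| -DARROW_BUILD_TESTS=ON \ |
| -DARROW_EXTRA_ERROR_CONTEXT=ON .. |
| make unittest |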
| |
| Deprecations and API Changes |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| We use the compiler definition ``ARROW_NO_DEPRECATED_API`` to disable APIs that |
| have been deprecated. It is a good practice to compile third party applications |
| with this flag to proactively catch and account for API changes. |
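| |
| How the definition is passed depends on your own build setup. Two possible |
| ways are sketched below; ``my_app.cc`` and ``$ARROW_HOME`` (the Arrow |
| installation prefix) are hypothetical names: |
| |
| .. code-block:: shell |
| |
| # In a CMake-based application: add the definition to the C++ compiler flags |
| cmake -DCMAKE_CXX_FLAGS="-DARROW_NO_DEPRECATED_API" .. |
| |
| # When invoking the compiler directly |
| c++ -std=c++11 -DARROW_NO_DEPRECATED_API -I"$ARROW_HOME/include" -c my_app.cc |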
| |
| Modular Build Targets |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| Since there are several major parts of the C++ project, we have provided |
| modular CMake targets for building each library component, group of unit tests |
| and benchmarks, and their dependencies: |
| |
| * ``make arrow`` for Arrow core libraries |
| * ``make parquet`` for Parquet libraries |
| * ``make gandiva`` for Gandiva (LLVM expression compiler) libraries |
| * ``make plasma`` for Plasma libraries and server |
| |
| .. note:: |
| If you have selected Ninja as CMake generator, replace ``make arrow`` with |
| ``ninja arrow``, and so on. |
| |
| To build the unit tests or benchmarks, add ``-tests`` or ``-benchmarks`` |
| to the target name. So ``make arrow-tests`` will build the Arrow core unit |
| tests. Using the ``-all`` target, e.g. ``parquet-all``, will build everything. |
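| |
| As a short sketch of these naming rules, assuming the Parquet component and |
| the unit tests were enabled at configure time: |
| |
| .. code-block:: shell |
| |
| cmake .. -DARROW_PARQUET=ON -DARROW_BUILD_TESTS=ON |
| # Build only the Arrow core unit tests and what they depend on |
| make arrow-tests |
| # Build everything belonging to the Parquet component |
| make parquet-all |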
| |
| If you wish to only build and install one or more project subcomponents, we |
| have provided the CMake option ``ARROW_OPTIONAL_INSTALL`` to only install |
| targets that have been built. For example, if you only wish to build the |
| Parquet libraries, its tests, and its dependencies, you can run: |
| |
| .. code-block:: shell |
| |
| cmake .. -DARROW_PARQUET=ON \ |
| -DARROW_OPTIONAL_INSTALL=ON \ |
| -DARROW_BUILD_TESTS=ON |
| make parquet |
| make install |
| |
| If you omit an explicit target when invoking ``make``, all targets will be |
| built. |
| |
| Debugging with Xcode on macOS |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Xcode is the IDE provided with macOS and can be used to develop and debug Arrow |
| by generating an Xcode project: |
| |
| .. code-block:: shell |
| |
| cd cpp |
| mkdir xcode-build |
| cd xcode-build |
| cmake .. -G Xcode -DARROW_BUILD_TESTS=ON -DCMAKE_BUILD_TYPE=DEBUG |
| open arrow.xcodeproj |
| |
| This will generate a project and open it in the Xcode app. As an alternative, |
| the command ``xcodebuild`` will perform a command-line build using the |
| generated project. It is recommended to use the "Automatically Create Schemes" |
| option when first launching the project. Selecting an auto-generated scheme |
| will allow you to build and run a unit test with breakpoints enabled. |
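| |
| A command-line build of the generated project might look like the sketch |
| below, run from the ``xcode-build`` directory. The choice of the ``Debug`` |
| configuration is an assumption matching the debug-oriented setup above: |
| |
| .. code-block:: shell |
| |
| # List the targets, configurations and schemes in the generated project |
| xcodebuild -list -project arrow.xcodeproj |
| # Build the project using the Debug configuration |
| xcodebuild -project arrow.xcodeproj -configuration Debug |
| # Generator-independent alternative provided by CMake |
| cmake --build . --config Debug |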