.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
.. currentmodule:: pyarrow
.. highlight:: console
.. _python-development:
==================
Python Development
==================
This page provides general Python development guidelines and source build
instructions for all platforms.
Coding Style
============
We follow a similar PEP8-like coding style to the `pandas project
<https://github.com/pandas-dev/pandas>`_. To check style issues, use the
:ref:`Archery <archery>` subcommand ``lint``:
.. code-block::
$ pip install -e "arrow/dev/archery[lint]"
.. code-block::
$ archery lint --python
Some of the issues can be automatically fixed by passing the ``--fix`` option:
.. code-block::
$ archery lint --python --fix
.. _python-unit-testing:
Unit Testing
============
We are using `pytest <https://docs.pytest.org/en/latest/>`_ to develop our unit
test suite. After building the project (see below) you can run its unit tests
like so:
.. code-block::
$ pushd arrow/python
$ python -m pytest pyarrow
$ popd
Package requirements to run the unit tests are found in
``requirements-test.txt`` and can be installed if needed with ``pip install -r
requirements-test.txt``.
If you get import errors for ``pyarrow._lib`` or another PyArrow module when
trying to run the tests, run ``python -m pytest arrow/python/pyarrow`` and check
if the editable version of pyarrow was installed correctly.
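A quick way to see which ``pyarrow`` actually gets imported (a sketch; run it from the same environment you use for testing):
.. code-block::
$ python -c "import pyarrow; print(pyarrow.__file__)"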
The project has a number of custom command line options for its test
suite; for example, some tests are disabled by default. To see all the
options, run
.. code-block::
$ python -m pytest pyarrow --help
and look for the "custom options" section.
.. note::
There are a few low-level tests written directly in C++. These tests are
implemented in `pyarrow/src/python_test.cc <https://github.com/apache/arrow/blob/main/python/pyarrow/src/python_test.cc>`_,
but they are also wrapped in a ``pytest``-based
`test module <https://github.com/apache/arrow/blob/main/python/pyarrow/tests/test_cpp_internals.py>`_
run automatically as part of the PyArrow test suite.
Test Groups
-----------
We have many tests that are grouped together using pytest marks. Some of these
are disabled by default. To enable a test group, pass ``--$GROUP_NAME``,
e.g. ``--parquet``. To disable a test group, prepend ``disable-``, for example
``--disable-parquet``. To run **only** the unit tests for a particular group,
prepend ``only-`` instead, for example ``--only-parquet`` (see the example
after the list below).
The test groups currently include:
* ``dataset``: Apache Arrow Dataset tests
* ``flight``: Flight RPC tests
* ``gandiva``: tests for Gandiva expression compiler (uses LLVM)
* ``hdfs``: tests that use libhdfs to access the Hadoop filesystem
* ``hypothesis``: tests that use the ``hypothesis`` module for generating
random test cases. Note that ``--hypothesis`` doesn't work due to a quirk
with pytest, so you have to pass ``--enable-hypothesis``
* ``large_memory``: Test requiring a large amount of system RAM
* ``orc``: Apache ORC tests
* ``parquet``: Apache Parquet tests
* ``plasma``: Plasma Object Store tests (deprecated since Arrow 10.0.0,
scheduled for removal in Arrow 12.0.0)
* ``s3``: Tests for Amazon S3
* ``tensorflow``: Tests that involve TensorFlow
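For example, the following invocations run the test suite with the Parquet group enabled, disabled, or run exclusively:
.. code-block::
$ python -m pytest pyarrow --parquet
$ python -m pytest pyarrow --disable-parquet
$ python -m pytest pyarrow --only-parquet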
Doctest
-------
We are using `doctest <https://docs.python.org/3/library/doctest.html>`_
to check that docstring examples are up-to-date and correct. You can
also do that locally by running:
.. code-block::
$ pushd arrow/python
$ python -m pytest --doctest-modules
$ python -m pytest --doctest-modules path/to/module.py # checking single file
$ popd
for ``.py`` files or
.. code-block::
$ pushd arrow/python
$ python -m pytest --doctest-cython
$ python -m pytest --doctest-cython path/to/module.pyx # checking single file
$ popd
for ``.pyx`` and ``.pxi`` files. In this case you will also need to
install the `pytest-cython <https://github.com/lgpage/pytest-cython>`_ plugin.
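If needed, the plugin can be installed from PyPI:
.. code-block::
$ pip install pytest-cython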
Benchmarking
------------
For running the benchmarks, see :ref:`python-benchmarks`.
.. _build_pyarrow:
Building on Linux and macOS
===========================
System Requirements
-------------------
On macOS, any modern Xcode (6.4 or higher; the current version is 13) or the
Xcode Command Line Tools (``xcode-select --install``) are sufficient.
On Linux, for this guide, we require a minimum of gcc 4.8 or clang 3.7.
You can check your version by running
.. code-block::
$ gcc --version
If the system compiler is older than gcc 4.8, it can be set to a newer version
using the ``$CC`` and ``$CXX`` environment variables:
.. code-block::
$ export CC=gcc-4.8
$ export CXX=g++-4.8
Environment Setup and Build
---------------------------
First, let's clone the Arrow git repository:
.. code-block::
$ git clone https://github.com/apache/arrow.git
Pull in the test data and setup the environment variables:
.. code-block::
$ pushd arrow
$ git submodule update --init
$ export PARQUET_TEST_DATA="${PWD}/cpp/submodules/parquet-testing/data"
$ export ARROW_TEST_DATA="${PWD}/testing/data"
$ popd
Using Conda
~~~~~~~~~~~
The `conda <https://conda.io/>`_ package manager allows installing build-time
dependencies for Arrow C++ and PyArrow as pre-built binaries, which can make
Arrow development easier and faster.
Let's create a conda environment with all the C++ build and Python dependencies
from conda-forge, targeting development for Python 3.10:
On Linux and macOS:
.. code-block::
$ conda create -y -n pyarrow-dev -c conda-forge \
--file arrow/ci/conda_env_unix.txt \
--file arrow/ci/conda_env_cpp.txt \
--file arrow/ci/conda_env_python.txt \
--file arrow/ci/conda_env_gandiva.txt \
compilers \
python=3.10 \
pandas
As of January 2019, the ``compilers`` package is needed on many Linux
distributions to use packages from conda-forge.
With this out of the way, you can now activate the conda environment
.. code-block::
$ conda activate pyarrow-dev
For Windows, see the `Building on Windows`_ section below.
We need to set some environment variables to let Arrow's build system know
about our build toolchain:
.. code-block::
$ export ARROW_HOME=$CONDA_PREFIX
Using system and bundled dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. warning::
If you installed Python using the Anaconda distribution or `Miniconda
<https://conda.io/miniconda.html>`_, you cannot currently use a
pip-based virtual environment. Please follow the conda-based development
instructions instead.
If not using conda, you must arrange for your system to provide the required
build tools and dependencies. Note that if some dependencies are absent,
the Arrow C++ build chain may still be able to download and compile them
on the fly, but this will take longer than using pre-installed binaries.
.. _python-homebrew:
On macOS, use Homebrew to install all dependencies required for
building Arrow C++:
.. code-block::
$ brew update && brew bundle --file=arrow/cpp/Brewfile
See :ref:`here <cpp-build-dependency-management>` for a list of dependencies you
may need.
On Debian/Ubuntu, you need the following minimal set of dependencies:
.. code-block::
$ sudo apt-get install build-essential cmake python3-dev
Now, let's create a Python virtual environment with all Python dependencies
in the same folder as the repositories, and a target installation folder:
.. code-block::
$ python3 -m venv pyarrow-dev
$ source ./pyarrow-dev/bin/activate
$ pip install -r arrow/python/requirements-build.txt
$ # This is the folder where we will install the Arrow libraries during
$ # development
$ mkdir dist
If your CMake version is too old on Linux, you could get a newer one via
``pip install cmake``.
We need to set some environment variables to let Arrow's build system know
about our build toolchain:
.. code-block::
$ export ARROW_HOME=$(pwd)/dist
$ export LD_LIBRARY_PATH=$(pwd)/dist/lib:$LD_LIBRARY_PATH
$ export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
Build and test
--------------
Now build the Arrow C++ libraries and install them into the directory we
created above (stored in ``$ARROW_HOME``):
.. code-block::
$ mkdir arrow/cpp/build
$ pushd arrow/cpp/build
$ cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_BUILD_TYPE=Debug \
-DARROW_BUILD_TESTS=ON \
-DARROW_COMPUTE=ON \
-DARROW_CSV=ON \
-DARROW_DATASET=ON \
-DARROW_FILESYSTEM=ON \
-DARROW_HDFS=ON \
-DARROW_JSON=ON \
-DARROW_PARQUET=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZLIB=ON \
-DARROW_WITH_ZSTD=ON \
-DPARQUET_REQUIRE_ENCRYPTION=ON \
..
$ make -j4
$ make install
$ popd
There are a number of optional components that can be switched on by passing
the corresponding CMake flag with ``ON``:
* ``ARROW_CUDA``: Support for CUDA-enabled GPUs
* ``ARROW_DATASET``: Support for Apache Arrow Dataset
* ``ARROW_FLIGHT``: Flight RPC framework
* ``ARROW_GANDIVA``: LLVM-based expression compiler
* ``ARROW_ORC``: Support for Apache ORC file format
* ``ARROW_PARQUET``: Support for Apache Parquet file format
* ``PARQUET_REQUIRE_ENCRYPTION``: Support for Parquet Modular Encryption
* ``ARROW_PLASMA``: Shared memory object store (deprecated since Arrow 10.0.0,
scheduled for removal in Arrow 12.0.0)
Anything set to ``ON`` above can also be turned off. Note that some compression
libraries are recommended for full Parquet support.
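For example, to switch ORC support on and HDFS support off in an already-configured build directory (a sketch reusing the invocation above; CMake keeps the other cached options):
.. code-block::
$ pushd arrow/cpp/build
$ cmake -DARROW_ORC=ON -DARROW_HDFS=OFF ..
$ make -j4
$ make install
$ popd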
You may choose between different kinds of C++ build types:
* ``-DCMAKE_BUILD_TYPE=Release`` (the default) produces a build with optimizations
enabled and debugging information disabled;
* ``-DCMAKE_BUILD_TYPE=Debug`` produces a build with optimizations
disabled and debugging information enabled;
* ``-DCMAKE_BUILD_TYPE=RelWithDebInfo`` produces a build with both optimizations
and debugging information enabled.
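For example, to switch an existing build directory to a build with both optimizations and debug information:
.. code-block::
$ # from inside arrow/cpp/build
$ cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..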
.. seealso::
:ref:`Building Arrow C++ <cpp-building-building>`.
If multiple versions of Python are installed in your environment, you may have
to pass additional parameters to CMake so that it can find the right
executable, headers and libraries. For example, specifying
``-DPython3_EXECUTABLE=<path/to/bin/python>`` lets CMake choose the
Python executable which you are using.
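For example, to point CMake at the interpreter of your currently activated environment (a sketch; adjust if your interpreter has a different name or path):
.. code-block::
$ # in addition to the other flags shown above
$ cmake -DPython3_EXECUTABLE=$(which python3) ..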
.. note::
On Linux systems with support for building on multiple architectures,
``make`` may install libraries in the ``lib64`` directory by default. For
this reason we recommend passing ``-DCMAKE_INSTALL_LIBDIR=lib`` because the
Python build scripts assume the library directory is ``lib``.
.. note::
If you have conda installed but are not using it to manage dependencies,
and you have trouble building the C++ library, you may need to set
``-DARROW_DEPENDENCY_SOURCE=AUTO`` or some other value (described
:ref:`here <cpp-build-dependency-management>`)
to explicitly tell CMake not to use conda.
.. note::
With older versions of CMake (<3.15) you might need to pass ``-DPYTHON_EXECUTABLE``
instead of ``-DPython3_EXECUTABLE``. See `cmake documentation <https://cmake.org/cmake/help/latest/module/FindPython3.html#artifacts-specification>`_
for more details.
For any other C++ build challenges, see :ref:`cpp-development`.
If you need to rebuild the C++ libraries because of errors during the build,
it is advisable to first delete the build folder with ``rm -rf arrow/cpp/build``.
This step is not needed if the previous build succeeded and you are only
rebuilding after pulling the latest changes from git main.
Now, build pyarrow:
.. code-block::
$ pushd arrow/python
$ export PYARROW_WITH_PARQUET=1
$ export PYARROW_WITH_DATASET=1
$ export PYARROW_PARALLEL=4
$ python setup.py build_ext --inplace
$ popd
If you built one of the optional components (in C++), you need to set the
corresponding ``PYARROW_WITH_$COMPONENT`` environment variable to 1.
Similarly, if you built with ``PARQUET_REQUIRE_ENCRYPTION`` (in C++), you
need to set the corresponding ``PYARROW_WITH_PARQUET_ENCRYPTION`` environment
variable to 1.
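For example, if you enabled ORC support and Parquet encryption in the C++ build, you would set the following before running ``setup.py`` (a sketch, assuming the build supports a corresponding ``PYARROW_WITH_ORC`` switch):
.. code-block::
$ export PYARROW_WITH_ORC=1
$ export PYARROW_WITH_PARQUET_ENCRYPTION=1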
To set the number of threads used to compile PyArrow's C++/Cython components,
set the ``PYARROW_PARALLEL`` environment variable.
If you wish to delete stale PyArrow build artifacts before rebuilding, navigate
to the ``arrow/python`` folder and run ``git clean -Xfd .``.
Now you are ready to install test dependencies and run `Unit Testing`_, as
described above.
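For example, assuming the directory layout used throughout this guide:
.. code-block::
$ pip install -r arrow/python/requirements-test.txt
$ pushd arrow/python
$ python -m pytest pyarrow
$ popd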
To build a self-contained wheel (including the Arrow and Parquet C++
libraries), pass the ``--bundle-arrow-cpp`` option:
.. code-block::
$ pip install wheel # if not installed
$ python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
--bundle-arrow-cpp bdist_wheel
.. note::
To install an editable PyArrow build run ``pip install -e . --no-build-isolation``
in the ``arrow/python`` directory.
Docker examples
~~~~~~~~~~~~~~~
If you are having difficulty building the Python library from source, take a
look at the ``python/examples/minimal_build`` directory which illustrates a
complete build and test from source both with the conda- and pip-based build
methods.
Debugging
---------
Since pyarrow depends on the Arrow C++ libraries, debugging can
frequently involve crossing between Python and C++ shared libraries.
Using gdb on Linux
~~~~~~~~~~~~~~~~~~
To debug the C++ libraries with gdb while running the Python unit
tests, first start pytest with gdb:
.. code-block::
$ gdb --args python -m pytest pyarrow/tests/test_to_run.py -k $TEST_TO_MATCH
To set a breakpoint, use the same gdb syntax that you would when
debugging a C++ program, for example:
.. code-block::
(gdb) b src/arrow/python/arrow_to_pandas.cc:1874
No source file named src/arrow/python/arrow_to_pandas.cc.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (src/arrow/python/arrow_to_pandas.cc:1874) pending.
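Once the pending breakpoint is registered, start the test run from within gdb:
.. code-block::
(gdb) run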
.. seealso::
The :ref:`GDB extension for Arrow C++ <cpp_gdb_extension>`.
.. _build_pyarrow_win:
Building on Windows
===================
Building on Windows requires one of the following compilers to be installed:
- `Build Tools for Visual Studio 2017 <https://download.visualstudio.microsoft.com/download/pr/3e542575-929e-4297-b6c6-bef34d0ee648/639c868e1219c651793aff537a1d3b77/vs_buildtools.exe>`_
- Visual Studio 2017
During the setup of Build Tools, ensure at least one Windows SDK is selected.
We bootstrap a conda environment similar to above, but skipping some of the
Linux/macOS-only packages:
First, starting from a fresh clone of Apache Arrow:
.. code-block::
$ git clone https://github.com/apache/arrow.git
.. code-block::
$ conda create -y -n pyarrow-dev -c conda-forge ^
--file arrow\ci\conda_env_cpp.txt ^
--file arrow\ci\conda_env_python.txt ^
--file arrow\ci\conda_env_gandiva.txt ^
python=3.10
$ conda activate pyarrow-dev
Now, we build and install the Arrow C++ libraries.
We set the path of the installation directory of the Arrow C++ libraries as
``ARROW_HOME``. When using a conda environment, Arrow C++ is installed in the
environment directory, whose path is saved in the
`CONDA_PREFIX <https://docs.conda.io/projects/conda-build/en/latest/user-guide/environment-variables.html#environment-variables-that-affect-the-build-process>`_
environment variable.
.. code-block::
$ set ARROW_HOME=%CONDA_PREFIX%\Library
Let's configure, build and install the Arrow C++ libraries:
.. code-block::
$ mkdir arrow\cpp\build
$ pushd arrow\cpp\build
$ cmake -G "Ninja" ^
-DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
-DCMAKE_UNITY_BUILD=ON ^
-DARROW_COMPUTE=ON ^
-DARROW_CSV=ON ^
-DARROW_CXXFLAGS="/WX /MP" ^
-DARROW_DATASET=ON ^
-DARROW_FILESYSTEM=ON ^
-DARROW_HDFS=ON ^
-DARROW_JSON=ON ^
-DARROW_PARQUET=ON ^
-DARROW_WITH_LZ4=ON ^
-DARROW_WITH_SNAPPY=ON ^
-DARROW_WITH_ZLIB=ON ^
-DARROW_WITH_ZSTD=ON ^
..
$ cmake --build . --target install --config Release
$ popd
Now, we can build pyarrow:
.. code-block::
$ pushd arrow\python
$ set PYARROW_WITH_PARQUET=1
$ set CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1
$ python setup.py build_ext --inplace
$ popd
.. note::
For building pyarrow, the environment variables defined above also need to
be set. Remember this if you want to rebuild ``pyarrow`` after your initial build.
.. note::
If you are using Conda with Python 3.9 or earlier, you must
set ``CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1``.
Then run the unit tests with:
.. code-block::
$ pushd arrow\python
$ python -m pytest pyarrow
$ popd
.. note::
With the above instructions the Arrow C++ libraries are not bundled with
the Python extension. This is recommended for development as it allows the
C++ libraries to be re-built separately.
If you are using the conda package manager, conda will ensure the Arrow C++
libraries are found. In case you are *not* using conda, you have to either:
* add the path of installed DLL libraries to ``PATH`` every time before
importing ``pyarrow``, or
* bundle the Arrow C++ libraries with ``pyarrow``.
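For example, to take the first option for the current shell session (a sketch, assuming the DLLs were installed under ``%ARROW_HOME%\bin``):
.. code-block::
$ set PATH=%ARROW_HOME%\bin;%PATH%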
If you want to bundle the Arrow C++ libraries with ``pyarrow``, set the
``PYARROW_BUNDLE_ARROW_CPP`` environment variable before building ``pyarrow``:
.. code-block::
$ set PYARROW_BUNDLE_ARROW_CPP=1
$ python setup.py build_ext --inplace
Note that bundled Arrow C++ libraries will not be automatically
updated when rebuilding Arrow C++.
Caveats
-------
The Plasma component is not supported on Windows.
Deleting stale build artifacts
==============================
When there have been changes to the structure of the Arrow C++ library or PyArrow,
a thorough cleaning is recommended as a first attempt at fixing build errors.
.. note::
It is not necessarily obvious from the error message itself that the problem is caused by stale artifacts.
An example of a build error caused by stale artifacts is ``Unknown CMake command "arrow_keep_backward_compatibility"``.
To delete stale Arrow C++ build artifacts:
.. code-block::
$ rm -rf arrow/cpp/build
To delete stale PyArrow build artifacts:
.. code-block::
$ git clean -Xfd python
If using a Conda environment, there are some build artifacts that get installed in
``$ARROW_HOME`` (aka ``$CONDA_PREFIX``). For example, ``$ARROW_HOME/lib/cmake/Arrow*``,
``$ARROW_HOME/include/arrow``, ``$ARROW_HOME/lib/libarrow*``, etc.
These files can be manually deleted. If unsure which files to erase, one approach
is to recreate the Conda environment.
Either delete the current one, and start fresh:
.. code-block::
$ conda deactivate
$ conda remove -n pyarrow-dev --all
Or, less destructively, create a new environment under a different name, for example:
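A sketch mirroring the environment creation step from the beginning of this page, with a hypothetical new name ``pyarrow-dev-2``:
.. code-block::
$ conda create -y -n pyarrow-dev-2 -c conda-forge \
--file arrow/ci/conda_env_unix.txt \
--file arrow/ci/conda_env_cpp.txt \
--file arrow/ci/conda_env_python.txt \
compilers \
python=3.10 \
pandas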
Installing Nightly Packages
===========================
.. warning::
These packages are not official releases. Use them at your own risk.
PyArrow has nightly wheels and Conda packages for testing purposes.
These may be suitable for downstream libraries in their continuous integration
setup to maintain compatibility with upcoming PyArrow features,
deprecations, and feature removals.
Install the development version of PyArrow from `arrow-nightlies
<https://anaconda.org/arrow-nightlies/pyarrow>`_ conda channel:
.. code-block:: bash
conda install -c arrow-nightlies pyarrow
Install the development version from an `alternative PyPI
<https://gemfury.com/arrow-nightlies>`_ index:
.. code-block:: bash
pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
--prefer-binary --pre pyarrow