.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
Introduction
============
We welcome and encourage contributions of all kinds, such as:
1. Tickets with issue reports or feature requests
2. Documentation improvements
3. Code, both PRs and (especially) PR reviews
In addition to submitting new PRs, we have a healthy tradition of community members reviewing each other's PRs.
Doing so is a great way to help the community and to become more familiar with Rust and the relevant codebases.
Before opening a pull request that touches PyO3 bindings, please review the
:ref:`PyO3 class mutability guidelines <ffi_pyclass_mutability>` so you can flag missing
``#[pyclass(frozen)]`` annotations during development and review.
How to develop
--------------
This assumes that you have Rust and Cargo installed. We use the workflow recommended by
`pyo3 <https://github.com/PyO3/pyo3>`_ and `maturin <https://github.com/PyO3/maturin>`_. We recommend using
`uv <https://docs.astral.sh/uv/>`_ for Python package management.

By default ``uv`` will attempt to build the ``datafusion`` Python package. For development we prefer to build it manually. This means
that when creating your virtual environment with ``uv sync`` you need to pass the additional option ``--no-install-package datafusion``,
and for ``uv run`` commands the additional option ``--no-project``.
Bootstrap:

.. code-block:: shell

    # fetch this repo
    git clone git@github.com:apache/datafusion-python.git
    # change into the checkout so the following commands run inside it
    cd datafusion-python
    # create the virtual environment
    uv sync --dev --no-install-package datafusion
    # activate the environment
    source .venv/bin/activate

The tests rely on test data in git submodules.

.. code-block:: shell

    git submodule init
    git submodule update

Whenever Rust code changes (through your own changes or via ``git pull``), rebuild and rerun the tests:

.. code-block:: shell

    # make sure you activate the venv using "source .venv/bin/activate" first
    maturin develop --uv
    python -m pytest

Running & Installing pre-commit hooks
-------------------------------------
``datafusion-python`` uses `pre-commit <https://pre-commit.com/>`_ to lint code locally, which reduces the number of commits that fail in CI due to linter errors. Using the pre-commit hooks is optional, but it helps keep PRs clean and concise.
Our pre-commit hooks can be installed by running :code:`pre-commit install`. This installs the hook script into the ``.git/hooks`` directory of your checkout, and the hooks then run each time you perform a commit. If an offending lint is found, the commit fails, which lets you fix the problem locally before pushing.
The pre-commit hooks can also be run ad hoc, without installing them, by running :code:`pre-commit run --all-files`.
Guidelines for Separating Python and Rust Code
----------------------------------------------
Version 40 of ``datafusion-python`` introduced ``python`` wrappers around the ``pyo3`` generated code to vastly improve the user experience. (See the `blog post <https://datafusion.apache.org/blog/2024/08/20/python-datafusion-40.0.0/>`_ and `pull request <https://github.com/apache/datafusion-python/pull/750>`_ for more details.)
Mostly, the ``python`` code is limited to pure wrappers with type hints and good docstrings, but there are a few cases where a wrapper does more:
1. Trivial aliases like :py:func:`~datafusion.functions.array_append` and :py:func:`~datafusion.functions.list_append`.
2. Simple type conversion, like from a ``path`` to a ``string`` of the path or from ``number`` to ``lit(number)``.
3. The additional code makes an API **much** more pythonic, like we do for :py:func:`~datafusion.functions.named_struct` (see `source code <https://github.com/apache/datafusion-python/blob/a0913c728f5f323c1eb4913e614c9d996083e274/python/datafusion/functions.py#L1040-L1046>`_).
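The three cases above can be illustrated with a minimal, self-contained sketch. It does not use the real ``datafusion`` API: ``_RawReader``, ``Reader``, ``flatten_name_pairs``, and ``flatten_pairs`` are hypothetical names invented for this example, and ``_RawReader`` merely stands in for a ``pyo3`` generated class.

.. code-block:: python

    from __future__ import annotations

    from pathlib import Path


    class _RawReader:
        """Hypothetical stand-in for a pyo3 generated class that only accepts ``str``."""

        def read(self, path: str) -> str:
            if not isinstance(path, str):
                raise TypeError("expected str")
            return f"reading {path}"


    class Reader:
        """Pure Python wrapper: type hints, a docstring, and one simple conversion."""

        def __init__(self) -> None:
            self._raw = _RawReader()

        def read(self, path: str | Path) -> str:
            """Read ``path``, given as a ``str`` or a ``pathlib.Path``."""
            # Case 2: simple type conversion lives in Python, so the binding stays minimal.
            return self._raw.read(str(path))


    def flatten_name_pairs(name_pairs: list[tuple[str, object]]) -> list[object]:
        """Turn ``[(name, value), ...]`` into the flat ``[name, value, ...]`` form.

        Case 3: accepting pairs is the more pythonic interface; the flat list is
        what a lower-level variadic call is assumed to expect.
        """
        return [item for pair in name_pairs for item in pair]


    # Case 1: a trivial alias, two public names for the same function.
    flatten_pairs = flatten_name_pairs

With this split, the lower layer only ever sees a ``str``, while callers may pass either a ``str`` or a ``pathlib.Path``.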
Update Dependencies
-------------------
To change test dependencies, edit ``pyproject.toml``. Then, or whenever you simply want to update the
installed dependencies, run

.. code-block:: shell

    uv sync --dev --no-install-package datafusion

Improving Build Speed
---------------------
The `pyo3 <https://github.com/PyO3/pyo3>`_ dependency of this project contains a ``build.rs`` file which
can cause it to rebuild frequently. You can prevent this from happening by defining a ``PYO3_CONFIG_FILE``
environment variable that points to a file with your build configuration. Whenever your build configuration
changes, such as during some major version updates, you will need to regenerate this file. This variable
should point to a fully resolved path on your build machine.
To generate this file, use the following command:

.. code-block:: shell

    PYO3_PRINT_CONFIG=1 cargo build

This will generate output that looks like the following. Copy these contents into
a file. If you place this file in your project directory with the filename ``.pyo3_build_config``, it will
be ignored by ``git``.

.. code-block::

    implementation=CPython
    version=3.9
    shared=true
    abi3=true
    lib_name=python3.12
    lib_dir=/opt/homebrew/opt/python@3.12/Frameworks/Python.framework/Versions/3.12/lib
    executable=/Users/myusername/src/datafusion-python/.venv/bin/python
    pointer_width=64
    build_flags=
    suppress_build_script_link_lines=false

Add the environment variable to your system.

.. code-block:: shell

    export PYO3_CONFIG_FILE="/Users/myusername/src/datafusion-python/.pyo3_build_config"

If you are on a Mac and use VS Code as your IDE, you will want to add these variables
to your settings. You can find the appropriate Rust flags by looking in the
``.cargo/config.toml`` file.

.. code-block::

    "rust-analyzer.cargo.extraEnv": {
        "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
        "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
    },
    "rust-analyzer.runnables.extraEnv": {
        "RUSTFLAGS": "-C link-arg=-undefined -C link-arg=dynamic_lookup",
        "PYO3_CONFIG_FILE": "/Users/myusername/src/datafusion-python/.pyo3_build_config"
    }

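Because ``PYO3_CONFIG_FILE`` must be a fully resolved path and the file has to be regenerated when the build configuration changes, a quick sanity check can be useful. The following is only a sketch: ``parse_pyo3_config`` and ``check_pyo3_config_file`` are hypothetical helpers that are not part of this project, and treating a missing ``executable`` as a sign of a stale file is an assumption.

.. code-block:: python

    from __future__ import annotations

    import os
    from pathlib import Path


    def parse_pyo3_config(text: str) -> dict[str, str]:
        """Parse ``key=value`` lines, skipping blank lines and lines without ``=``."""
        config = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or "=" not in line:
                continue
            key, _, value = line.partition("=")
            config[key] = value
        return config


    def check_pyo3_config_file(path_str: str | None) -> list[str]:
        """Return a list of problems; an empty list means all checks passed."""
        if not path_str:
            return ["PYO3_CONFIG_FILE is not set"]
        path = Path(path_str)
        problems = []
        if not path.is_absolute():
            problems.append(f"{path_str} is not a fully resolved path")
        if not path.is_file():
            problems.append(f"{path_str} does not exist")
            return problems
        # Assumption: an ``executable`` entry that no longer exists hints at a stale file.
        executable = parse_pyo3_config(path.read_text()).get("executable", "")
        if executable and not Path(executable).exists():
            problems.append(f"executable {executable} no longer exists")
        return problems


    if __name__ == "__main__":
        for problem in check_pyo3_config_file(os.environ.get("PYO3_CONFIG_FILE")):
            print(problem)

Running the script prints nothing when the variable is set, absolute, and points to a file whose ``executable`` entry still exists.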