[branch-0.8] Merge from main (#201)

* changelog (#188)

* Add Python wrapper for LogicalPlan::Sort (#196)

* Add Python wrapper for LogicalPlan::Aggregate (#195)

* Add Python wrapper for LogicalPlan::Limit (#193)

* Add Python wrapper for LogicalPlan::Filter (#192)

* Add Python wrapper for LogicalPlan::Filter

* clippy

* clippy

* Update src/expr/filter.rs

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* Add tests for recently added functionality (#199)

* Add experimental support for executing SQL with Polars and Pandas (#190)

* Run `maturin develop` instead of `cargo build` in verification script (#200)

* Implement `to_pandas()` (#197)

* Implement to_pandas()

* Update documentation

* Write unit test

* Add support for cudf as a physical execution engine (#205)

* Update README in preparation for 0.8 release (#206)

* Analyze table bindings (#204)

* method for getting the internal LogicalPlan instance

* Add explain plan method

* Add bindings for analyze table

* Add to_variant

* cargo fmt

* blake and flake formatting

* changelog (#209)

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dejan Simic <10134699+simicd@users.noreply.github.com>
Co-authored-by: Jeremy Dyer <jdye64@gmail.com>
35 files changed
tree: 67111b4fccf80c99f942c31a832edd7432deefa6
  1. .github/
  2. ci/
  3. conda/
  4. datafusion/
  5. dev/
  6. docs/
  7. examples/
  8. src/
  9. .asf.yaml
  10. .gitignore
  11. .gitmodules
  12. Cargo.lock
  13. Cargo.toml
  14. CHANGELOG.md
  15. LICENSE.txt
  16. pyproject.toml
  17. README.md
  18. requirements-310.txt
  19. requirements-37.txt
  20. requirements.in
  21. requirements.txt
README.md

DataFusion in Python

Python test Python Release Build

This is a Python library that binds to Apache Arrow in-memory query engine DataFusion.

DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.

Features

  • Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources
  • Queries are optimized using DataFusion's query optimizer
  • Execute user-defined Python code from SQL
  • Exchange data with Pandas and other DataFrame libraries that support PyArrow
  • Serialize and deserialize query plans in Substrait format
  • Experimental support for executing SQL queries against Polars, Pandas and cuDF

Comparison with other projects

Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable for your needs:

  • DuckDB is an open source, in-process analytic database. Like DataFusion, it supports very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than as a library for building such database systems.

  • Polars is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL support, nor as many extension points.

Example Usage

The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.

The Parquet file used in this example can be downloaded from the following page:

from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')

This produces the following chart:

Chart

More Examples

See examples for more information.

Executing Queries with DataFusion

Running User-Defined Python Code

Substrait Support

Executing SQL against DataFrame Libraries (Experimental)

How to install (from pip)

Pip

pip install datafusion
# or
python -m pip install datafusion

Conda

conda install -c conda-forge datafusion

You can verify the installation by running:

>>> import datafusion
>>> datafusion.__version__
'0.6.0'

How to develop

This assumes that you have rust and cargo installed. We use the workflow recommended by pyo3 and maturin.

The Maturin tools used in this workflow can be installed either via Conda or Pip. Both approaches should offer the same experience. Multiple approaches are only offered to appease developer preference. Bootstrapping for both Conda and Pip are as follows.

Bootstrap (Conda):

# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git
# create the conda environment for dev
conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev
# activate the conda environment
conda activate datafusion-dev

Bootstrap (Pip):

# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git
# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements-310.txt

The tests rely on test data in git submodules.

git submodule init
git submodule update

Whenever rust code changes (your changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest

How to update dependencies

To change test dependencies, change the requirements.in and run

# install pip-tools (this can be done only once), also consider running in venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt

To update dependencies, run with -U

python -m piptools compile -U --generate-hashes -o requirements-310.txt

More details here