This is a Python library that binds to the Apache Arrow in-memory query engine, DataFusion.
DataFusion's Python bindings can be used both as an end-user tool and as a foundation for building new systems.
Here is a comparison with similar projects that may help you understand when DataFusion might or might not be suitable for your needs:
DuckDB is an open source, in-process analytic database. Like DataFusion, it supports very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than as a library for building such database systems.
Polars is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL support, nor as many extension points.
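One such extension point is user-defined functions. The sketch below registers a scalar Python UDF that operates on Arrow arrays; it assumes the udf helper and register_udf method exported by the datafusion package, whose exact signatures may vary between releases:

import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf

# A scalar UDF receives Arrow arrays and must return an Arrow array.
def is_long_trip(distance: pa.Array) -> pa.Array:
    return pc.greater(distance, 10.0)

ctx = SessionContext()

# Wrap the Python function with its input/output types and volatility,
# then register it so SQL queries can call it by name.
# (signature assumed here: udf(func, input_types, return_type, volatility))
is_long_trip_udf = udf(is_long_trip, [pa.float64()], pa.bool_(), "stable")
ctx.register_udf(is_long_trip_udf)

Once registered, the function can be called from SQL by its registered name like any built-in function.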
The following example demonstrates running a SQL query against a Parquet file using DataFusion, storing the results in a Pandas DataFrame, and then plotting a chart.
The Parquet file used in this example can be downloaded from the following page:
from datafusion import SessionContext

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Execute SQL
df = ctx.sql("select passenger_count, count(*) "
             "from taxi "
             "where passenger_count is not null "
             "group by passenger_count "
             "order by passenger_count")

# convert to Pandas
pandas_df = df.to_pandas()

# create a chart
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
This produces the following chart:
See examples for more information.
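The same query can also be expressed through the DataFrame API instead of SQL. A minimal sketch, assuming the col helper and the aggregate(group_by, aggregations) method exposed by the datafusion package:

from datafusion import SessionContext, col
from datafusion import functions as f

ctx = SessionContext()
ctx.register_parquet('taxi', 'yellow_tripdata_2021-01.parquet')

# Group by passenger_count and count rows, mirroring the SQL example above
df = ctx.table('taxi').aggregate(
    [col('passenger_count')],
    [f.count(col('passenger_count'))],
)

pandas_df = df.to_pandas()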
pip install datafusion
# or
python -m pip install datafusion
conda install -c conda-forge datafusion
You can verify the installation by running:
>>> import datafusion
>>> datafusion.__version__
'0.6.0'
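Beyond checking the version string, a quick smoke test is to run a trivial query; this sketch assumes collect(), which returns the results as a list of Arrow record batches:

>>> from datafusion import SessionContext
>>> ctx = SessionContext()
>>> ctx.sql("select 1 + 2 as total").collect()

If the native module is installed correctly, this returns a single record batch containing the value 3.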
This assumes that you have Rust and Cargo installed. We use the workflow recommended by pyo3 and maturin.
The maturin tooling used in this workflow can be installed via either Conda or Pip. Both approaches should offer the same experience; the two options exist only to suit developer preference. Bootstrapping for Conda and Pip is as follows.
Bootstrap (Conda):
# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git

# create the conda environment for dev
conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev

# activate the conda environment
conda activate datafusion-dev
Bootstrap (Pip):
# fetch this repo
git clone git@github.com:apache/arrow-datafusion-python.git

# prepare development environment (used to build wheel / install in development)
python3 -m venv venv

# activate the venv
source venv/bin/activate

# update pip itself if necessary
python -m pip install -U pip

# install dependencies (for Python 3.8+)
python -m pip install -r requirements-310.txt
The tests rely on test data in git submodules.
git submodule init
git submodule update
Whenever Rust code changes (your changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest
To change test dependencies, edit requirements.in and run:

# install pip-tools (this only needs to be done once); consider running inside the venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt
To update dependencies, run with -U:
python -m piptools compile -U --generate-hashes -o requirements-310.txt
More details here