Apache Arrow DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
When you want to extend your Rust project with SQL support, a DataFrame API, or the ability to read and process Parquet, JSON, Avro, or CSV data, DataFusion is definitely worth checking out. DataFusion's pluggable design makes extensions at various points particularly easy to build.
DataFusion's SQL, DataFrame, and manual `PlanBuilder` APIs give users access to a sophisticated query optimizer and execution engine capable of fast, resource-efficient, parallel execution that takes optimal advantage of today's multicore hardware. Being written in Rust means DataFusion can offer both the safety of dynamic languages and the resource efficiency of a compiled language.
The DataFusion team is pleased to announce the creation of the DataFusion-Contrib GitHub organization to support and accelerate other projects. While the core DataFusion library remains under Apache governance, the contrib organization provides a more flexible testing ground for new DataFusion features and a home for DataFusion extensions. With this announcement, we are excited to introduce the following inaugural DataFusion-Contrib repositories.
This project provides Python bindings to the core Rust implementation of DataFusion, which allows users to query and process data directly from Python.
The team is focusing on exposing more features from the underlying Rust implementation of DataFusion and improving documentation.
From pip:

```shell
pip install datafusion
```

or:

```shell
python -m pip install datafusion
```
This crate provides an `ObjectStore` implementation for querying data stored in S3 or S3-compatible storage. It makes querying data that lives in S3 almost as easy as querying local files.
Once an `S3FileSystem` is registered as part of the DataFusion `ExecutionContext`, tables backed by S3 data can be registered with `ctx.register_listing_table`.
The current priority is adding Python bindings for `S3FileSystem`. After that, there will be async improvements as DataFusion adopts more async functionality, and we are looking into supporting S3 Select.
Add the following to the `Cargo.toml` of your Rust project that uses DataFusion:

```toml
datafusion-objectstore-s3 = "0.1.0"
```
Substrait is an emerging standard that provides a cross-language serialization format for relational algebra (e.g. expressions and query plans).
This crate provides a Substrait producer and consumer for DataFusion. A producer converts a DataFusion logical plan into a Substrait protobuf and a consumer does the reverse.
Examples of how to use this crate can be found here.
This crate implements Bigtable as a data source and physical executor for DataFusion queries. It currently supports both UTF-8 strings and 64-bit big-endian signed integers in Bigtable. From a SQL perspective, it supports both simple and composite row keys with the `=`, `IN`, and `BETWEEN` operators, as well as projection pushdown. The physical execution of queries is handled by this crate, while any subsequent aggregations, group bys, or joins are handled by DataFusion.
Add the following to the `Cargo.toml` of your Rust project that uses DataFusion:

```toml
datafusion-bigtable = "0.1.0"
```
This crate introduces `HadoopFileSystem` as a remote `ObjectStore`, which provides the ability to query files stored in HDFS. The fs-hdfs library is used for HDFS access.
This crate provides an e-graph-based optimization framework for DataFusion, built on the Rust `egg` library. An e-graph is a data structure that powers the equality saturation optimization technique.
As context, the optimizer framework within DataFusion is currently under review, with the objective of implementing a more strategic long-term solution that is both more efficient and simpler to develop.
Some of the benefits of using `egg` within DataFusion include declarative rewrite rules and equality saturation, which avoids the rule-ordering problems of traditional sequential rewrite passes.
This is an exciting new area for DataFusion with lots of opportunity for community involvement!
DataFusion-tui, aka `dft`, provides a feature-rich terminal application for using DataFusion. It has drawn inspiration, and several features, from `datafusion-cli`. In contrast to `datafusion-cli`, the objective of this tool is to provide a light SQL IDE experience for querying data with DataFusion. Currently implemented features include:

- Tabs for query results, `ExecutionContext` information, and logs
- `ExecutionContext` information, such as which `ExecutionConfig` settings are set
- Logs from `dft`, DataFusion, and any dependent libraries
- Custom `ObjectStore`s
- Loading `~/.datafusionrc` to enable having a local “database” available at startup
- An `ExecutionContext` with custom `TableProvider`s such as DeltaTable or BigTable

DataFusion-Stream is a new testing ground for creating a `StreamProvider` in DataFusion that will enable querying streaming data sources such as Apache Kafka. The implementation of this feature is currently being designed and is under active review. Once the design is finalized, the trait and attendant data structures will be added back to the core DataFusion crate.
This project created an initial set of Java bindings to DataFusion. The project is currently in maintenance mode and is looking for maintainers to drive future development.
If you are interested in contributing to DataFusion and learning about state-of-the-art query processing, we would love to have you join us on the journey! You can help by trying out DataFusion on some of your own data and projects and letting us know how it goes, or by contributing a PR with documentation, tests, or code. A list of open issues suitable for beginners is here.
The best way to find out about creating new extensions within DataFusion-Contrib is to reach out on the `#arrow-rust` channel of the Apache Software Foundation Slack workspace.
You can also check out our new Communication Doc for more ways to engage with the community.
Links for each DataFusion-Contrib repository are provided above if you would like to contribute to those.