DataFusion

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.

Using DataFusion as a library

DataFusion can be used as a library by adding the following to your Cargo.toml file.

[dependencies]
datafusion = "1.0.0"

Using DataFusion as a binary

DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.

Status

General

[x] SQL Parser
[x] SQL Query Planner
[x] Query Optimizer
[x] Projection push down
[x] Projection push down
[ ] Predicate push down
[x] Type coercion
[x] Parallel query execution

SQL Support

[x] Projection
[x] Selection
[x] Limit
[x] Aggregate
[x] UDFs
[x] Common math functions
[ ] Common string functions
[ ] Common date/time functions
[x] Sorting
[ ] Nested types
[ ] Lists
[ ] Subqueries
[ ] Joins

Data Sources

[x] CSV
[x] Parquet primitive types
[ ] Parquet nested types

Supported SQL

This library currently supports the following SQL constructs:

CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...'; to register a table's locations
SELECT ... FROM ... together with any expression
ALIAS to name an expression
CAST to change types, including e.g. Timestamp(Nanosecond, None)
most mathematical unary and binary expressions such as +, /, sqrt, tan, >=.
WHERE to filter
GROUP BY together with one of the following aggregations: MIN, MAX, COUNT, SUM, AVG
ORDER BY together with an expression and optional DESC