DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
DataFusion can be used as a library by adding the following to your Cargo.toml file:

    [dependencies]
    datafusion = "2.0.0"

DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.
This library currently supports the following SQL constructs:

- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's location
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- mathematical and comparison expressions such as `+`, `/`, `sqrt`, `tan`, `>=`
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC`, and also optional `NULLS FIRST` or `NULLS LAST`

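To illustrate how the `ORDER BY` modifiers combine, here is a minimal sketch in plain Rust. It is not DataFusion code: the `SortOptions` struct and the `compare` and `order_by` helpers are hypothetical names introduced for this example, and `None` stands in for SQL `NULL`. The sketch treats null placement as independent of the sort direction.

```rust
use std::cmp::Ordering;

/// Hypothetical options mirroring `ASC`/`DESC` and `NULLS FIRST`/`NULLS LAST`.
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Compare two nullable values; `None` plays the role of SQL NULL.
fn compare(a: &Option<i64>, b: &Option<i64>, opts: SortOptions) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        // Null placement is decided by `nulls_first` alone.
        (None, Some(_)) => {
            if opts.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if opts.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // Non-null values follow the requested direction.
        (Some(x), Some(y)) => {
            if opts.descending { y.cmp(x) } else { x.cmp(y) }
        }
    }
}

/// Sort a single nullable column according to the given options.
fn order_by(mut rows: Vec<Option<i64>>, opts: SortOptions) -> Vec<Option<i64>> {
    rows.sort_by(|a, b| compare(a, b, opts));
    rows
}

fn main() {
    let rows = vec![Some(3), None, Some(1), Some(2)];
    // Corresponds to: ORDER BY x DESC NULLS LAST
    let opts = SortOptions { descending: true, nulls_first: false };
    println!("{:?}", order_by(rows, opts));
}
```

Swapping `nulls_first` to `true` moves the `None` entry to the front without changing the order of the non-null values.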
DataFusion uses Arrow, and thus the Arrow type system, for query execution. The SQL types from sqlparser-rs are mapped to Arrow types according to the following table:
SQL Data Type | Arrow DataType |
---|---|
CHAR | Utf8 |
VARCHAR | Utf8 |
UUID | Not yet supported |
CLOB | Not yet supported |
BINARY | Not yet supported |
VARBINARY | Not yet supported |
DECIMAL | Float64 |
FLOAT | Float32 |
SMALLINT | Int16 |
INT | Int32 |
BIGINT | Int64 |
REAL | Float64 |
DOUBLE | Float64 |
BOOLEAN | Boolean |
DATE | Date64(DateUnit::Day) |
TIME | Time64(TimeUnit::Millisecond) |
TIMESTAMP | Date64(DateUnit::Millisecond) |
INTERVAL | Not yet supported |
REGCLASS | Not yet supported |
TEXT | Not yet supported |
BYTEA | Not yet supported |
CUSTOM | Not yet supported |
ARRAY | Not yet supported |
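
The mapping in the table can be pictured as a single `match` over the SQL type. The sketch below is an illustration only: `SqlType`, `ArrowType`, and `to_arrow` are hypothetical local names covering a subset of the rows, not the actual sqlparser-rs or Arrow definitions, and unsupported types are modelled as an error.

```rust
/// Hypothetical subset of the SQL types listed in the table.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum SqlType {
    Char,
    Varchar,
    Decimal,
    Float,
    SmallInt,
    Int,
    BigInt,
    Real,
    Double,
    Boolean,
    Uuid,
}

/// Toy stand-in for the Arrow data types used by this subset.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Utf8,
    Float32,
    Float64,
    Int16,
    Int32,
    Int64,
    Boolean,
}

/// Map a SQL type to its Arrow counterpart, following the table row by row.
fn to_arrow(sql: SqlType) -> Result<ArrowType, String> {
    match sql {
        SqlType::Char | SqlType::Varchar => Ok(ArrowType::Utf8),
        SqlType::Decimal | SqlType::Real | SqlType::Double => Ok(ArrowType::Float64),
        SqlType::Float => Ok(ArrowType::Float32),
        SqlType::SmallInt => Ok(ArrowType::Int16),
        SqlType::Int => Ok(ArrowType::Int32),
        SqlType::BigInt => Ok(ArrowType::Int64),
        SqlType::Boolean => Ok(ArrowType::Boolean),
        // Rows marked "Not yet supported" become errors in this sketch.
        SqlType::Uuid => Err(format!("{:?} is not yet supported", sql)),
    }
}

fn main() {
    for sql in &[SqlType::Int, SqlType::Decimal, SqlType::Uuid] {
        println!("{:?} -> {:?}", sql, to_arrow(*sql));
    }
}
```

Note that, per the table, `DECIMAL` maps to a floating-point type, so values may lose exactness when read through this mapping.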
This section describes how to get started developing DataFusion.

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build` to build
- `cargo fmt` to format the code
- `cargo test` to test

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- Add a new variant to `BuiltinScalarFunction`
- Add a new entry to `FromStr` with the name of the function as called by SQL
- Add a new entry to `return_type` with the expected return type of the function, given an incoming type
- Add a new entry to `signature` with the signature of the function (number and types of its arguments)
- Add a new entry to `create_physical_expr` mapping the built-in to the implementation

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- Add the implementation of an `Accumulator` and an `AggregateExpr`
- Add a new variant to `BuiltinAggregateFunction`
- Add a new entry to `FromStr` with the name of the function as called by SQL
- Add a new entry to `return_type` with the expected return type of the function, given an incoming type
- Add a new entry to `signature` with the signature of the function (number and types of its arguments)
- Add a new entry to `create_aggregate_expr` mapping the built-in to the implementation
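
The aggregate checklist above can be sketched as a toy model in plain Rust. This is not DataFusion's actual code: the `DataType` enum, the simplified `Accumulator` trait, and all function signatures here are assumptions made for illustration, and the sketch collapses `AggregateExpr` and `Accumulator` into a single trait. Only the names `BuiltinAggregateFunction`, `FromStr`, `return_type`, `signature`, and `create_aggregate_expr` come from the checklist. The scalar checklist follows the same shape, with `create_physical_expr` in place of `create_aggregate_expr`.

```rust
use std::str::FromStr;

/// Toy stand-in for Arrow's data types (hypothetical, not the real enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

/// Simplified accumulator: consumes values one at a time, then yields a result.
trait Accumulator {
    fn update(&mut self, value: f64);
    fn evaluate(&self) -> Option<f64>;
}

struct SumAccumulator {
    sum: Option<f64>,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, value: f64) {
        self.sum = Some(self.sum.unwrap_or(0.0) + value);
    }
    fn evaluate(&self) -> Option<f64> {
        self.sum
    }
}

struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl Accumulator for AvgAccumulator {
    fn update(&mut self, value: f64) {
        self.sum += value;
        self.count += 1;
    }
    fn evaluate(&self) -> Option<f64> {
        // No input rows means no result.
        if self.count == 0 { None } else { Some(self.sum / self.count as f64) }
    }
}

/// One variant per built-in aggregate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuiltinAggregateFunction {
    Sum,
    Avg,
}

/// Resolve the function from its SQL name (case-insensitive in this sketch).
impl FromStr for BuiltinAggregateFunction {
    type Err = String;
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name.to_ascii_lowercase().as_str() {
            "sum" => Ok(BuiltinAggregateFunction::Sum),
            "avg" => Ok(BuiltinAggregateFunction::Avg),
            other => Err(format!("unknown aggregate function: {}", other)),
        }
    }
}

/// Expected return type, given the incoming type.
fn return_type(fun: BuiltinAggregateFunction, input: DataType) -> DataType {
    match fun {
        BuiltinAggregateFunction::Sum => input,
        BuiltinAggregateFunction::Avg => DataType::Float64,
    }
}

/// Number of arguments and the argument types accepted.
fn signature(fun: BuiltinAggregateFunction) -> (usize, Vec<DataType>) {
    match fun {
        BuiltinAggregateFunction::Sum | BuiltinAggregateFunction::Avg => {
            (1, vec![DataType::Int64, DataType::Float64])
        }
    }
}

/// Map the built-in to its implementation.
fn create_aggregate_expr(fun: BuiltinAggregateFunction) -> Box<dyn Accumulator> {
    match fun {
        BuiltinAggregateFunction::Sum => Box::new(SumAccumulator { sum: None }),
        BuiltinAggregateFunction::Avg => Box::new(AvgAccumulator { sum: 0.0, count: 0 }),
    }
}

fn main() {
    let fun: BuiltinAggregateFunction = "avg".parse().unwrap();
    let (arity, _types) = signature(fun);
    let mut acc = create_aggregate_expr(fun);
    for v in &[1.0, 2.0, 6.0] {
        acc.update(*v);
    }
    println!(
        "{:?}: {} argument(s), returns {:?}, result {:?}",
        fun,
        arity,
        return_type(fun, DataType::Int64),
        acc.evaluate()
    );
}
```

In this model, adding a new aggregate means touching each `match` once, which is why the checklist has one item per function; the compiler's exhaustiveness check flags any `match` that was missed.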