DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.
DataFusion can be used as a library by adding the following to your Cargo.toml file:

    [dependencies]
    datafusion = "4.0.0-SNAPSHOT"

DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.
This library currently supports the following SQL constructs:

- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
- `SELECT ... FROM ...` together with any expression
- `ALIAS` to name an expression
- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
- operators and functions such as `+`, `/`, `sqrt`, `tan`, `>=`
- `WHERE` to filter
- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`

DataFusion strives to implement a subset of the PostgreSQL SQL dialect where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.
Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.
DataFusion uses Arrow, and thus the Arrow type system, for query execution. The SQL types from sqlparser-rs are mapped to Arrow types according to the following table:

SQL Data Type | Arrow DataType |
---|---|
CHAR | Utf8 |
VARCHAR | Utf8 |
UUID | Not yet supported |
CLOB | Not yet supported |
BINARY | Not yet supported |
VARBINARY | Not yet supported |
DECIMAL | Float64 |
FLOAT | Float32 |
SMALLINT | Int16 |
INT | Int32 |
BIGINT | Int64 |
REAL | Float64 |
DOUBLE | Float64 |
BOOLEAN | Boolean |
DATE | Date32 |
TIME | Time64(TimeUnit::Millisecond) |
TIMESTAMP | Date64 |
INTERVAL | Not yet supported |
REGCLASS | Not yet supported |
TEXT | Not yet supported |
BYTEA | Not yet supported |
CUSTOM | Not yet supported |
ARRAY | Not yet supported |
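
The table above can be read as a simple lookup. The following is a minimal, self-contained sketch of that lookup, not DataFusion's actual conversion code: `ArrowType` and `sql_to_arrow` are hypothetical names introduced here for illustration, and SQL types are matched by their plain name rather than through the sqlparser-rs types.

```rust
/// Toy stand-in for the Arrow data types named in the table.
/// (Hypothetical enum for illustration; it is not Arrow's own `DataType`.)
#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    Float32,
    Float64,
    Int16,
    Int32,
    Int64,
    Boolean,
    Date32,
    Time64Millisecond,
    Date64,
}

/// Look up the Arrow type for a SQL type name, following the table row by row.
/// Returns `None` for the types listed as "Not yet supported" and for any
/// name that does not appear in the table.
fn sql_to_arrow(sql_type: &str) -> Option<ArrowType> {
    match sql_type.to_ascii_uppercase().as_str() {
        "CHAR" | "VARCHAR" => Some(ArrowType::Utf8),
        "DECIMAL" | "REAL" | "DOUBLE" => Some(ArrowType::Float64),
        "FLOAT" => Some(ArrowType::Float32),
        "SMALLINT" => Some(ArrowType::Int16),
        "INT" => Some(ArrowType::Int32),
        "BIGINT" => Some(ArrowType::Int64),
        "BOOLEAN" => Some(ArrowType::Boolean),
        "DATE" => Some(ArrowType::Date32),
        "TIME" => Some(ArrowType::Time64Millisecond),
        "TIMESTAMP" => Some(ArrowType::Date64),
        _ => None,
    }
}

fn main() {
    let names = ["VARCHAR", "DECIMAL", "TIMESTAMP", "UUID"];
    for name in names.iter() {
        match sql_to_arrow(name) {
            Some(arrow) => println!("{} -> {:?}", name, arrow),
            None => println!("{} -> not yet supported", name),
        }
    }
}
```

Returning `None` for unsupported types keeps the "Not yet supported" rows explicit at the call site instead of silently falling back to some default type.
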
This section describes how you can get started with developing DataFusion.

DataFusion is written in Rust and uses the standard Rust toolchain:

- `cargo build`
- `cargo fmt` to format the code
- `cargo test` to test

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

- add a new variant to `BuiltinScalarFunction`
- add a new entry to `FromStr` with the name of the function as called by SQL
- add a new line in `return_type` with the expected return type of the function, given an incoming type
- add a new line in `signature` with the signature of the function (number and types of its arguments)
- add a new line in `create_physical_expr` mapping the built-in to the implementation
- add a new entry of the `unary_scalar_expr!` macro for the new function
- add a new entry in the `pub use expr::{}` set

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

- add the actual implementation of an `Accumulator` and `AggregateExpr`
- add a new variant to `BuiltinAggregateFunction`
- add a new entry to `FromStr` with the name of the function as called by SQL
- add a new line in `return_type` with the expected return type of the function, given an incoming type
- add a new line in `signature` with the signature of the function (number and types of its arguments)
- add a new line in `create_aggregate_expr` mapping the built-in to the implementation

The query plans represented by `LogicalPlan` nodes can be graphically rendered using Graphviz.

To do so, save the output of the `display_graphviz` function to a file:

    use std::fs::File;
    use std::io::Write;

    // Create plan somehow...
    let mut output = File::create("/tmp/plan.dot")?;
    write!(output, "{}", plan.display_graphviz())?;

Then, use the `dot` command line tool to render it into a file that can be displayed. For example, the following command creates a `/tmp/plan.pdf` file:

    dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
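
To give a feel for what a `.dot` file contains, here is a small standalone sketch that emits DOT text for a toy plan tree. It does not use DataFusion: `ToyPlan` and `to_dot` are hypothetical names made up for this illustration, and the output layout is an assumption that will differ from what `display_graphviz` actually produces.

```rust
/// Hypothetical, simplified plan tree used only for this illustration.
/// It is not DataFusion's `LogicalPlan`.
enum ToyPlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<ToyPlan> },
    Projection { columns: Vec<String>, input: Box<ToyPlan> },
}

impl ToyPlan {
    /// Human-readable label shown inside the graph node.
    fn label(&self) -> String {
        match self {
            ToyPlan::Scan { table } => format!("Scan: {}", table),
            ToyPlan::Filter { predicate, .. } => format!("Filter: {}", predicate),
            ToyPlan::Projection { columns, .. } => {
                format!("Projection: {}", columns.join(", "))
            }
        }
    }

    /// The single input of this step, if it has one.
    fn input(&self) -> Option<&ToyPlan> {
        match self {
            ToyPlan::Scan { .. } => None,
            ToyPlan::Filter { input, .. } | ToyPlan::Projection { input, .. } => {
                Some(&**input)
            }
        }
    }
}

/// Render the plan as a DOT digraph: one node per plan step and one edge
/// from each step to its input.
fn to_dot(plan: &ToyPlan) -> String {
    let mut out = String::from("digraph plan {\n");
    let mut current = Some(plan);
    let mut id = 0;
    while let Some(node) = current {
        // Escape double quotes so the label stays a valid DOT string.
        let label = node.label().replace('"', "\\\"");
        out.push_str(&format!("    n{} [shape=box, label=\"{}\"];\n", id, label));
        if node.input().is_some() {
            out.push_str(&format!("    n{} -> n{};\n", id, id + 1));
        }
        current = node.input();
        id += 1;
    }
    out.push_str("}\n");
    out
}

fn main() {
    let plan = ToyPlan::Projection {
        columns: vec!["a".to_string(), "b".to_string()],
        input: Box::new(ToyPlan::Filter {
            predicate: "a >= 1".to_string(),
            input: Box::new(ToyPlan::Scan { table: "t".to_string() }),
        }),
    };
    // Print the DOT text; it can be saved to a file and rendered with `dot`.
    print!("{}", to_dot(&plan));
}
```

The printed text can be redirected into a file and passed to the same `dot -Tpdf` command shown above.
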