DataFusion

DataFusion is an in-memory query engine that uses Apache Arrow as the memory model. It supports executing SQL queries against CSV and Parquet files as well as querying directly against in-memory data.

Using DataFusion as a library

DataFusion can be used as a library by adding the following to your Cargo.toml file.

[dependencies]
datafusion = "4.0.0-SNAPSHOT"

Using DataFusion as a binary

DataFusion includes a simple command-line interactive SQL utility. See the CLI reference for more information.

Status

General

  • [x] SQL Parser
  • [x] SQL Query Planner
  • [x] Query Optimizer
  • [x] Projection push down
  • [x] Predicate push down
  • [x] Type coercion
  • [x] Parallel query execution

SQL Support

  • [x] Projection
  • [x] Filter (WHERE)
  • [x] Filter post-aggregate (HAVING)
  • [x] Limit
  • [x] Aggregate
  • [x] UDFs (user-defined functions)
  • [x] UDAFs (user-defined aggregate functions)
  • [x] Common math functions
  • String functions
    • [x] bit_length
    • [x] btrim
    • [x] char_length
    • [x] character_length
    • [x] concat
    • [x] concat_ws
    • [x] length
    • [x] left
    • [x] lpad
    • [x] ltrim
    • [x] octet_length
    • [x] right
    • [x] rpad
    • [x] rtrim
    • [x] substr
    • [x] trim
  • Miscellaneous/Boolean functions
    • [x] nullif
  • Common date/time functions
    • [ ] Basic date functions
    • [ ] Basic time functions
    • [x] Basic timestamp functions
  • Nested functions
    • [x] Array of columns
  • [x] Sorting
  • [ ] Nested types
  • [ ] Lists
  • [x] Subqueries
  • [ ] Joins
  • [ ] Window

Data Sources

  • [x] CSV
  • [x] Parquet primitive types
  • [ ] Parquet nested types

Supported SQL

This library currently supports the following SQL constructs:

  • CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...'; to register a table's location
  • SELECT ... FROM ... together with any expression
  • AS to alias (name) an expression
  • CAST to change types, including e.g. Timestamp(Nanosecond, None)
  • most mathematical unary and binary expressions such as +, /, sqrt, tan, >=
  • WHERE to filter
  • GROUP BY together with one of the following aggregations: MIN, MAX, COUNT, SUM, AVG
  • ORDER BY together with an expression and optional ASC or DESC and also optional NULLS FIRST or NULLS LAST
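
As a rough illustration of how these constructs combine, the sketch below holds two SQL statements as Rust string constants. The table name `trips`, its columns, and the location path are hypothetical and chosen only for this example; the statements have not been checked against a particular DataFusion release.

```rust
// Hypothetical table name and path, for illustration only.
const CREATE_TABLE_SQL: &str = r"
CREATE EXTERNAL TABLE trips
STORED AS PARQUET
LOCATION '/path/to/trips'";

// Combines projection, AS, CAST, WHERE, GROUP BY with aggregates, and
// ORDER BY with explicit direction and NULL placement.
// The column names are assumptions made for this sketch.
const SUMMARY_SQL: &str = r"
SELECT passenger_count,
       MIN(fare) AS min_fare,
       AVG(CAST(fare AS DOUBLE)) AS avg_fare
FROM trips
WHERE fare >= 0
GROUP BY passenger_count
ORDER BY passenger_count DESC NULLS LAST";
```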

Supported Functions

DataFusion strives to implement a subset of the PostgreSQL SQL dialect where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

Supported Data Types

DataFusion uses Arrow, and thus the Arrow type system, for query execution. The SQL types from sqlparser-rs are mapped to Arrow types according to the following table:

| SQL Data Type | Arrow DataType                |
| ------------- | ----------------------------- |
| CHAR          | Utf8                          |
| VARCHAR       | Utf8                          |
| UUID          | Not yet supported             |
| CLOB          | Not yet supported             |
| BINARY        | Not yet supported             |
| VARBINARY     | Not yet supported             |
| DECIMAL       | Float64                       |
| FLOAT         | Float32                       |
| SMALLINT      | Int16                         |
| INT           | Int32                         |
| BIGINT        | Int64                         |
| REAL          | Float64                       |
| DOUBLE        | Float64                       |
| BOOLEAN       | Boolean                       |
| DATE          | Date32                        |
| TIME          | Time64(TimeUnit::Millisecond) |
| TIMESTAMP     | Date64                        |
| INTERVAL      | Not yet supported             |
| REGCLASS      | Not yet supported             |
| TEXT          | Not yet supported             |
| BYTEA         | Not yet supported             |
| CUSTOM        | Not yet supported             |
| ARRAY         | Not yet supported             |
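
The mapping in the table above can be restated as a small lookup function. This is only a sketch: `sql_type_to_arrow` is a hypothetical helper that works on type names as strings, and it is not DataFusion's actual conversion code, which operates on sqlparser-rs and Arrow types.

```rust
/// Hypothetical helper: returns the Arrow type name that the table lists
/// for a SQL type name, or `None` where the table says "Not yet supported".
fn sql_type_to_arrow(sql_type: &str) -> Option<&'static str> {
    match sql_type.to_ascii_uppercase().as_str() {
        "CHAR" | "VARCHAR" => Some("Utf8"),
        "DECIMAL" | "REAL" | "DOUBLE" => Some("Float64"),
        "FLOAT" => Some("Float32"),
        "SMALLINT" => Some("Int16"),
        "INT" => Some("Int32"),
        "BIGINT" => Some("Int64"),
        "BOOLEAN" => Some("Boolean"),
        "DATE" => Some("Date32"),
        "TIME" => Some("Time64(TimeUnit::Millisecond)"),
        "TIMESTAMP" => Some("Date64"),
        // UUID, CLOB, BINARY, VARBINARY, INTERVAL, REGCLASS, TEXT, BYTEA,
        // CUSTOM and ARRAY are listed as not yet supported. Unknown names
        // also fall through to `None` in this sketch.
        _ => None,
    }
}
```

Note that several SQL types share one Arrow type (for example DECIMAL, REAL and DOUBLE all map to Float64), so the mapping cannot be inverted.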

Developer's guide

This section describes how to get started developing DataFusion.

Bootstrap environment

DataFusion is written in Rust and uses the standard Rust toolchain:

  • cargo build
  • cargo fmt to format the code
  • cargo test to test
  • etc.

How to add a new scalar function

Below is a checklist of what you need to do to add a new scalar function to DataFusion:

  • Add the actual implementation of the function:
    • here for string functions
    • here for math functions
    • here for datetime functions
    • create a new module here for other functions
  • In src/physical_plan/functions, add:
    • a new variant to BuiltinScalarFunction
    • a new entry to FromStr with the name of the function as called by SQL
    • a new line in return_type with the expected return type of the function, given an incoming type
    • a new line in signature with the signature of the function (number and types of its arguments)
    • a new line in create_physical_expr mapping the built-in to the implementation
    • tests for the function.
  • In tests/sql.rs, add a new test where the function is called through SQL against well-known data and returns the expected result.
  • In src/logical_plan/expr, add:
    • a new entry of the unary_scalar_expr! macro for the new function.
  • In src/logical_plan/mod, add:
    • a new entry in the pub use expr::{} set.
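
The shape of the checklist above can be sketched with standard-library types only. Everything here is a simplified stand-in: the `DataType` and `BuiltinScalarFunction` enums are cut-down imitations of the real ones, `reverse` is a hypothetical new function, and the implementation works on plain strings rather than Arrow arrays. The `signature`, `create_physical_expr` and macro steps are left out.

```rust
use std::str::FromStr;

// Stand-in for Arrow's DataType; only the variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int32,
}

// Simplified stand-in for BuiltinScalarFunction.
#[derive(Debug, Clone, PartialEq)]
enum BuiltinScalarFunction {
    Length,
    // The newly added variant (hypothetical function).
    Reverse,
}

// The FromStr entry maps the name used in SQL to the variant.
impl FromStr for BuiltinScalarFunction {
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "length" => Ok(BuiltinScalarFunction::Length),
            "reverse" => Ok(BuiltinScalarFunction::Reverse),
            _ => Err(format!("There is no built-in function named {}", name)),
        }
    }
}

// The return_type line: the output type, given the incoming argument types.
fn return_type(
    fun: &BuiltinScalarFunction,
    arg_types: &[DataType],
) -> Result<DataType, String> {
    match (fun, arg_types) {
        (BuiltinScalarFunction::Length, [DataType::Utf8]) => Ok(DataType::Int32),
        (BuiltinScalarFunction::Reverse, [DataType::Utf8]) => Ok(DataType::Utf8),
        _ => Err(format!("{:?} does not support {:?}", fun, arg_types)),
    }
}

// The implementation itself, on a plain string instead of an Arrow array.
fn reverse(input: &str) -> String {
    input.chars().rev().collect()
}
```

Keeping name resolution, type checking and the implementation in separate places is what lets the planner reject a call with the wrong argument types before any data is processed.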

How to add a new aggregate function

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

  • Add the actual implementation of an Accumulator and AggregateExpr:
    • here for string functions
    • here for math functions
    • here for datetime functions
    • create a new module here for other functions
  • In src/physical_plan/aggregates, add:
    • a new variant to BuiltinAggregateFunction
    • a new entry to FromStr with the name of the function as called by SQL
    • a new line in return_type with the expected return type of the function, given an incoming type
    • a new line in signature with the signature of the function (number and types of its arguments)
    • a new line in create_aggregate_expr mapping the built-in to the implementation
    • tests for the function.
  • In tests/sql.rs, add a new test where the function is called through SQL against well-known data and returns the expected result.
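
To give a feel for what an accumulator does, here is a minimal sketch of an average aggregate. The `Accumulator` trait below is a simplified stand-in defined for this example, not DataFusion's actual trait: it works on `f64` slices instead of Arrow values and omits error handling. `AvgAccumulator` is likewise a hypothetical name.

```rust
// Simplified stand-in trait, defined only for this sketch.
trait Accumulator {
    /// Folds one batch of input values into the running state.
    fn update(&mut self, values: &[f64]);
    /// Folds in the partial state of another accumulator, as needed when
    /// partitions are aggregated in parallel and then combined.
    fn merge(&mut self, other: &Self);
    /// Produces the final result, or `None` if no rows were seen.
    fn evaluate(&self) -> Option<f64>;
}

// State for an average: the running sum and the number of rows.
#[derive(Debug, Default)]
struct AvgAccumulator {
    sum: f64,
    count: u64,
}

impl Accumulator for AvgAccumulator {
    fn update(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
        self.count += values.len() as u64;
    }

    fn merge(&mut self, other: &Self) {
        self.sum += other.sum;
        self.count += other.count;
    }

    fn evaluate(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.sum / self.count as f64)
        }
    }
}
```

The state keeps the sum and count separately rather than a running average, because two partial averages cannot be merged correctly without knowing how many rows each one covers.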

How to display plans graphically

The query plans represented by LogicalPlan nodes can be graphically rendered using Graphviz.

To do so, save the output of the display_graphviz function to a file:

use std::fs::File;
use std::io::Write;
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;

Then, use the dot command line tool to render it into a file that can be displayed. For example, the following command creates a /tmp/plan.pdf file:

dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
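
For readers unfamiliar with the DOT format that the dot tool consumes, the sketch below turns a tiny tree into DOT text. `PlanNode`, `write_node` and `to_dot` are hypothetical names invented for this example; this is not how display_graphviz is implemented, and its real output may be laid out differently.

```rust
// Hypothetical plan node; DataFusion's LogicalPlan is far richer.
struct PlanNode {
    label: String,
    children: Vec<PlanNode>,
}

// Appends `node` and its subtree to `out` as DOT statements and returns
// the id assigned to `node`. Ids are handed out in pre-order.
// Labels are written verbatim, so they must not contain double quotes.
fn write_node(node: &PlanNode, next_id: &mut usize, out: &mut String) -> usize {
    let id = *next_id;
    *next_id += 1;
    out.push_str(&format!("  n{} [shape=box label=\"{}\"]\n", id, node.label));
    for child in &node.children {
        let child_id = write_node(child, next_id, out);
        // One edge from the parent to each of its inputs.
        out.push_str(&format!("  n{} -> n{}\n", id, child_id));
    }
    id
}

// Wraps the node and edge statements in a `digraph` block.
fn to_dot(root: &PlanNode) -> String {
    let mut out = String::from("digraph plan {\n");
    let mut next_id = 0;
    write_node(root, &mut next_id, &mut out);
    out.push_str("}\n");
    out
}
```

Text produced this way can be written to a .dot file and rendered with the same dot command.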