HOWTOs

How to update the version of Rust used in CI tests

Make a PR to update the rust-toolchain file in the root of the repository.

Adding new functions

Implementation

Function type	Location to implement	Trait to implement	Macros to use	Example
Scalar	functions	`ScalarUDFImpl`	`make_udf_function!()` and `export_functions!()`	`advanced_udf.rs`
Nested	functions-nested	`ScalarUDFImpl`	`make_udf_expr_and_func!()`
Aggregate	functions-aggregate	`AggregateUDFImpl` and an `Accumulator`	`make_udaf_expr_and_func!()`	`advanced_udaf.rs`
Window	functions-window	`WindowUDFImpl` and a `PartitionEvaluator`	`define_udwf_and_expr!()`	`advanced_udwf.rs`
Table	functions-table	`TableFunctionImpl` and a `TableProvider`	`create_udtf_function!()`	`simple_udtf.rs`

The macros simplify boilerplate, such as ensuring that a DataFrame API compatible function is also created.
Ensure new functions are properly exported through the subproject's mod.rs or lib.rs.
Functions should preferably provide documentation via the #[user_doc(...)] attribute so that it can be included in the SQL reference documentation (see the Documentation section below).
Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime). Functions should be added to the relevant module; if a new module needs to be created, a new Rust feature should also be added so that DataFusion users can conditionally compile only the modules they need.
Aggregate functions can optionally implement a GroupsAccumulator for better performance.
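The feature gating described for new scalar function modules can be sketched as follows. This is a minimal illustration, not the layout of any DataFusion crate: the `geo` module, the `geo_expressions` feature and the `geo_distance` function are hypothetical names, and the feature would also need a matching entry under `[features]` in the crate's Cargo.toml.

```rust
// Hypothetical family module, compiled only when the (equally hypothetical)
// `geo_expressions` Cargo feature is enabled.
#[cfg(feature = "geo_expressions")]
pub mod geo {
    /// Names of the functions this family would provide.
    pub fn function_names() -> Vec<&'static str> {
        vec!["geo_distance"]
    }
}

/// Collects function names from every family that was compiled in.
pub fn all_function_names() -> Vec<&'static str> {
    // `mut` is unused when no feature is enabled, so silence that lint.
    #[allow(unused_mut)]
    let mut names: Vec<&'static str> = Vec::new();

    // This statement only exists in builds with the feature turned on.
    #[cfg(feature = "geo_expressions")]
    names.extend(geo::function_names());

    names
}
```

With the feature disabled the module is not compiled at all, so `all_function_names()` returns an empty list and users who do not need the family pay no compile-time cost for it.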

Spark compatible functions are located in a separate crate but otherwise follow the same steps, though all function types (e.g. scalar, nested, aggregate) are grouped together in a single location.

Testing

Prefer adding sqllogictest integration tests where the function is called via SQL against well known data and returns an expected result. Check the existing test files for an appropriate file to add test cases to; otherwise, create a new file. See the sqllogictest documentation for details on how to construct these tests. Ensure that edge cases and null inputs are covered in these tests.

If a behaviour cannot be tested via sqllogictest (e.g. it involves simplify(), needs to be tested in isolation from the optimizer, or requires input that is difficult to construct exactly via sqllogictest), then tests can be added as Rust unit tests in the implementation module, though these should be kept minimal where possible.
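As a rough sketch of what a minimal unit test module alongside an implementation might look like, the example below covers a null input and two edge cases. The `upper_nullable` function is a hypothetical stand-in written only for this illustration (it is not a DataFusion API), with `None` modelling a SQL NULL.

```rust
/// Hypothetical stand-in for a function kernel: upper-cases a nullable string.
/// `None` models a SQL NULL input and is propagated unchanged.
fn upper_nullable(input: Option<&str>) -> Option<String> {
    input.map(|s| s.to_uppercase())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn null_input_returns_null() {
        assert_eq!(upper_nullable(None), None);
    }

    #[test]
    fn empty_string_is_preserved() {
        assert_eq!(upper_nullable(Some("")), Some(String::new()));
    }

    #[test]
    fn non_ascii_input_is_handled() {
        // `ß` upper-cases to two characters, a typical edge case for string functions.
        assert_eq!(upper_nullable(Some("straße")), Some("STRASSE".to_string()));
    }
}
```

Cases like these would normally live in a sqllogictest file; a module of this shape is only for behaviour that cannot be reached through SQL.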

Documentation

Run the documentation update script ./dev/update_function_docs.sh, which updates the relevant markdown documents (see the documents for scalar, aggregate and window functions).

You should not manually update the markdown documents after running the script, as those manual changes will be overwritten the next time it runs.
See the GitHub issue which introduced this behaviour.

How to display plans graphically

The query plans represented by LogicalPlan nodes can be graphically rendered using Graphviz.

To do so, save the output of the display_graphviz function to a file:

use std::fs::File;
use std::io::Write;

// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz())?;

Then, use the dot command line tool to render it into a file that can be displayed. For example, the following command creates a /tmp/plan.pdf file:

dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
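The two steps above can also be driven from Rust. The following is a minimal sketch using only the standard library; the helper names `save_dot` and `dot_to_pdf_command` are hypothetical, and it assumes that the value returned by `plan.display_graphviz()` implements `Display` and that the `dot` binary is on the `PATH`.

```rust
use std::fmt::Display;
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::process::Command;

/// Writes anything that implements `Display` (for example the value returned
/// by `plan.display_graphviz()`) to `dot_path`.
fn save_dot(graph: impl Display, dot_path: &Path) -> io::Result<()> {
    let mut output = File::create(dot_path)?;
    write!(output, "{}", graph)
}

/// Builds, but does not run, the equivalent of
/// `dot -Tpdf <dot_path> -o <pdf_path>`.
fn dot_to_pdf_command(dot_path: &Path, pdf_path: &Path) -> Command {
    let mut cmd = Command::new("dot");
    cmd.arg("-Tpdf").arg(dot_path).arg("-o").arg(pdf_path);
    cmd
}
```

A caller would then run `save_dot(plan.display_graphviz(), dot_path)?` followed by `dot_to_pdf_command(dot_path, pdf_path).status()?`, checking the returned exit status before opening the PDF.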

How to format `.md` documents

We use prettier to format .md files.

You can either use npm i -g prettier to install it globally or use npx to run it as a standalone binary. Using npx requires a working node environment. Upgrading to the latest prettier is recommended (by adding --upgrade to the npm command).

$ prettier --version
2.3.0

After you've confirmed your prettier version, you can format all the .md files:

prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md

How to format `.toml` files

We use taplo to format .toml files.

To install via cargo:

cargo install taplo-cli --locked

Refer to the taplo installation documentation for other ways to install it.

$ taplo --version
taplo 0.9.0

After you've confirmed your taplo version, you can format all the .toml files:

taplo fmt

How to update protobuf/gen dependencies

For the proto and proto-common crates, the prost/tonic code is generated by running their respective ./regen.sh scripts, which in turn invoke the Rust binary located in ./gen.

This is necessary after modifying the protobuf definitions or altering the dependencies of ./gen, and requires a valid installation of protoc (see installation instructions for details).

# From repository root
# proto-common
./datafusion/proto-common/regen.sh
# proto
./datafusion/proto/regen.sh