| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # HOWTOs |
| |
| ## How to update the version of Rust used in CI tests |
| |
| Make a PR to update the [rust-toolchain] file in the root of the repository. |
| |
| [rust-toolchain]: https://github.com/apache/datafusion/blob/main/rust-toolchain.toml |
| |
| ## Adding new functions |
| |
| **Implementation** |
| |
| | Function type | Location to implement | Trait to implement | Macros to use | Example | |
| | ------------- | ------------------------- | ---------------------------------------------- | ------------------------------------------------ | -------------------- | |
| | Scalar | [functions][df-functions] | [`ScalarUDFImpl`] | `make_udf_function!()` and `export_functions!()` | [`advanced_udf.rs`] | |
| | Nested | [functions-nested] | [`ScalarUDFImpl`] | `make_udf_expr_and_func!()` | | |
| | Aggregate | [functions-aggregate] | [`AggregateUDFImpl`] and an [`Accumulator`] | `make_udaf_expr_and_func!()` | [`advanced_udaf.rs`] | |
| | Window | [functions-window] | [`WindowUDFImpl`] and a [`PartitionEvaluator`] | `define_udwf_and_expr!()` | [`advanced_udwf.rs`] | |
| | Table | [functions-table] | [`TableFunctionImpl`] and a [`TableProvider`] | `create_udtf_function!()` | [`simple_udtf.rs`] | |
| |
| - The macros reduce boilerplate, for example by also creating a DataFrame API compatible function. |
| - Ensure new functions are properly exported through the subproject's |
| `mod.rs` or `lib.rs`. |
| - Functions should preferably provide documentation via the `#[user_doc(...)]` attribute so that it |
| can be included in the SQL reference documentation (see the **Documentation** section below). |
| - Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime). |
| Functions should be added to the relevant module; if a new module needs to be created, a new [Rust feature] |
| should also be added so that DataFusion users can conditionally compile the modules as needed. |
| - Aggregate functions can optionally implement a [`GroupsAccumulator`] for better performance. |
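
The feature gating mentioned in the list above can be sketched in plain Rust. This is only an illustration, not the layout DataFusion itself uses: the `geo_expressions` feature, the `geo` module and the function names are all hypothetical, and the feature is assumed to be declared as `geo_expressions = []` under `[features]` in the crate's `Cargo.toml`.

```rust
// Hypothetical module for a new function family. It is only compiled when
// the (equally hypothetical) `geo_expressions` Cargo feature is enabled.
#[cfg(feature = "geo_expressions")]
pub mod geo {
    pub fn function_names() -> Vec<&'static str> {
        vec!["st_distance"]
    }
}

/// Lists the functions compiled into this build.
pub fn all_function_names() -> Vec<&'static str> {
    // `mut` is only needed when at least one optional module is enabled.
    #[allow(unused_mut)]
    let mut names = vec!["abs"];
    #[cfg(feature = "geo_expressions")]
    names.extend(geo::function_names());
    names
}

fn main() {
    println!("{:?}", all_function_names());
}
```

Users who do not enable the feature never compile the `geo` module, which is the point of gating each module behind its own feature.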
| |
| Spark compatible functions are [located in a separate crate][df-spark] but otherwise follow the same steps, except that all |
| function types (e.g. scalar, nested, aggregate) are grouped together in a single location. |
| |
| [df-functions]: https://github.com/apache/datafusion/tree/main/datafusion/functions |
| [functions-nested]: https://github.com/apache/datafusion/tree/main/datafusion/functions-nested |
| [functions-aggregate]: https://github.com/apache/datafusion/tree/main/datafusion/functions-aggregate |
| [functions-window]: https://github.com/apache/datafusion/tree/main/datafusion/functions-window |
| [functions-table]: https://github.com/apache/datafusion/tree/main/datafusion/functions-table |
| [df-spark]: https://github.com/apache/datafusion/tree/main/datafusion/spark |
| [`scalarudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html |
| [`aggregateudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html |
| [`accumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html |
| [`groupsaccumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html |
| [`windowudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.WindowUDFImpl.html |
| [`partitionevaluator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.PartitionEvaluator.html |
| [`tablefunctionimpl`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableFunctionImpl.html |
| [`tableprovider`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html |
| [`advanced_udf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs |
| [`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs |
| [`advanced_udwf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udwf.rs |
| [`simple_udtf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udtf.rs |
| [rust feature]: https://doc.rust-lang.org/cargo/reference/features.html |
| |
| **Testing** |
| |
| Prefer adding `sqllogictest` integration tests where the function is called via SQL against |
| well-known data and returns an expected result. Check the existing [test files][slt-test-files] for |
| an appropriate file to add test cases to; otherwise create a new file. See the |
| [`sqllogictest` documentation][slt-readme] for details on how to construct these tests. |
| Ensure edge cases and `null` inputs are covered by these tests. |
| |
| If a behaviour cannot be tested via `sqllogictest` (e.g. testing `simplify()`, behaviour that needs to be |
| tested in isolation from the optimizer, or input that is difficult to construct exactly via `sqllogictest`), |
| tests can be added as Rust unit tests in the implementation module, though these should be |
| kept minimal where possible. |
| |
| [slt-test-files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files |
| [slt-readme]: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md |
| |
| **Documentation** |
| |
| Run the documentation update script `./dev/update_function_docs.sh`, which updates the relevant |
| markdown documents [here][fn-doc-home] (see the documents for [scalar][fn-doc-scalar], |
| [aggregate][fn-doc-aggregate] and [window][fn-doc-window] functions). |
| |
| - You _should not_ manually update the markdown documents after running the script, as those manual |
| changes will be overwritten the next time the script runs. |
| - See the [GitHub issue] that introduced this behaviour. |
| |
| [fn-doc-home]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql |
| [fn-doc-scalar]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md |
| [fn-doc-aggregate]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md |
| [fn-doc-window]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/window_functions.md |
| [github issue]: https://github.com/apache/datafusion/issues/12740 |
| |
| ## How to display plans graphically |
| |
| The query plans represented by `LogicalPlan` nodes can be graphically |
| rendered using [Graphviz](https://www.graphviz.org/). |
| |
| To do so, save the output of the `display_graphviz` function to a file: |
| |
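| ```rust |
| use std::fs::File; |
| use std::io::Write; |
| |
| // Create plan somehow... |
| let mut output = File::create("/tmp/plan.dot")?; |
| // Propagate write errors instead of silently dropping the `Result` |
| write!(output, "{}", plan.display_graphviz())?; |
| ``` |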
| |
| Then, use the `dot` command line tool to render it into a file that |
| can be displayed. For example, the following command creates a |
| `/tmp/plan.pdf` file: |
| |
| ```bash |
| dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf |
| ``` |
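
To try the `dot` step without building a plan first, the file-writing step can be exercised on its own. The sketch below is self-contained and uses only the standard library. The `write_dot` helper and the hand-written graph are hypothetical stand-ins; in real use the value passed in would be `plan.display_graphviz()`, which implements `Display`.

```rust
use std::fmt::Display;
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Writes any `Display` value (for example the value returned by
/// `plan.display_graphviz()`) to `path`, propagating I/O errors.
fn write_dot(path: &Path, dot: impl Display) -> io::Result<()> {
    let mut out = BufWriter::new(File::create(path)?);
    write!(out, "{}", dot)?;
    // Flush explicitly so that buffered write errors are not lost on drop.
    out.flush()
}

fn main() -> io::Result<()> {
    // Stand-in for a real plan: a tiny hand-written DOT graph.
    let dot = "digraph plan {\n  scan -> filter -> projection;\n}\n";
    let path = std::env::temp_dir().join("plan.dot");
    write_dot(&path, dot)?;
    println!("wrote {}", path.display());
    Ok(())
}
```

The resulting file can be passed to the `dot` command in the same way as a file produced from a real plan.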
| |
| ## How to format `.md` documents |
| |
| We use [`prettier`] to format `.md` files. |
| |
| You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. |
| Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended (by adding |
| `--upgrade` to the `npm` command). |
| |
| ```bash |
| $ prettier --version |
| 2.3.0 |
| ``` |
| |
| After you've confirmed your prettier version, you can format all the `.md` files: |
| |
| ```bash |
| prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md |
| ``` |
| |
| [`prettier`]: https://prettier.io/ |
| |
| ## How to format `.toml` files |
| |
| We use [`taplo`] to format `.toml` files. |
| |
| To install via cargo: |
| |
| ```sh |
| cargo install taplo-cli --locked |
| ``` |
| |
| > Refer to the [taplo installation documentation][taplo-install] for other ways to install it. |
| |
| ```bash |
| $ taplo --version |
| taplo 0.9.0 |
| ``` |
| |
| After you've confirmed your `taplo` version, you can format all the `.toml` files: |
| |
| ```bash |
| taplo fmt |
| ``` |
| |
| [`taplo`]: https://taplo.tamasfe.dev/ |
| [taplo-install]: https://taplo.tamasfe.dev/cli/installation/binary.html |
| |
| ## How to update protobuf/gen dependencies |
| |
| For the `proto` and `proto-common` crates, the prost/tonic code is generated by running their respective `./regen.sh` scripts, |
| which in turn invoke the Rust binary located in `./gen`. |
| |
| This is necessary after modifying the protobuf definitions or altering the dependencies of `./gen`, and requires a |
| valid installation of [protoc] (see [installation instructions] for details). |
| |
| ```bash |
| # From repository root |
| # proto-common |
| ./datafusion/proto-common/regen.sh |
| # proto |
| ./datafusion/proto/regen.sh |
| ``` |
| |
| [protoc]: https://github.com/protocolbuffers/protobuf#protocol-compiler-installation |
| [installation instructions]: https://datafusion.apache.org/contributor-guide/getting_started.html#protoc-installation |