blob: eecf7f5bde6e1eb85cab3ec0e3ad5e738fabc078 [file] [log] [blame] [view]
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Crate Configuration
This section contains information on how to configure builds of DataFusion in
your Rust project. The [Configuration Settings] section lists options that
control additional aspects DataFusion's runtime behavior.
[configuration settings]: configs.md
## Using the nightly DataFusion builds
DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process)
If you would like to use or test versions of the DataFusion code which are
merged but not yet published, you can use Cargo's [support for adding
dependencies] directly to a GitHub branch:
```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main"}
```
Also it works on the package level
```toml
datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"}
```
And with features
```toml
datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] }
```
More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies)
## Optimizing Builds
Here are several suggestions to get the Rust compler to produce faster code when
compiling DataFusion. Note that these changes may increase compile time and
binary size.
### Generate Code with CPU Specific Instructions
By default, the Rust compiler produces code that runs on a wide range of CPUs,
but may not take advantage of all the features of your specific CPU (such as
certain [SIMD instructions]). This is especially true for x86_64 CPUs, where the
default target is `x86_64-unknown-linux-gnu`, which only guarantees support for
the `SSE2` instruction set. DataFusion can benefit from the more advanced
instructions in the `AVX2` and `AVX512` to speed up operations like filtering,
aggregation, and joins. To tell the Rust compiler to use these instructions, set
the `RUSTFLAGS` environment variable to specify a more specific target CPU.
We recommend setting `target-cpu` or at least `avx2`, or preferably at least
`native` (whatever the current CPU is). For example, to build and run DataFusion
with optimizations for your current CPU:
```shell
RUSTFLAGS='-C target-cpu=native' cargo run --release
```
[simd instructions]: https://en.wikipedia.org/wiki/SIMD
### Enable Link Time Optimization / Single Codegen Unit
You can potentially improve your performance by compiling DataFusion into a
single codegen unit which gives the Rust compiler more opportunity to optimize
across crate boundaries. To do so, modify your projects' `Cargo.toml` to include
`lto = true` and `codegen-units = 1` as shown below. Beware that using a single
codegen unit _significantly_ increases `--release` build times.
```toml
[profile.release]
lto = true
codegen-units = 1
```
### Alternate Allocator: `snmalloc`
You can also use [snmalloc-rs](https://crates.io/crates/snmalloc-rs) crate as
the memory allocator for DataFusion to improve performance. To do so, add the
dependency to your `Cargo.toml` as shown below.
```toml
[dependencies]
snmalloc-rs = "0.3"
```
Then, in `main.rs.` update the memory allocator with the below after your imports:
<!-- Note can't include snmalloc-rs in a runnable example, because it takes over the global allocator -->
```no-run
use datafusion::prelude::*;
#[global_allocator]
static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
Ok(())
}
```
## Enable Backtraces
By default, Datafusion returns errors as a plain text message. You can enable more verbose details about the error,
such as backtraces by enabling the `backtrace` feature to your `Cargo.toml` file like this:
```toml
datafusion = { version = "31.0.0", features = ["backtrace"]}
```
Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables)
```bash
RUST_BACKTRACE=1 ./target/debug/datafusion-cli
DataFusion CLI v31.0.0
> select row_numer() over (partition by a order by a) from (select 1 a);
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?
backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13
3: std::backtrace::Backtrace::capture
at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9
4: datafusion_common::error::DataFusionError::get_back_trace
at /datafusion/datafusion/common/src/error.rs:436:30
5: datafusion_sql::expr::function::<impl datafusion_sql::planner::SqlToRel<S>>::sql_function_to_expr
............
```
The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs`
```rust
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
let ctx = SessionContext::new();
let sql = "
select row_numer() over (partition by a order by a) from (select 1 a);
";
let _ = ctx.sql(sql).await?.collect().await?;
Ok(())
}
```
To obtain a backtrace:
```bash
cargo build --features=backtrace
RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture
running 1 test
Error: Plan("Invalid function 'row_numer'.\nDid you mean 'ROW_NUMBER'?\n\nbacktrace: 0: std::backtrace_rs::backtrace::libunwind::trace\n at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5\n 1: std::backtrace_rs::backtrace::trace_unsynchronized\n...
```
Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored
To show the backtrace in a pretty-printed format use `eprintln!("{e}");`.
```rust
#[tokio::test]
async fn test_get_backtrace_for_failed_code() -> Result<()> {
let ctx = SessionContext::new();
let sql = "select row_numer() over (partition by a order by a) from (select 1 a);";
let _ = match ctx.sql(sql).await {
Ok(result) => result.show().await?,
Err(e) => {
eprintln!("{e}");
}
};
Ok(())
}
```
Then run the test:
```bash
$ RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture
running 1 test
Error during planning: Invalid function 'row_numer'.
Did you mean 'ROW_NUMBER'?
backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
1: std::backtrace_rs::backtrace::trace_unsynchronized
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: std::backtrace::Backtrace::create
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/backtrace.rs:331:13
3: std::backtrace::Backtrace::capture
...
```