blob: 90bbc5d3bad06cbd10b573f05010b0ad4fb143ec [file] [log] [blame]
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
set -e
SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${SOURCE_DIR}/../" && pwd
TARGET_FILE="docs/source/user-guide/configs.md"
PRINT_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_config_docs"
PRINT_RUNTIME_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_runtime_config_docs"
echo "Inserting header"
cat <<'EOF' > "$TARGET_FILE"
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!---
NOTE: This file was generated by the dev/update_config_docs.sh script.
Do not edit it manually as changes will be overwritten.
Instead, edit dev/update_config_docs.sh or the docstrings in datafusion/core/src/config.rs.
-->
# Configuration Settings
DataFusion configurations control various aspects of DataFusion planning and execution
## Setting Configuration Options
### Programmatically
You can set the options programmatically via the [`ConfigOptions`] object. For
example, to configure the `datafusion.execution.target_partitions` using the API:
```rust
use datafusion::common::config::ConfigOptions;
let mut config = ConfigOptions::new();
config.execution.target_partitions = 1;
```
### Via Environment Variables
You can also set configuration options via environment variables using
[`ConfigOptions::from_env`], for example
```shell
DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 ./your_program
```
### Via SQL
You can also set configuration options via SQL using the `SET` command. For
example, to configure `datafusion.execution.target_partitions`:
```sql
SET datafusion.execution.target_partitions = '1';
```
[`ConfigOptions`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html
[`ConfigOptions::from_env`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html#method.from_env
The following configuration settings are available:
EOF
echo "Running CLI and inserting config docs table"
$PRINT_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
echo "Inserting runtime config header"
cat <<'EOF' >> "$TARGET_FILE"
# Runtime Configuration Settings
DataFusion runtime configurations can be set via SQL using the `SET` command.
For example, to configure `datafusion.runtime.memory_limit`:
```sql
SET datafusion.runtime.memory_limit = '2G';
```
The following runtime configuration settings are available:
EOF
echo "Running CLI and inserting runtime config docs table"
$PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"
cat <<'EOF' >> "$TARGET_FILE"
# Tuning Guide
## Short Queries
By default DataFusion will attempt to maximize parallelism and use all cores --
For example, if you have 32 cores, each plan will split the data into 32
partitions. However, if your data is small, the overhead of splitting the data
to enable parallelization can dominate the actual computation.
You can find out how many cores are being used via the [`EXPLAIN`] command and look
at the number of partitions in the plan.
[`EXPLAIN`]: sql/explain.md
The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
[`ListingTable`] provider will attempt to repartition. However, this
does not apply to user defined data sources and only works when DataFusion has accurate statistics.
If you know your data is small, you can set the `datafusion.execution.target_partitions`
option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.
```sql
SET datafusion.execution.target_partitions = '1';
```
[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html
## Memory-limited Queries
When executing a memory-consuming query under a tight memory limit, DataFusion
will spill intermediate results to disk.
When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
The higher the value of `datafusion.execution.target_partitions`, the less memory
is allocated to each partition, and the out-of-core execution path may trigger
more frequently, possibly slowing down execution.
Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches.
The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
can help reduce the number of subsequent spills required.
In conclusion, for queries under a very tight memory limit, it's recommended to
set `target_partitions` and `batch_size` to smaller values.
```sql
-- Query still gets parallelized, but each partition will have more memory to use
SET datafusion.execution.target_partitions = 4;
-- Smaller than the default '8192', while still keep the benefit of vectorized execution
SET datafusion.execution.batch_size = 1024;
```
[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html
## Join Queries
Currently Apache Datafusion supports the following join algorithms:
- Nested Loop Join
- Sort Merge Join
- Hash Join
- Symmetric Hash Join
- Piecewise Merge Join (experimental)
The physical planner will choose the appropriate algorithm based on the statistics + join
condition of the two tables.
# Join Algorithm Optimizer Configurations
You can modify join optimization behavior in your queries by setting specific configuration values.
Use the following command to update a configuration:
``` sql
SET datafusion.optimizer.<configuration_name>;
```
Example
``` sql
SET datafusion.optimizer.prefer_hash_join = false;
```
Adjusting the following configuration values influences how the optimizer selects the join algorithm
used to execute your SQL query:
## Join Optimizer Configurations
Adjusting the following configuration values influences how the optimizer selects the join algorithm
used to execute your SQL query.
### allow_symmetric_joins_without_pruning (bool, default = true)
Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
lack ordering or filtering.
- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution.
### prefer_hash_join (bool, default = true)
Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.
- true: favors HashJoin for faster execution when sufficient memory is available.
- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.
### enable_piecewise_merge_join (bool, default = false)
Enables the experimental Piecewise Merge Join algorithm.
- When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range
filter in the join condition.
- Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter
except for cases where it is joining two large tables (num_rows > 100,000) that are approximately
equal in size.
EOF
echo "Running prettier"
npx prettier@2.3.2 --write "$TARGET_FILE"
echo "'$TARGET_FILE' successfully updated!"