| #!/bin/bash |
| # |
| # Licensed to the Apache Software Foundation (ASF) under one |
| # or more contributor license agreements. See the NOTICE file |
| # distributed with this work for additional information |
| # regarding copyright ownership. The ASF licenses this file |
| # to you under the Apache License, Version 2.0 (the |
| # "License"); you may not use this file except in compliance |
| # with the License. You may obtain a copy of the License at |
| # |
| # http://www.apache.org/licenses/LICENSE-2.0 |
| # |
| # Unless required by applicable law or agreed to in writing, |
| # software distributed under the License is distributed on an |
| # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| # KIND, either express or implied. See the License for the |
| # specific language governing permissions and limitations |
| # under the License. |
| # |
| |
| set -e |
| |
| SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" |
| cd "${SOURCE_DIR}/../" && pwd |
| |
| TARGET_FILE="docs/source/user-guide/configs.md" |
| PRINT_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_config_docs" |
| PRINT_RUNTIME_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_runtime_config_docs" |
| |
| |
| echo "Inserting header" |
| cat <<'EOF' > "$TARGET_FILE" |
| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <!--- |
| NOTE: This file was generated by the dev/update_config_docs.sh script. |
| Do not edit it manually as changes will be overwritten. |
| Instead, edit dev/update_config_docs.sh or the docstrings in datafusion/core/src/config.rs. |
| --> |
| |
| # Configuration Settings |
| |
| DataFusion configurations control various aspects of DataFusion planning and execution |
| |
| ## Setting Configuration Options |
| |
| ### Programmatically |
| You can set the options programmatically via the [`ConfigOptions`] object. For |
| example, to configure the `datafusion.execution.target_partitions` using the API: |
| |
| ```rust |
| use datafusion::common::config::ConfigOptions; |
| let mut config = ConfigOptions::new(); |
| config.execution.target_partitions = 1; |
| ``` |
| |
| ### Via Environment Variables |
| |
| You can also set configuration options via environment variables using |
| [`ConfigOptions::from_env`], for example |
| |
| ```shell |
| DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 ./your_program |
| ``` |
| |
| ### Via SQL |
| |
| You can also set configuration options via SQL using the `SET` command. For |
| example, to configure `datafusion.execution.target_partitions`: |
| |
| ```sql |
| SET datafusion.execution.target_partitions = '1'; |
| ``` |
| |
| [`ConfigOptions`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html |
| [`ConfigOptions::from_env`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html#method.from_env |
| |
| The following configuration settings are available: |
| |
| EOF |
| |
| echo "Running CLI and inserting config docs table" |
| $PRINT_CONFIG_DOCS_COMMAND >> "$TARGET_FILE" |
| |
| echo "Inserting runtime config header" |
| cat <<'EOF' >> "$TARGET_FILE" |
| |
| # Runtime Configuration Settings |
| |
| DataFusion runtime configurations can be set via SQL using the `SET` command. |
| |
| For example, to configure `datafusion.runtime.memory_limit`: |
| |
| ```sql |
| SET datafusion.runtime.memory_limit = '2G'; |
| ``` |
| |
| The following runtime configuration settings are available: |
| |
| EOF |
| |
| echo "Running CLI and inserting runtime config docs table" |
| $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE" |
| |
| cat <<'EOF' >> "$TARGET_FILE" |
| |
| # Tuning Guide |
| |
| ## Short Queries |
| |
| By default DataFusion will attempt to maximize parallelism and use all cores -- |
| For example, if you have 32 cores, each plan will split the data into 32 |
| partitions. However, if your data is small, the overhead of splitting the data |
| to enable parallelization can dominate the actual computation. |
| |
| You can find out how many cores are being used via the [`EXPLAIN`] command and look |
| at the number of partitions in the plan. |
| |
| [`EXPLAIN`]: sql/explain.md |
| |
| The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the |
| [`ListingTable`] provider will attempt to repartition. However, this |
| does not apply to user defined data sources and only works when DataFusion has accurate statistics. |
| |
| If you know your data is small, you can set the `datafusion.execution.target_partitions` |
| option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less |
| than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether. |
| |
| ```sql |
| SET datafusion.execution.target_partitions = '1'; |
| ``` |
| |
| [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html |
| |
| ## Memory-limited Queries |
| |
| When executing a memory-consuming query under a tight memory limit, DataFusion |
| will spill intermediate results to disk. |
| |
| When the [`FairSpillPool`] is used, memory is divided evenly among partitions. |
| The higher the value of `datafusion.execution.target_partitions`, the less memory |
| is allocated to each partition, and the out-of-core execution path may trigger |
| more frequently, possibly slowing down execution. |
| |
| Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches. |
| The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting |
| can help reduce the number of subsequent spills required. |
| |
| In conclusion, for queries under a very tight memory limit, it's recommended to |
| set `target_partitions` and `batch_size` to smaller values. |
| |
| ```sql |
| -- Query still gets parallelized, but each partition will have more memory to use |
| SET datafusion.execution.target_partitions = 4; |
| -- Smaller than the default '8192', while still keep the benefit of vectorized execution |
| SET datafusion.execution.batch_size = 1024; |
| ``` |
| |
| [`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html |
| |
| ## Join Queries |
| |
| Currently Apache Datafusion supports the following join algorithms: |
| |
| - Nested Loop Join |
| - Sort Merge Join |
| - Hash Join |
| - Symmetric Hash Join |
| - Piecewise Merge Join (experimental) |
| |
| The physical planner will choose the appropriate algorithm based on the statistics + join |
| condition of the two tables. |
| |
| # Join Algorithm Optimizer Configurations |
| |
| You can modify join optimization behavior in your queries by setting specific configuration values. |
| Use the following command to update a configuration: |
| |
| ``` sql |
| SET datafusion.optimizer.<configuration_name>; |
| ``` |
| |
| Example |
| |
| ``` sql |
| SET datafusion.optimizer.prefer_hash_join = false; |
| ``` |
| |
| Adjusting the following configuration values influences how the optimizer selects the join algorithm |
| used to execute your SQL query: |
| |
| ## Join Optimizer Configurations |
| |
| Adjusting the following configuration values influences how the optimizer selects the join algorithm |
| used to execute your SQL query. |
| |
| ### allow_symmetric_joins_without_pruning (bool, default = true) |
| |
| Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs |
| lack ordering or filtering. |
| |
| - If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution. |
| |
| ### prefer_hash_join (bool, default = true) |
| |
| Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection. |
| |
| - true: favors HashJoin for faster execution when sufficient memory is available. |
| - false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed. |
| |
| ### enable_piecewise_merge_join (bool, default = false) |
| |
| Enables the experimental Piecewise Merge Join algorithm. |
| |
| - When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range |
| filter in the join condition. |
| - Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter |
| except for cases where it is joining two large tables (num_rows > 100,000) that are approximately |
| equal in size. |
| |
| EOF |
| |
| |
| echo "Running prettier" |
| npx prettier@2.3.2 --write "$TARGET_FILE" |
| |
| echo "'$TARGET_FILE' successfully updated!" |