dev/update_config_docs.sh - datafusion - Git at Google

 #!/bin/bash
 #
 # Licensed to the Apache Software Foundation (ASF) under one
 # or more contributor license agreements.  See the NOTICE file
 # distributed with this work for additional information
 # regarding copyright ownership.  The ASF licenses this file
 # to you under the Apache License, Version 2.0 (the
 # "License"); you may not use this file except in compliance
 # with the License.  You may obtain a copy of the License at
 #
 #   http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing,
 # software distributed under the License is distributed on an
 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 # KIND, either express or implied.  See the License for the
 # specific language governing permissions and limitations
 # under the License.
 #

 set -e

 SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 cd "${SOURCE_DIR}/../" && pwd

 TARGET_FILE="docs/source/user-guide/configs.md"
 PRINT_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_config_docs"
 PRINT_RUNTIME_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_runtime_config_docs"


 echo "Inserting header"
 cat <<'EOF' > "$TARGET_FILE"
 <!---
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 <!---
 NOTE: This file was generated by the dev/update_config_docs.sh script.
 Do not edit it manually as changes will be overwritten.
 Instead, edit dev/update_config_docs.sh or the docstrings in datafusion/core/src/config.rs.
 -->

 # Configuration Settings

 DataFusion configurations control various aspects of DataFusion planning and execution

 ## Setting Configuration Options

 ### Programmatically
 You can set the options programmatically via the [`ConfigOptions`] object. For
 example, to configure the `datafusion.execution.target_partitions` using the API:

 ```rust
 use datafusion::common::config::ConfigOptions;
 let mut config = ConfigOptions::new();
 config.execution.target_partitions = 1;
 ```

 ### Via Environment Variables

 You can also set configuration options via environment variables using
 [`ConfigOptions::from_env`], for example

 ```shell
 DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 ./your_program
 ```

 ### Via SQL

 You can also set configuration options via SQL using the `SET` command. For
 example, to configure `datafusion.execution.target_partitions`:

 ```sql
 SET datafusion.execution.target_partitions = '1';
 ```

 [`ConfigOptions`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html
 [`ConfigOptions::from_env`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html#method.from_env

 The following configuration settings are available:

 EOF

 echo "Running CLI and inserting config docs table"
 $PRINT_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"

 echo "Inserting runtime config header"
 cat <<'EOF' >> "$TARGET_FILE"

 # Runtime Configuration Settings

 DataFusion runtime configurations can be set via SQL using the `SET` command.

 For example, to configure `datafusion.runtime.memory_limit`:

 ```sql
 SET datafusion.runtime.memory_limit = '2G';
 ```

 The following runtime configuration settings are available:

 EOF

 echo "Running CLI and inserting runtime config docs table"
 $PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"

 cat <<'EOF' >> "$TARGET_FILE"

 # Tuning Guide

 ## Short Queries

 By default DataFusion will attempt to maximize parallelism and use all cores --
 For example, if you have 32 cores, each plan will split the data into 32
 partitions. However, if your data is small, the overhead of splitting the data
 to enable parallelization can dominate the actual computation.

 You can find out how many cores are being used via the [`EXPLAIN`] command and look
 at the number of partitions in the plan.

 [`EXPLAIN`]: sql/explain.md

 The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
 [`ListingTable`] provider will attempt to repartition. However, this
 does not apply to user defined data sources and only works when DataFusion has accurate statistics.

 If you know your data is small, you can set the `datafusion.execution.target_partitions`
 option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
 than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.

 ```sql
 SET datafusion.execution.target_partitions = '1';
 ```

 [`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html

 ## Memory-limited Queries

 When executing a memory-consuming query under a tight memory limit, DataFusion
 will spill intermediate results to disk.

 When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
 The higher the value of `datafusion.execution.target_partitions`, the less memory
 is allocated to each partition, and the out-of-core execution path may trigger
 more frequently, possibly slowing down execution.

 Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches.
 The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
 can help reduce the number of subsequent spills required.

 In conclusion, for queries under a very tight memory limit, it's recommended to
 set `target_partitions` and `batch_size` to smaller values.

 ```sql
 -- Query still gets parallelized, but each partition will have more memory to use
 SET datafusion.execution.target_partitions = 4;
 -- Smaller than the default '8192', while still keep the benefit of vectorized execution
 SET datafusion.execution.batch_size = 1024;
 ```

 [`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html

 ## Join Queries

 Currently Apache Datafusion supports the following join algorithms:

 - Nested Loop Join
 - Sort Merge Join
 - Hash Join
 - Symmetric Hash Join
 - Piecewise Merge Join (experimental)

 The physical planner will choose the appropriate algorithm based on the statistics + join
 condition of the two tables.

 # Join Algorithm Optimizer Configurations

 You can modify join optimization behavior in your queries by setting specific configuration values.
 Use the following command to update a configuration:

 ``` sql
 SET datafusion.optimizer.<configuration_name>;
 ```

 Example

 ``` sql
 SET datafusion.optimizer.prefer_hash_join = false;
 ```

 Adjusting the following configuration values influences how the optimizer selects the join algorithm
 used to execute your SQL query:

 ## Join Optimizer Configurations

 Adjusting the following configuration values influences how the optimizer selects the join algorithm
 used to execute your SQL query.

 ### allow_symmetric_joins_without_pruning (bool, default = true)

 Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
 lack ordering or filtering.

 - If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution.

 ### prefer_hash_join (bool, default = true)

 Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.

 - true: favors HashJoin for faster execution when sufficient memory is available.
 - false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.

 ### enable_piecewise_merge_join (bool, default = false)

 Enables the experimental Piecewise Merge Join algorithm.

 - When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range
   filter in the join condition.
 - Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter
   except for cases where it is joining two large tables (num_rows > 100,000) that are approximately
   equal in size.

 EOF


 echo "Running prettier"
 npx prettier@2.3.2 --write "$TARGET_FILE"

 echo "'$TARGET_FILE' successfully updated!"
	#!/bin/bash
	#
	# Licensed to the Apache Software Foundation (ASF) under one
	# or more contributor license agreements. See the NOTICE file
	# distributed with this work for additional information
	# regarding copyright ownership. The ASF licenses this file
	# to you under the Apache License, Version 2.0 (the
	# "License"); you may not use this file except in compliance
	# with the License. You may obtain a copy of the License at
	#
	# http://www.apache.org/licenses/LICENSE-2.0
	#
	# Unless required by applicable law or agreed to in writing,
	# software distributed under the License is distributed on an
	# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	# KIND, either express or implied. See the License for the
	# specific language governing permissions and limitations
	# under the License.
	#

	set -e

	SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
	cd "${SOURCE_DIR}/../" && pwd

	TARGET_FILE="docs/source/user-guide/configs.md"
	PRINT_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_config_docs"
	PRINT_RUNTIME_CONFIG_DOCS_COMMAND="cargo run --manifest-path datafusion/core/Cargo.toml --bin print_runtime_config_docs"


	echo "Inserting header"
	cat <<'EOF' > "$TARGET_FILE"
	<!---
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<!---
	NOTE: This file was generated by the dev/update_config_docs.sh script.
	Do not edit it manually as changes will be overwritten.
	Instead, edit dev/update_config_docs.sh or the docstrings in datafusion/core/src/config.rs.
	-->

	# Configuration Settings

	DataFusion configurations control various aspects of DataFusion planning and execution

	## Setting Configuration Options

	### Programmatically
	You can set the options programmatically via the [`ConfigOptions`] object. For
	example, to configure the `datafusion.execution.target_partitions` using the API:

	```rust
	use datafusion::common::config::ConfigOptions;
	let mut config = ConfigOptions::new();
	config.execution.target_partitions = 1;
	```

	### Via Environment Variables

	You can also set configuration options via environment variables using
	[`ConfigOptions::from_env`], for example

	```shell
	DATAFUSION_EXECUTION_TARGET_PARTITIONS=1 ./your_program
	```

	### Via SQL

	You can also set configuration options via SQL using the `SET` command. For
	example, to configure `datafusion.execution.target_partitions`:

	```sql
	SET datafusion.execution.target_partitions = '1';
	```

	[`ConfigOptions`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html
	[`ConfigOptions::from_env`]: https://docs.rs/datafusion/latest/datafusion/common/config/struct.ConfigOptions.html#method.from_env

	The following configuration settings are available:

	EOF

	echo "Running CLI and inserting config docs table"
	$PRINT_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"

	echo "Inserting runtime config header"
	cat <<'EOF' >> "$TARGET_FILE"

	# Runtime Configuration Settings

	DataFusion runtime configurations can be set via SQL using the `SET` command.

	For example, to configure `datafusion.runtime.memory_limit`:

	```sql
	SET datafusion.runtime.memory_limit = '2G';
	```

	The following runtime configuration settings are available:

	EOF

	echo "Running CLI and inserting runtime config docs table"
	$PRINT_RUNTIME_CONFIG_DOCS_COMMAND >> "$TARGET_FILE"

	cat <<'EOF' >> "$TARGET_FILE"

	# Tuning Guide

	## Short Queries

	By default DataFusion will attempt to maximize parallelism and use all cores --
	For example, if you have 32 cores, each plan will split the data into 32
	partitions. However, if your data is small, the overhead of splitting the data
	to enable parallelization can dominate the actual computation.

	You can find out how many cores are being used via the [`EXPLAIN`] command and look
	at the number of partitions in the plan.

	[`EXPLAIN`]: sql/explain.md

	The `datafusion.optimizer.repartition_file_min_size` option controls the minimum file size the
	[`ListingTable`] provider will attempt to repartition. However, this
	does not apply to user defined data sources and only works when DataFusion has accurate statistics.

	If you know your data is small, you can set the `datafusion.execution.target_partitions`
	option to a smaller number to reduce the overhead of repartitioning. For very small datasets (e.g. less
	than 1MB), we recommend setting `target_partitions` to 1 to avoid repartitioning altogether.

	```sql
	SET datafusion.execution.target_partitions = '1';
	```

	[`ListingTable`]: https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html

	## Memory-limited Queries

	When executing a memory-consuming query under a tight memory limit, DataFusion
	will spill intermediate results to disk.

	When the [`FairSpillPool`] is used, memory is divided evenly among partitions.
	The higher the value of `datafusion.execution.target_partitions`, the less memory
	is allocated to each partition, and the out-of-core execution path may trigger
	more frequently, possibly slowing down execution.

	Additionally, while spilling, data is read back in `datafusion.execution.batch_size` size batches.
	The larger this value, the fewer spilled sorted runs can be merged. Decreasing this setting
	can help reduce the number of subsequent spills required.

	In conclusion, for queries under a very tight memory limit, it's recommended to
	set `target_partitions` and `batch_size` to smaller values.

	```sql
	-- Query still gets parallelized, but each partition will have more memory to use
	SET datafusion.execution.target_partitions = 4;
	-- Smaller than the default '8192', while still keep the benefit of vectorized execution
	SET datafusion.execution.batch_size = 1024;
	```

	[`FairSpillPool`]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/struct.FairSpillPool.html

	## Join Queries

	Currently Apache Datafusion supports the following join algorithms:

	- Nested Loop Join
	- Sort Merge Join
	- Hash Join
	- Symmetric Hash Join
	- Piecewise Merge Join (experimental)

	The physical planner will choose the appropriate algorithm based on the statistics + join
	condition of the two tables.

	# Join Algorithm Optimizer Configurations

	You can modify join optimization behavior in your queries by setting specific configuration values.
	Use the following command to update a configuration:

	``` sql
	SET datafusion.optimizer.<configuration_name>;
	```

	Example

	``` sql
	SET datafusion.optimizer.prefer_hash_join = false;
	```

	Adjusting the following configuration values influences how the optimizer selects the join algorithm
	used to execute your SQL query:

	## Join Optimizer Configurations

	Adjusting the following configuration values influences how the optimizer selects the join algorithm
	used to execute your SQL query.

	### allow_symmetric_joins_without_pruning (bool, default = true)

	Controls whether symmetric hash joins are allowed for unbounded data sources even when their inputs
	lack ordering or filtering.

	- If disabled, the `SymmetricHashJoin` operator cannot prune its internal buffers to be produced only at the end of execution.

	### prefer_hash_join (bool, default = true)

	Determines whether the optimizer prefers Hash Join over Sort Merge Join during physical plan selection.

	- true: favors HashJoin for faster execution when sufficient memory is available.
	- false: allows SortMergeJoin to be chosen when more memory-efficient execution is needed.

	### enable_piecewise_merge_join (bool, default = false)

	Enables the experimental Piecewise Merge Join algorithm.

	- When enabled, the physical planner may select PiecewiseMergeJoin if there is exactly one range
	filter in the join condition.
	- Piecewise Merge Join is faster than Nested Loop Join performance wise for single range filter
	except for cases where it is joining two large tables (num_rows > 100,000) that are approximately
	equal in size.

	EOF


	echo "Running prettier"
	npx prettier@2.3.2 --write "$TARGET_FILE"

	echo "'$TARGET_FILE' successfully updated!"