docs/source/user-guide/configuration.rst - datafusion-python - Git at Google

 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.

 .. _configuration:

 Configuration
 =============

 Let's look at how we can configure DataFusion. When creating a :py:class:`~datafusion.context.SessionContext`, you can pass in
 a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.context.RuntimeEnvBuilder` object. These two cover a wide range of options.

 .. code-block:: python

     from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext

     # create a session context with default settings
     ctx = SessionContext()
     print(ctx)

     # create a session context with explicit runtime and config settings
     runtime = RuntimeEnvBuilder().with_disk_manager_os().with_fair_spill_pool(10000000)
     config = (
         SessionConfig()
         .with_create_default_catalog_and_schema(True)
         .with_default_catalog_and_schema("foo", "bar")
         .with_target_partitions(8)
         .with_information_schema(True)
         .with_repartition_joins(False)
         .with_repartition_aggregations(False)
         .with_repartition_windows(False)
         .with_parquet_pruning(False)
         .set("datafusion.execution.parquet.pushdown_filters", "true")
     )
     ctx = SessionContext(config, runtime)
     print(ctx)

 Maximizing CPU Usage
 --------------------

 DataFusion uses partitions to parallelize work. For small queries the
 default configuration (number of CPU cores) is often sufficient, but to
 fully utilize available hardware you can tune how many partitions are
 created and when DataFusion will repartition data automatically.

 Configure a ``SessionContext`` with a higher partition count:

 .. code-block:: python

     from datafusion import SessionConfig, SessionContext

     # allow up to 16 concurrent partitions
     config = SessionConfig().with_target_partitions(16)
     ctx = SessionContext(config)

 Automatic repartitioning for joins, aggregations, window functions and
 other operations can be enabled to increase parallelism:

 .. code-block:: python

     config = (
         SessionConfig()
         .with_target_partitions(16)
         .with_repartition_joins(True)
         .with_repartition_aggregations(True)
         .with_repartition_windows(True)
     )

 Manual repartitioning is available on DataFrames when you need precise
 control:

 .. code-block:: python

     from datafusion import col

     df = ctx.read_parquet("data.parquet")

     # Evenly divide into 16 partitions
     df = df.repartition(16)

     # Or partition by the hash of a column
     df = df.repartition_by_hash(col("a"), num=16)

     result = df.collect()


 Benchmark Example
 ^^^^^^^^^^^^^^^^^

 The repository includes a benchmark script that demonstrates how to maximize CPU usage
 with DataFusion. The :code:`benchmarks/max_cpu_usage.py` script shows a practical example
 of configuring DataFusion for optimal parallelism.

 You can run the benchmark script to see the impact of different configuration settings:

 .. code-block:: bash

     # Run with default settings (uses all CPU cores)
     python benchmarks/max_cpu_usage.py

     # Run with specific number of rows and partitions
     python benchmarks/max_cpu_usage.py --rows 5000000 --partitions 16

     # See all available options
     python benchmarks/max_cpu_usage.py --help

 Here's an example showing the performance difference between single and multiple partitions:

 .. code-block:: bash

     # Single partition - slower processing
     $ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 1
     Processed 10000000 rows using 1 partitions in 0.107s

     # Multiple partitions - faster processing
     $ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 10
     Processed 10000000 rows using 10 partitions in 0.038s

 This example demonstrates nearly 3x performance improvement (0.107s vs 0.038s) when using
 10 partitions instead of 1, showcasing how proper partitioning can significantly improve
 CPU utilization and query performance.

 The script demonstrates several key optimization techniques:

 1. **Higher target partition count**: Uses :code:`with_target_partitions()` to set the number of concurrent partitions
 2. **Automatic repartitioning**: Enables repartitioning for joins, aggregations, and window functions
 3. **Manual repartitioning**: Uses :code:`repartition()` to ensure all partitions are utilized
 4. **CPU-intensive operations**: Performs aggregations that can benefit from parallelization

 The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
 across the specified number of partitions. This helps you understand how partition configuration
 affects performance on your specific hardware.

 Important Considerations
 """"""""""""""""""""""""

 The provided benchmark script demonstrates partitioning concepts using synthetic in-memory data
 and simple aggregation operations. While useful for understanding basic configuration principles,
 actual performance in production environments may vary significantly based on numerous factors:

 **Data Sources and I/O Characteristics:**

 - **Table providers**: Performance differs greatly between Parquet files, CSV files, databases, and cloud storage
 - **Storage type**: Local SSD, network-attached storage, and cloud storage have vastly different characteristics
 - **Network latency**: Remote data sources introduce additional latency considerations
 - **File sizes and distribution**: Large files may benefit differently from partitioning than many small files

 **Query and Workload Characteristics:**

 - **Operation complexity**: Simple aggregations versus complex joins, window functions, or nested queries
 - **Data distribution**: Skewed data may not partition evenly, affecting parallel efficiency
 - **Memory usage**: Large datasets may require different memory management strategies
 - **Concurrent workloads**: Multiple queries running simultaneously affect resource allocation

 **Hardware and Environment Factors:**

 - **CPU architecture**: Different processors have varying parallel processing capabilities
 - **Available memory**: Limited RAM may require different optimization strategies
 - **System load**: Other applications competing for resources affect DataFusion performance

 **Recommendations for Production Use:**

 To optimize DataFusion for your specific use case, it is strongly recommended to:

 1. **Create custom benchmarks** using your actual data sources, formats, and query patterns
 2. **Test with representative data volumes** that match your production workloads
 3. **Measure end-to-end performance** including data loading, processing, and result handling
 4. **Evaluate different configuration combinations** for your specific hardware and workload
 5. **Monitor resource utilization** (CPU, memory, I/O) to identify bottlenecks in your environment

 This approach will provide more accurate insights into how DataFusion configuration options
 will impact your particular applications and infrastructure.

 For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
 and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.
	.. Licensed to the Apache Software Foundation (ASF) under one
	.. or more contributor license agreements. See the NOTICE file
	.. distributed with this work for additional information
	.. regarding copyright ownership. The ASF licenses this file
	.. to you under the Apache License, Version 2.0 (the
	.. "License"); you may not use this file except in compliance
	.. with the License. You may obtain a copy of the License at

	.. http://www.apache.org/licenses/LICENSE-2.0

	.. Unless required by applicable law or agreed to in writing,
	.. software distributed under the License is distributed on an
	.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	.. KIND, either express or implied. See the License for the
	.. specific language governing permissions and limitations
	.. under the License.

	.. _configuration:

	Configuration
	=============

	Let's look at how we can configure DataFusion. When creating a :py:class:`~datafusion.context.SessionContext`, you can pass in
	a :py:class:`~datafusion.context.SessionConfig` and :py:class:`~datafusion.context.RuntimeEnvBuilder` object. These two cover a wide range of options.

	.. code-block:: python

	from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext

	# create a session context with default settings
	ctx = SessionContext()
	print(ctx)

	# create a session context with explicit runtime and config settings
	runtime = RuntimeEnvBuilder().with_disk_manager_os().with_fair_spill_pool(10000000)
	config = (
	SessionConfig()
	.with_create_default_catalog_and_schema(True)
	.with_default_catalog_and_schema("foo", "bar")
	.with_target_partitions(8)
	.with_information_schema(True)
	.with_repartition_joins(False)
	.with_repartition_aggregations(False)
	.with_repartition_windows(False)
	.with_parquet_pruning(False)
	.set("datafusion.execution.parquet.pushdown_filters", "true")
	)
	ctx = SessionContext(config, runtime)
	print(ctx)

	Maximizing CPU Usage
	--------------------

	DataFusion uses partitions to parallelize work. For small queries the
	default configuration (number of CPU cores) is often sufficient, but to
	fully utilize available hardware you can tune how many partitions are
	created and when DataFusion will repartition data automatically.

	Configure a ``SessionContext`` with a higher partition count:

	.. code-block:: python

	from datafusion import SessionConfig, SessionContext

	# allow up to 16 concurrent partitions
	config = SessionConfig().with_target_partitions(16)
	ctx = SessionContext(config)

	Automatic repartitioning for joins, aggregations, window functions and
	other operations can be enabled to increase parallelism:

	.. code-block:: python

	config = (
	SessionConfig()
	.with_target_partitions(16)
	.with_repartition_joins(True)
	.with_repartition_aggregations(True)
	.with_repartition_windows(True)
	)

	Manual repartitioning is available on DataFrames when you need precise
	control:

	.. code-block:: python

	from datafusion import col

	df = ctx.read_parquet("data.parquet")

	# Evenly divide into 16 partitions
	df = df.repartition(16)

	# Or partition by the hash of a column
	df = df.repartition_by_hash(col("a"), num=16)

	result = df.collect()


	Benchmark Example
	^^^^^^^^^^^^^^^^^

	The repository includes a benchmark script that demonstrates how to maximize CPU usage
	with DataFusion. The :code:`benchmarks/max_cpu_usage.py` script shows a practical example
	of configuring DataFusion for optimal parallelism.

	You can run the benchmark script to see the impact of different configuration settings:

	.. code-block:: bash

	# Run with default settings (uses all CPU cores)
	python benchmarks/max_cpu_usage.py

	# Run with specific number of rows and partitions
	python benchmarks/max_cpu_usage.py --rows 5000000 --partitions 16

	# See all available options
	python benchmarks/max_cpu_usage.py --help

	Here's an example showing the performance difference between single and multiple partitions:

	.. code-block:: bash

	# Single partition - slower processing
	$ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 1
	Processed 10000000 rows using 1 partitions in 0.107s

	# Multiple partitions - faster processing
	$ python benchmarks/max_cpu_usage.py --rows=10000000 --partitions 10
	Processed 10000000 rows using 10 partitions in 0.038s

	This example demonstrates nearly 3x performance improvement (0.107s vs 0.038s) when using
	10 partitions instead of 1, showcasing how proper partitioning can significantly improve
	CPU utilization and query performance.

	The script demonstrates several key optimization techniques:

	1. Higher target partition count: Uses :code:`with_target_partitions()` to set the number of concurrent partitions
	2. Automatic repartitioning: Enables repartitioning for joins, aggregations, and window functions
	3. Manual repartitioning: Uses :code:`repartition()` to ensure all partitions are utilized
	4. CPU-intensive operations: Performs aggregations that can benefit from parallelization

	The benchmark creates synthetic data and measures the time taken to perform a sum aggregation
	across the specified number of partitions. This helps you understand how partition configuration
	affects performance on your specific hardware.

	Important Considerations
	""""""""""""""""""""""""

	The provided benchmark script demonstrates partitioning concepts using synthetic in-memory data
	and simple aggregation operations. While useful for understanding basic configuration principles,
	actual performance in production environments may vary significantly based on numerous factors:

	Data Sources and I/O Characteristics:

	- Table providers: Performance differs greatly between Parquet files, CSV files, databases, and cloud storage
	- Storage type: Local SSD, network-attached storage, and cloud storage have vastly different characteristics
	- Network latency: Remote data sources introduce additional latency considerations
	- File sizes and distribution: Large files may benefit differently from partitioning than many small files

	Query and Workload Characteristics:

	- Operation complexity: Simple aggregations versus complex joins, window functions, or nested queries
	- Data distribution: Skewed data may not partition evenly, affecting parallel efficiency
	- Memory usage: Large datasets may require different memory management strategies
	- Concurrent workloads: Multiple queries running simultaneously affect resource allocation

	Hardware and Environment Factors:

	- CPU architecture: Different processors have varying parallel processing capabilities
	- Available memory: Limited RAM may require different optimization strategies
	- System load: Other applications competing for resources affect DataFusion performance

	Recommendations for Production Use:

	To optimize DataFusion for your specific use case, it is strongly recommended to:

	1. Create custom benchmarks using your actual data sources, formats, and query patterns
	2. Test with representative data volumes that match your production workloads
	3. Measure end-to-end performance including data loading, processing, and result handling
	4. Evaluate different configuration combinations for your specific hardware and workload
	5. Monitor resource utilization (CPU, memory, I/O) to identify bottlenecks in your environment

	This approach will provide more accurate insights into how DataFusion configuration options
	will impact your particular applications and infrastructure.

	For more information about available :py:class:`~datafusion.context.SessionConfig` options, see the `rust DataFusion Configuration guide <https://arrow.apache.org/datafusion/user-guide/configs.html>`_,
	and about :code:`RuntimeEnvBuilder` options in the rust `online API documentation <https://docs.rs/datafusion/latest/datafusion/execution/runtime_env/struct.RuntimeEnvBuilder.html>`_.