python/benchmarks/README.md - spark - Git at Google

 # PySpark Benchmarks

 This directory contains microbenchmarks for PySpark using [ASV (Airspeed Velocity)](https://asv.readthedocs.io/).

 ## Prerequisites

 Install ASV:

 ```bash
 pip install asv
 ```

 For running benchmarks with isolated environments (without `--python=same`), you need an environment manager.
 The default configuration uses `virtualenv`, but ASV also supports `conda`, `mamba`, `uv`, and some others. See the official docs for details.

 ## Running Benchmarks

 All commands below can be run from the Spark root directory using `./python/asv`,
 which is a wrapper that forwards arguments to `asv` in the benchmarks directory.

 ### Quick run (current environment)

 Run benchmarks using your current Python environment (fastest for development):

 ```bash
 ./python/asv run --python=same --quick
 ```

 You can also specify the test class to run:

 ```bash
 ./python/asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
 ```

 ### Full run against a commit

 Run benchmarks in an isolated virtualenv (builds pyspark from source):

 ```bash
 ./python/asv run master^!          # Run on latest master commit
 ./python/asv run v3.5.0^!          # Run on a specific tag
 ./python/asv run abc123^!          # Run on a specific commit
 ```

 ### Compare two commits

 Compare current branch against upstream/master with 10% threshold:

 ```bash
 ./python/asv continuous -f 1.1 upstream/master HEAD
 ```

 ### Other useful commands

 ```bash
 ./python/asv check          # Validate benchmark syntax
 ```

 ## Writing Benchmarks

 Benchmarks are Python classes with methods prefixed by:
 - `time_*` - Measure execution time
 - `peakmem_*` - Measure peak memory usage
 - `mem_*` - Measure memory usage of returned object

 Example:

 ```python
 class MyBenchmark:
     params = [[1000, 10000], ["option1", "option2"]]
     param_names = ["n_rows", "option"]

     def setup(self, n_rows, option):
         # Called before each benchmark method
         self.data = create_test_data(n_rows, option)

     def time_my_operation(self, n_rows, option):
         # Benchmark timing
         process(self.data)

     def peakmem_my_operation(self, n_rows, option):
         # Benchmark peak memory
         process(self.data)
 ```

 See [ASV documentation](https://asv.readthedocs.io/en/stable/writing_benchmarks.html) for more details.
	# PySpark Benchmarks

	This directory contains microbenchmarks for PySpark using [ASV (Airspeed Velocity)](https://asv.readthedocs.io/).

	## Prerequisites

	Install ASV:

	```bash
	pip install asv
	```

	For running benchmarks with isolated environments (without `--python=same`), you need an environment manager.
	The default configuration uses `virtualenv`, but ASV also supports `conda`, `mamba`, `uv`, and some others. See the official docs for details.

	## Running Benchmarks

	All commands below can be run from the Spark root directory using `./python/asv`,
	which is a wrapper that forwards arguments to `asv` in the benchmarks directory.

	### Quick run (current environment)

	Run benchmarks using your current Python environment (fastest for development):

	```bash
	./python/asv run --python=same --quick
	```

	You can also specify the test class to run:

	```bash
	./python/asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
	```

	### Full run against a commit

	Run benchmarks in an isolated virtualenv (builds pyspark from source):

	```bash
	./python/asv run master^! # Run on latest master commit
	./python/asv run v3.5.0^! # Run on a specific tag
	./python/asv run abc123^! # Run on a specific commit
	```

	### Compare two commits

	Compare current branch against upstream/master with 10% threshold:

	```bash
	./python/asv continuous -f 1.1 upstream/master HEAD
	```

	### Other useful commands

	```bash
	./python/asv check # Validate benchmark syntax
	```

	## Writing Benchmarks

	Benchmarks are Python classes with methods prefixed by:
	- `time_*` - Measure execution time
	- `peakmem_*` - Measure peak memory usage
	- `mem_*` - Measure memory usage of returned object

	Example:

	```python
	class MyBenchmark:
	params = [[1000, 10000], ["option1", "option2"]]
	param_names = ["n_rows", "option"]

	def setup(self, n_rows, option):
	# Called before each benchmark method
	self.data = create_test_data(n_rows, option)

	def time_my_operation(self, n_rows, option):
	# Benchmark timing
	process(self.data)

	def peakmem_my_operation(self, n_rows, option):
	# Benchmark peak memory
	process(self.data)
	```

	See [ASV documentation](https://asv.readthedocs.io/en/stable/writing_benchmarks.html) for more details.