blob: 38431b22bbae286b29fd17c00ef2a7511b5c431a [file] [log] [blame] [view]
# PySpark Benchmarks
This directory contains microbenchmarks for PySpark using [ASV (Airspeed Velocity)](https://asv.readthedocs.io/).
## Prerequisites
Install ASV:
```bash
pip install asv
```
For running benchmarks with isolated environments (without `--python=same`), you need an environment manager.
The default configuration uses `virtualenv`, but ASV also supports `conda`, `mamba`, `uv`, and some others. See the official docs for details.
## Running Benchmarks
All commands below can be run from the Spark root directory using `./python/asv`,
which is a wrapper that forwards arguments to `asv` in the benchmarks directory.
### Quick run (current environment)
Run benchmarks using your current Python environment (fastest for development):
```bash
./python/asv run --python=same --quick
```
You can also specify the test class to run:
```bash
./python/asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
```
### Full run against a commit
Run benchmarks in an isolated virtualenv (builds pyspark from source):
```bash
./python/asv run master^! # Run on latest master commit
./python/asv run v3.5.0^! # Run on a specific tag
./python/asv run abc123^! # Run on a specific commit
```
### Compare two commits
Compare current branch against upstream/master with 10% threshold:
```bash
./python/asv continuous -f 1.1 upstream/master HEAD
```
### Other useful commands
```bash
./python/asv check # Validate benchmark syntax
```
## Writing Benchmarks
Benchmarks are Python classes with methods prefixed by:
- `time_*` - Measure execution time
- `peakmem_*` - Measure peak memory usage
- `mem_*` - Measure memory usage of returned object
Example:
```python
class MyBenchmark:
params = [[1000, 10000], ["option1", "option2"]]
param_names = ["n_rows", "option"]
def setup(self, n_rows, option):
# Called before each benchmark method
self.data = create_test_data(n_rows, option)
def time_my_operation(self, n_rows, option):
# Benchmark timing
process(self.data)
def peakmem_my_operation(self, n_rows, option):
# Benchmark peak memory
process(self.data)
```
See [ASV documentation](https://asv.readthedocs.io/en/stable/writing_benchmarks.html) for more details.