# PySpark Benchmarks
This directory contains microbenchmarks for PySpark using [ASV (Airspeed Velocity)](https://asv.readthedocs.io/).
## Prerequisites
Install ASV:
```bash
pip install asv
```
To run benchmarks in isolated environments (that is, without `--python=same`), you need an environment manager.
The default configuration uses `virtualenv`; ASV also supports `conda`, `mamba`, `uv`, and other backends. See the ASV documentation for details.
## Running Benchmarks
### Quick run (current environment)
Run benchmarks using your current Python environment (fastest for development):
```bash
cd python/benchmarks
asv run --python=same --quick
```
### Full run against a commit
Run benchmarks in an isolated virtualenv (builds PySpark from source). The `^!` suffix is git revision syntax that selects only the named commit, without its ancestors:
```bash
cd python/benchmarks
asv run master^! # Run on latest master commit
asv run v3.5.0^! # Run on a specific tag
asv run abc123^! # Run on a specific commit
```
### Compare two commits
Compare the current branch (`HEAD`) against `upstream/main`, reporting benchmarks whose results change by more than 10% (`-f 1.1`):
```bash
asv continuous -f 1.1 upstream/main HEAD
```
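To make the `-f 1.1` factor concrete: it is a ratio between the results on the two commits. The sketch below only illustrates that arithmetic and is not ASV's implementation, which may also account for measurement noise. The function name `exceeds_factor` and the sample timings are hypothetical.

```python
def exceeds_factor(base_time: float, new_time: float, factor: float = 1.1) -> bool:
    """Return True if new_time is slower or faster than base_time by more than `factor`."""
    ratio = new_time / base_time
    # Slower by more than the factor, or faster by more than the factor.
    return ratio > factor or ratio < 1.0 / factor


# Hypothetical timings in seconds (base commit vs. new commit).
print(exceeds_factor(1.00, 1.05))  # 5% slower: within the threshold -> False
print(exceeds_factor(1.00, 1.20))  # 20% slower: flagged -> True
print(exceeds_factor(1.00, 0.80))  # 20% faster: also flagged -> True
```

A larger factor such as `-f 1.5` would report only bigger changes, which can help on noisy machines.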
### Other useful commands
```bash
asv check # Validate benchmark syntax
```
## Writing Benchmarks
Benchmarks are Python classes with methods prefixed by:
- `time_*` - Measure execution time
- `peakmem_*` - Measure peak memory usage
- `mem_*` - Measure memory usage of returned object
Example:
```python
# create_test_data() and process() are placeholders for your own code.
class MyBenchmark:
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Called before each benchmark method
        self.data = create_test_data(n_rows, option)

    def time_my_operation(self, n_rows, option):
        # Benchmark timing
        process(self.data)

    def peakmem_my_operation(self, n_rows, option):
        # Benchmark peak memory
        process(self.data)
```
See [ASV documentation](https://asv.readthedocs.io/en/stable/writing_benchmarks.html) for more details.
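For a sense of how `params` expands: ASV runs each benchmark method once per combination of parameter values (the Cartesian product of the lists). The sketch below is a self-contained stand-in that uses only the standard library. `ListSortBenchmark` and the small driver loop are hypothetical and only imitate what ASV does; they are not part of PySpark or ASV.

```python
import itertools
import random


class ListSortBenchmark:
    # Hypothetical example: 2 values x 2 values = 4 parameter combinations.
    params = [[1000, 10000], [False, True]]
    param_names = ["n_rows", "reverse"]

    def setup(self, n_rows, reverse):
        # Fixed seed so every run sorts the same data.
        rng = random.Random(0)
        self.data = [rng.random() for _ in range(n_rows)]

    def time_sort(self, n_rows, reverse):
        # Timed by ASV; the return value is ignored.
        sorted(self.data, reverse=reverse)

    def mem_sorted_copy(self, n_rows, reverse):
        # For mem_* benchmarks, ASV measures the size of the returned object.
        return sorted(self.data, reverse=reverse)


# Rough imitation of a benchmark driver: one setup and one call per combination.
combos = list(itertools.product(*ListSortBenchmark.params))
for n_rows, reverse in combos:
    bench = ListSortBenchmark()
    bench.setup(n_rows, reverse)
    bench.time_sort(n_rows, reverse)

print(len(combos))  # 4
```

Keeping data generation in `setup` means its cost is excluded from the timed section, so `time_sort` measures only the sort.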