This directory contains microbenchmarks for PySpark using ASV (Airspeed Velocity).
Install ASV:
pip install asv
To run benchmarks in isolated environments (that is, without --python=same), you need an environment manager. The default configuration uses virtualenv; ASV also supports conda, mamba, uv, and others. See the official ASV documentation for details.
All commands below can be run from the Spark root directory using ./python/asv, which is a wrapper that forwards arguments to asv in the benchmarks directory.
Run benchmarks using your current Python environment (fastest for development):
./python/asv run --python=same --quick
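You can also select which benchmarks to run with -b, which takes a regular expression matched against benchmark names, for example a single benchmark class: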
./python/asv run --python=same --quick -b 'bench_arrow.LongArrowToPandasBenchmark'
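The -b selection can be pictured as a regular-expression search over fully qualified benchmark names. The sketch below is only an illustration of that idea, not ASV's implementation; the method names in the list (time_example, peakmem_example) and the other class and module names are hypothetical.

```python
import re


def select_benchmarks(names, pattern):
    """Return the names that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


# Hypothetical benchmark names in "module.Class.method" form.
names = [
    "bench_arrow.LongArrowToPandasBenchmark.time_example",
    "bench_arrow.OtherBenchmark.time_example",
    "bench_other.SomeBenchmark.peakmem_example",
]

selected = select_benchmarks(names, "bench_arrow.LongArrowToPandasBenchmark")
print(selected)
```

Because the pattern is a regular expression, a shorter pattern such as 'bench_arrow' would select every benchmark in that module.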
Run benchmarks in an isolated virtualenv (builds pyspark from source):
./python/asv run master^!   # Run on latest master commit
./python/asv run v3.5.0^!   # Run on a specific tag
./python/asv run abc123^!   # Run on a specific commit
Compare the current branch against upstream/master with a 10% threshold (a factor of 1.1):
./python/asv continuous -f 1.1 upstream/master HEAD
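To make the factor concrete, here is a minimal sketch of how a ratio threshold of 1.1 separates changes from noise. It assumes a plain ratio comparison and ignores any statistical checks ASV may apply on top; the function name classify and its labels are hypothetical.

```python
def classify(before, after, factor=1.1):
    """Label a timing change using a simple ratio threshold.

    Assumption: only the after/before ratio is considered.
    """
    ratio = after / before
    if ratio > factor:
        return "regression"
    if ratio < 1 / factor:
        return "improvement"
    return "unchanged"


# Timings in seconds, made up for illustration.
print(classify(1.00, 1.20))  # more than 10% slower
print(classify(1.00, 1.05))  # within the threshold
print(classify(1.00, 0.80))  # noticeably faster
```

A larger factor, such as 1.5, reports only bigger changes and can help on noisy machines.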
./python/asv check # Validate benchmark syntax
Benchmarks are Python classes with methods prefixed by:
time_* - Measure execution time
peakmem_* - Measure peak memory usage
mem_* - Measure memory usage of returned object

Example:
class MyBenchmark:
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Called before each benchmark method
        self.data = create_test_data(n_rows, option)

    def time_my_operation(self, n_rows, option):
        # Benchmark timing
        process(self.data)

    def peakmem_my_operation(self, n_rows, option):
        # Benchmark peak memory
        process(self.data)
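The example above relies on create_test_data and process, which are placeholders. The following self-contained sketch fills them in with hypothetical pure-Python stand-ins and adds a mem_* method, so the file can be imported and exercised without Spark. The class name MyListBenchmark and the helper run_once are assumptions made for illustration; run_once only mimics the setup-then-call order and is not how ASV itself drives benchmarks.

```python
import itertools


def create_test_data(n_rows, option):
    # Hypothetical stand-in: a list of strings tagged with the option.
    return [f"{option}-{i}" for i in range(n_rows)]


def process(data):
    # Hypothetical stand-in for the operation being measured.
    return sorted(data, reverse=True)


class MyListBenchmark:
    # One benchmark run per combination of n_rows and option.
    params = [[1000, 10000], ["option1", "option2"]]
    param_names = ["n_rows", "option"]

    def setup(self, n_rows, option):
        # Build the input outside the measured methods.
        self.data = create_test_data(n_rows, option)

    def time_process(self, n_rows, option):
        process(self.data)

    def peakmem_process(self, n_rows, option):
        process(self.data)

    def mem_processed(self, n_rows, option):
        # The returned object is what gets measured for mem_* methods.
        return process(self.data)


def run_once(bench_cls):
    """Call setup and each time_* method once per parameter combination."""
    calls = 0
    for combo in itertools.product(*bench_cls.params):
        bench = bench_cls()
        bench.setup(*combo)
        for name in dir(bench):
            if name.startswith("time_"):
                getattr(bench, name)(*combo)
                calls += 1
    return calls


print(run_once(MyListBenchmark))
```

Running the methods by hand like this is a quick way to catch errors in setup or in parameter handling before invoking ./python/asv run.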
See ASV documentation for more details.