This directory contains microbenchmarks that compare DataFusion and DuckDB performance on individual SQL functions. Unlike the TPC-H and TPC-DS benchmarks, which test full query execution, these microbenchmarks focus on the performance of specific SQL functions and expressions.
The benchmarks generate synthetic data, write it to Parquet format, and then measure the execution time of various SQL functions across both DataFusion and DuckDB. Results include per-function timing comparisons and summary statistics.
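The sketch below illustrates that flow under some assumptions: it uses the `pyarrow`, `datafusion`, and `duckdb` Python packages to generate a small synthetic string column, write it to Parquet, and time one function (`trim`) in both engines. The column name, file name, and overall structure are illustrative and are not taken from `microbenchmarks.py`.

```python
# Minimal sketch of the benchmark flow (illustrative names, not the actual script).
import time

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from datafusion import SessionContext

# 1. Generate synthetic data and write it to Parquet.
table = pa.table({"s": ["  hello  ", "  world  "] * 500_000})
pq.write_table(table, "bench_data.parquet")

query = "SELECT trim(s) FROM t"

# 2. Time the query in DataFusion.
ctx = SessionContext()
ctx.register_parquet("t", "bench_data.parquet")
start = time.perf_counter()
ctx.sql(query).collect()
datafusion_ms = (time.perf_counter() - start) * 1000

# 3. Time the same query in DuckDB.
con = duckdb.connect()
con.execute("CREATE VIEW t AS SELECT * FROM 'bench_data.parquet'")
start = time.perf_counter()
con.execute(query).fetchall()
duckdb_ms = (time.perf_counter() - start) * 1000

print(f"trim: DataFusion {datafusion_ms:.2f} ms, DuckDB {duckdb_ms:.2f} ms")
```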
Create a virtual environment and install dependencies:
```bash
cd microbenchmarks
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Run a benchmark:
```bash
python microbenchmarks.py
```
| Option | Default | Description |
|---|---|---|
| --rows | 1000000 | Number of rows in the generated test data |
| --warmup | 2 | Number of warmup iterations before timing |
| --iterations | 5 | Number of timed iterations (results are averaged) |
| --output | stdout | Output file path for markdown results |
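As a rough sketch of how `--warmup` and `--iterations` interact, each query is run a few times untimed to warm caches and code paths, and the timed runs are then averaged. The helper name and structure below are assumptions for illustration, not the script's actual code:

```python
import time

def time_query(run_query, warmup: int = 2, iterations: int = 5) -> float:
    """Run `run_query` `warmup` times untimed, then average `iterations` timed runs.

    `run_query` is any zero-argument callable that executes the SQL function
    under test; the return value is the mean wall-clock time in milliseconds.
    """
    for _ in range(warmup):
        run_query()

    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start

    return total / iterations * 1000
```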
Run the benchmark with default settings:
```bash
python microbenchmarks.py
```
Run the benchmark with 10 million rows:
```bash
python microbenchmarks.py --rows 10000000
```
Run the benchmark and save results to a file:
```bash
python microbenchmarks.py --output results.md
```
The benchmark outputs a markdown table comparing execution times:
| Function | DataFusion (ms) | DuckDB (ms) | Speedup | Faster |
|---|---|---|---|---|
| trim | 12.34 | 15.67 | 1.27x | DataFusion |
| lower | 8.90 | 7.50 | 1.19x | DuckDB |
| ... | ... | ... | ... | ... |
A summary section shows overall statistics including how many functions each engine won and total execution times.
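A minimal sketch of how the speedup, winner, and summary figures could be derived from per-engine timings is shown below; the `results` dictionary and its values simply reuse the example numbers from the table above and are assumptions for illustration, not output of the actual script:

```python
from collections import Counter

# Hypothetical per-function timings in milliseconds: (datafusion_ms, duckdb_ms).
results = {
    "trim": (12.34, 15.67),
    "lower": (8.90, 7.50),
}

wins = Counter()
totals = {"DataFusion": 0.0, "DuckDB": 0.0}

for func, (df_ms, duck_ms) in results.items():
    faster = "DataFusion" if df_ms < duck_ms else "DuckDB"
    speedup = max(df_ms, duck_ms) / min(df_ms, duck_ms)
    wins[faster] += 1
    totals["DataFusion"] += df_ms
    totals["DuckDB"] += duck_ms
    print(f"| {func} | {df_ms:.2f} | {duck_ms:.2f} | {speedup:.2f}x | {faster} |")

print(f"Wins: {dict(wins)}; total times (ms): {totals}")
```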