This directory contains microbenchmarks that compare DataFusion and DuckDB performance on individual SQL functions. Unlike the TPC-H and TPC-DS benchmarks, which test full query execution, these microbenchmarks focus on the performance of specific SQL functions and expressions.
The benchmarks generate synthetic data, write it to Parquet format, and then measure the execution time of various SQL functions across both DataFusion and DuckDB. Results include per-function timing comparisons and summary statistics.
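The sketch below illustrates that flow under some assumptions: it uses the `pyarrow`, `datafusion`, and `duckdb` Python packages to generate a small synthetic string column, write it to Parquet, and time one function (`trim`) in both engines. The column name, file name, and overall structure are illustrative and are not taken from `microbenchmarks.py`.

```python
# Minimal sketch of the benchmark flow (illustrative names, not the actual script).
import time

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
from datafusion import SessionContext

# 1. Generate synthetic data and write it to Parquet.
table = pa.table({"s": ["  hello  ", "  world  "] * 500_000})
pq.write_table(table, "bench_data.parquet")

query = "SELECT trim(s) FROM t"

# 2. Time the query in DataFusion.
ctx = SessionContext()
ctx.register_parquet("t", "bench_data.parquet")
start = time.perf_counter()
ctx.sql(query).collect()
datafusion_ms = (time.perf_counter() - start) * 1000

# 3. Time the same query in DuckDB.
con = duckdb.connect()
con.execute("CREATE VIEW t AS SELECT * FROM 'bench_data.parquet'")
start = time.perf_counter()
con.execute(query).fetchall()
duckdb_ms = (time.perf_counter() - start) * 1000

print(f"trim: DataFusion {datafusion_ms:.2f} ms, DuckDB {duckdb_ms:.2f} ms")
```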
Create a virtual environment and install dependencies:
```bash
cd microbenchmarks
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
Run a benchmark:
```bash
python microbenchmarks.py
```
| Option | Default | Description |
|---|---|---|
| --rows | 1000000 | Number of rows in the generated test data |
| --warmup | 2 | Number of warmup iterations before timing |
| --iterations | 5 | Number of timed iterations (results are averaged) |
| --output | stdout | Output file path for markdown results |
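As a rough sketch of how `--warmup` and `--iterations` interact, each query is run a few times untimed to warm caches and code paths, and the timed runs are then averaged. The helper name and structure below are assumptions for illustration, not the script's actual code:

```python
import time

def time_query(run_query, warmup: int = 2, iterations: int = 5) -> float:
    """Run `run_query` `warmup` times untimed, then average `iterations` timed runs.

    `run_query` is any zero-argument callable that executes the SQL function
    under test; the return value is the mean wall-clock time in milliseconds.
    """
    for _ in range(warmup):
        run_query()

    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        total += time.perf_counter() - start

    return total / iterations * 1000
```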
Run the benchmark with default settings:
```bash
python microbenchmarks.py
```
Run the benchmark with 10 million rows:
```bash
python microbenchmarks.py --rows 10000000
```
Run the benchmark and save results to a file:
```bash
python microbenchmarks.py --output results.md
```
The benchmark outputs a markdown table comparing execution times:
| Function | DataFusion (ms) | DuckDB (ms) | Speedup | Faster |
|---|---|---|---|---|
| trim | 12.34 | 15.67 | 1.27x | DataFusion |
| lower | 8.90 | 7.50 | 1.19x | DuckDB |
| ... | ... | ... | ... | ... |
A summary section shows overall statistics including how many functions each engine won and total execution times.
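A minimal sketch of how the speedup, winner, and summary figures could be derived from per-engine timings is shown below; the `results` dictionary and its values simply reuse the example numbers from the table above and are assumptions for illustration, not output of the actual script:

```python
from collections import Counter

# Hypothetical per-function timings in milliseconds: (datafusion_ms, duckdb_ms).
results = {
    "trim": (12.34, 15.67),
    "lower": (8.90, 7.50),
}

wins = Counter()
totals = {"DataFusion": 0.0, "DuckDB": 0.0}

for func, (df_ms, duck_ms) in results.items():
    faster = "DataFusion" if df_ms < duck_ms else "DuckDB"
    speedup = max(df_ms, duck_ms) / min(df_ms, duck_ms)
    wins[faster] += 1
    totals["DataFusion"] += df_ms
    totals["DuckDB"] += duck_ms
    print(f"| {func} | {df_ms:.2f} | {duck_ms:.2f} | {speedup:.2f}x | {faster} |")

print(f"Wins: {dict(wins)}; total times (ms): {totals}")
```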