
# Microbenchmarks

This directory contains microbenchmarks for comparing DataFusion and DuckDB performance on individual SQL functions. Unlike the TPC-H and TPC-DS benchmarks which test full query execution, these microbenchmarks focus on the performance of specific SQL functions and expressions.

## Overview

The benchmarks generate synthetic data, write it to Parquet format, and then measure the execution time of various SQL functions across both DataFusion and DuckDB. Results include per-function timing comparisons and summary statistics.
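The core measurement loop might look roughly like the sketch below. This is illustrative only: the file name, the `trim` example, and the single-iteration timing are assumptions; the actual `microbenchmarks.py` handles warmup runs, multiple iterations, and many functions.

```python
# Illustrative sketch of a single function benchmark (not the real script).
import time
import random
import string

import datafusion
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Generate synthetic string data and write it to Parquet.
rows = 1_000_000
values = [
    "  " + "".join(random.choices(string.ascii_letters, k=12)) + "  "
    for _ in range(rows)
]
pq.write_table(pa.table({"s": values}), "data.parquet")

# Time the expression in DataFusion...
ctx = datafusion.SessionContext()
ctx.register_parquet("t", "data.parquet")
start = time.perf_counter()
ctx.sql("SELECT trim(s) FROM t").collect()
datafusion_ms = (time.perf_counter() - start) * 1000

# ...and in DuckDB, reading the same Parquet file.
con = duckdb.connect()
start = time.perf_counter()
con.execute("SELECT trim(s) FROM 'data.parquet'").fetchall()
duckdb_ms = (time.perf_counter() - start) * 1000

print(f"trim: DataFusion {datafusion_ms:.2f} ms, DuckDB {duckdb_ms:.2f} ms")
```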

## Setup

Create a virtual environment and install dependencies:

```bash
cd microbenchmarks
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Usage

Run a benchmark:

```bash
python microbenchmarks.py
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| `--rows` | 1000000 | Number of rows in the generated test data |
| `--warmup` | 2 | Number of warmup iterations before timing |
| `--iterations` | 5 | Number of timed iterations (results are averaged) |
| `--output` | stdout | Output file path for markdown results |

## Examples

Run the benchmark with default settings:

```bash
python microbenchmarks.py
```

Run the benchmark with 10 million rows:

```bash
python microbenchmarks.py --rows 10000000
```

Run the benchmark and save results to a file:

```bash
python microbenchmarks.py --output results.md
```
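Options can also be combined; for example (an illustrative combination of the flags listed above):

```bash
python microbenchmarks.py --rows 10000000 --warmup 3 --iterations 10 --output results.md
```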

## Output

The benchmark outputs a markdown table comparing execution times:

| Function | DataFusion (ms) | DuckDB (ms) | Speedup | Faster |
|----------|-----------------|-------------|---------|--------|
| trim | 12.34 | 15.67 | 1.27x | DataFusion |
| lower | 8.90 | 7.50 | 1.19x | DuckDB |
| ... | ... | ... | ... | ... |

A summary section shows overall statistics including how many functions each engine won and total execution times.
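The speedup column is the ratio of the slower engine's time to the faster engine's time. A minimal sketch of how the table rows and summary could be derived from per-function timings is shown below; the `results` structure and formatting here are illustrative assumptions, not the script's actual internals, and the sample numbers come from the table above.

```python
# Hypothetical post-processing of per-function timings into the output table
# and summary; column names mirror the markdown table above.
results = [
    {"function": "trim",  "datafusion_ms": 12.34, "duckdb_ms": 15.67},
    {"function": "lower", "datafusion_ms": 8.90,  "duckdb_ms": 7.50},
]

wins = {"DataFusion": 0, "DuckDB": 0}
for r in results:
    faster = "DataFusion" if r["datafusion_ms"] <= r["duckdb_ms"] else "DuckDB"
    # Speedup is the slower time divided by the faster time.
    speedup = max(r["datafusion_ms"], r["duckdb_ms"]) / min(r["datafusion_ms"], r["duckdb_ms"])
    wins[faster] += 1
    print(f'| {r["function"]} | {r["datafusion_ms"]:.2f} | {r["duckdb_ms"]:.2f} '
          f'| {speedup:.2f}x | {faster} |')

# Summary statistics: win counts and total execution time per engine.
total_df = sum(r["datafusion_ms"] for r in results)
total_duck = sum(r["duckdb_ms"] for r in results)
print(f"DataFusion wins: {wins['DataFusion']}, DuckDB wins: {wins['DuckDB']}")
print(f"Total: DataFusion {total_df:.2f} ms, DuckDB {total_duck:.2f} ms")
```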