Apache DataFusion Benchmarks

Clone this repo:
  1. c472383 chore: support specific query (#23) by Oleks V · 4 weeks ago main
  2. e82732f Fix comparing results example in readme (#20) by Zhen Wang · 7 months ago
  3. 3668e39 fix q35 (#17) by Onur Satici · 12 months ago
  4. 86da77b remove config flag conflict in datafusion runner (#16) by Onur Satici · 12 months ago
  5. 42335ac Add TPC-DS scripts and documentation (#7) by Andy Grove · 1 year ago

Apache DataFusion Benchmarks

Overview

This repository is intended as a central resource for documentation and scripts for running queries derived from the industry standard TPC-H and TPC-DS benchmarks against DataFusion and its subprojects, as well as against other open-source query engines for comparison.

TPC-H and TPC-DS both operate on synthetic data, which can be generated at different “scale factors”. A scale factor of 1 means that approximately 1 GB of CSV data is generated, and a scale factor of 1000 means that approximately 1 TB of data is generated.

TPC Legal Considerations

It is important to know that TPC benchmarks are copyrighted IP of the Transaction Processing Council. Only members of the TPC consortium are allowed to publish TPC benchmark results. Fun fact: only four companies have published official TPC-DS benchmark results so far, and those results can be seen here.

However, anyone is welcome to create derivative benchmarks under the TPC's fair use policy, and that is what we are doing here. We do not aim to run a true TPC benchmark (which is a significant endeavor). We are just running the individual queries and recording the timings.

Throughout this document and when talking about these benchmarks, you will see the term “derived from TPC-H” or “derived from TPC-DS”. We are required to use this terminology and this is explained in the fair-use policy (PDF).

DataFusion benchmarks are a Non-TPC Benchmark. Any comparison between official TPC Results with non-TPC workloads is prohibited by the TPC.

Data Generation

See the benchmark-specific instructions for generating the CSV data for TPC-H and TPC-DS and for converting that data to Parquet format. Although it is valid to run benchmarks against CSV data, this does not really represent how most of the world is running OLAP queries, especially when dealing with large datasets. When benchmarking DataFusion and its subprojects, we typically want to be querying Parquet data. Also, we typically do not want a single file per table, so we also need to repartition the data. The provided scripts take care of this conversion and repartitioning.

Running the Benchmarks with DataFusion

Scripts are available for the following DataFusion projects:

These benchmarking scripts produce JSON files containing query timings.

Comparing Results

The Python script scripts/generate-comparison.py can be used to produce charts comparing results from different benchmark runs.

For example:

python scripts/generate-comparison.py file1.json file2.json --labels "Spark" "Comet" --benchmark "tpch" --title "TPC-H 100GB"

This will create image files in the current directory in PNG format.

Legal Notices

TPC-H is Copyright © 1993-2022 Transaction Processing Performance Council. The full TPC-H specification in PDF format can be found here.

TPC-DS is Copyright © 2021 Transaction Processing Performance Council. The full TPC-DS specification in PDF format can be found here.

TPC, TPC Benchmark, TPC-H, and TPC-DS are trademarks of the Transaction Processing Performance Council.