<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Apache DataFusion Benchmarks
## Overview
This repository is intended as a central resource for documentation and scripts for running queries derived from the
industry-standard TPC-H and TPC-DS benchmarks against DataFusion and its subprojects, as well as against other
open-source query engines for comparison.
TPC-H and TPC-DS both operate on synthetic data, which can be generated at different "scale factors". A scale factor
of 1 means that approximately 1 GB of CSV data is generated, and a scale factor of 1000 means that approximately 1 TB
of data is generated.
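As a rough illustration of the rule of thumb above, the following sketch converts a scale factor into an approximate CSV data size. The function name `approx_csv_size` is hypothetical and not part of this repository, and the sizes are approximations only.

```python
def approx_csv_size(scale_factor: float) -> str:
    """Return a rough, human-readable size of the CSV data for a scale factor.

    Assumes the rule of thumb that one unit of scale factor is about 1 GB of CSV.
    """
    if scale_factor <= 0:
        raise ValueError("scale factor must be positive")
    size_gb = scale_factor * 1.0  # ~1 GB of CSV per unit of scale factor
    if size_gb >= 1000:
        return f"~{size_gb / 1000:g} TB"
    return f"~{size_gb:g} GB"


for sf in (1, 100, 1000):
    print(f"SF {sf}: {approx_csv_size(sf)}")
```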
## TPC Legal Considerations
It is important to know that TPC benchmarks are the copyrighted intellectual property of the Transaction Processing
Performance Council (TPC). Only members of the TPC consortium are allowed to publish TPC benchmark results. Fun fact: only four companies have published
official TPC-DS benchmark results so far, and those results can be seen [here](https://www.tpc.org/tpcds/results/tpcds_results5.asp?orderby=dbms&version=3).
However, anyone is welcome to create derivative benchmarks under the TPC's fair use policy, and that is what we are
doing here. We do not aim to run a true TPC benchmark (which is a significant endeavor). We are just running the
individual queries and recording the timings.
Throughout this document and when talking about these benchmarks, you will see the term "derived from TPC-H" or
"derived from TPC-DS". We are required to use this terminology, as explained in the
[fair-use policy (PDF)](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc_fair_use_quick_reference_v1.0.0.pdf).
The DataFusion benchmarks are a Non-TPC Benchmark. Any comparison between official TPC Results and non-TPC workloads is
prohibited by the TPC.
## Data Generation
See the benchmark-specific instructions for generating the CSV data for [TPC-H](tpch) and [TPC-DS](tpcds) and for
converting that data to Parquet format. Although it is valid to run benchmarks against CSV data, CSV is not
representative of how OLAP queries are typically run, especially against large datasets. When benchmarking
DataFusion and its subprojects, we typically want to query Parquet data. We also typically do not
want a single file per table, so the data needs to be repartitioned into multiple files. The provided scripts take care of this conversion
and repartitioning.
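To make the repartitioning idea concrete, here is a minimal standard-library sketch that splits one CSV file into several smaller files by distributing rows round-robin. It is an illustration only and is not how the provided scripts work: it writes CSV rather than Parquet, the function name `repartition_csv` and the `part-N.csv` file naming are hypothetical, and it assumes the input file has a header row.

```python
import csv
from pathlib import Path


def repartition_csv(src: Path, out_dir: Path, num_partitions: int) -> list:
    """Split one CSV file into `num_partitions` files, distributing rows round-robin.

    Illustrative only: assumes a header row and writes CSV, not Parquet.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"part-{i}.csv" for i in range(num_partitions)]
    files = [p.open("w", newline="") for p in paths]
    try:
        writers = [csv.writer(f) for f in files]
        with src.open(newline="") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if header is not None:
                for writer in writers:
                    writer.writerow(header)  # every partition keeps the header
            for i, row in enumerate(reader):
                writers[i % num_partitions].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths
```

Round-robin assignment keeps the partitions close to equal in size, which is usually what matters when a query engine reads the files in parallel.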
## Running the Benchmarks with DataFusion
Scripts are available for the following DataFusion projects:
- [DataFusion Python](./runners/datafusion-python)
- [DataFusion Comet](./runners/datafusion-comet)
These benchmarking scripts produce JSON files containing query timings.
## Comparing Results
The Python script [scripts/generate-comparison.py](scripts/generate-comparison.py) can be used to produce charts
comparing results from different benchmark runs.
For example:
```shell
python scripts/generate-comparison.py file1.json file2.json --labels "Spark" "Comet" --benchmark "tpch" --title "TPC-H 100GB"
```
This will create image files in the current directory in PNG format.
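For a quick numeric summary alongside the charts, per-query speedups can be computed directly from two result files. The sketch below is independent of `generate-comparison.py` and relies on an assumed file layout: a JSON object mapping each query name to a list of timings in seconds, for example `{"1": [2.1, 2.0]}`. The actual files produced by the runners may be structured differently, so treat the function names and the schema as hypothetical and adapt the loader as needed.

```python
import json
from statistics import median


def load_median_timings(path):
    """Load a results file and return {query: median time in seconds}.

    ASSUMPTION: the file is a JSON object mapping query names to lists of
    timings in seconds. The real result files may use a different layout.
    """
    with open(path) as f:
        raw = json.load(f)
    # Skip queries with no recorded timings.
    return {query: median(times) for query, times in raw.items() if times}


def speedups(baseline, candidate):
    """Return {query: baseline / candidate} for queries present in both runs."""
    return {
        query: baseline[query] / candidate[query]
        for query in baseline
        if query in candidate and candidate[query] > 0
    }
```

A value above 1.0 means the candidate run was faster than the baseline for that query. Queries that appear in only one of the two files are left out of the result.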
## Legal Notices
TPC-H is Copyright © 1993-2022 Transaction Processing Performance Council. The full TPC-H specification in PDF format
can be found [here](https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-H_v3.0.1.pdf).
TPC-DS is Copyright © 2021 Transaction Processing Performance Council. The full TPC-DS specification in PDF format
can be found [here](https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3.2.0.pdf).
TPC, TPC Benchmark, TPC-H, and TPC-DS are trademarks of the Transaction Processing Performance Council.