<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# DataFusion Python Examples for TPC-H
These examples reproduce the problems listed in the Transaction Processing
Performance Council (TPC) TPC-H benchmark. Their purpose is to demonstrate how
to use different aspects of DataFusion; they are not necessarily the most
performant queries possible. Each example includes a description of the problem
it solves. If you are familiar with SQL, you can compare the approaches in
these examples with the queries listed in the specification.
- https://www.tpc.org/tpch/
The examples provided are based on version 2.18.0 of the TPC-H specification.
## Data Setup
To run these examples, you must first generate a dataset. The `dbgen` tool
provided by TPC can create datasets of arbitrary scale. For testing it is
typically sufficient to create a 1 gigabyte dataset. For convenience, this
repository has a script which uses Docker to create this dataset. From the
`benchmarks/tpch` directory, execute the following script.
```bash
./tpch-gen.sh 1
```
The examples provided use Parquet files for the tables generated by `dbgen`.
A Python script is provided to convert the text files from `dbgen` into the
Parquet files expected by the examples. From the `examples/tpch` directory, you
can execute the following command to create the necessary Parquet files.
```bash
python convert_data_to_parquet.py
```
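For orientation, the text files that `dbgen` writes are pipe-delimited, with an
extra `|` closing each line. The sketch below is not the repository's conversion
script. It only illustrates, using the Python standard library, how such rows
can be read into named columns before being written out in another format. The
sample rows, their comment values, and the `read_tbl` helper are hypothetical;
the column names follow the `region` table in the TPC-H specification.

```python
import csv
import io

# Two made-up sample rows in the shape dbgen emits for the `region` table:
# fields separated by "|" with an extra "|" closing each line.
SAMPLE_TBL = (
    "0|AFRICA|sample comment|\n"
    "1|AMERICA|another comment|\n"
)

# Column names for `region` as given in the TPC-H specification.
REGION_COLUMNS = ["r_regionkey", "r_name", "r_comment"]


def read_tbl(handle, columns):
    """Parse dbgen-style text rows into a list of dicts keyed by column name.

    Hypothetical helper for illustration only.
    """
    rows = []
    for record in csv.reader(handle, delimiter="|"):
        if not record:
            continue  # skip blank lines
        # The trailing "|" yields one empty extra field; keep only real columns.
        values = record[: len(columns)]
        rows.append(dict(zip(columns, values)))
    return rows


rows = read_tbl(io.StringIO(SAMPLE_TBL), REGION_COLUMNS)

# Cast the key column to int, as a typed columnar schema would require.
for row in rows:
    row["r_regionkey"] = int(row["r_regionkey"])

print(rows)
```

A real conversion would then hand the typed rows to a Parquet writer. Which
library the provided script uses for that step is not shown here.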
## Description of Examples
For easier access, a description of the techniques demonstrated in each file
is provided in the `README.md` file in the `examples` directory.