
TPC-H

Generating TPC-H data with Spark

Databricks provides tooling for generating TPC-H datasets in a Spark cluster:

https://github.com/databricks/spark-sql-perf

Generating TPC-H data without Spark

For local development and testing, this directory provides a Python script, tpchgen.py, that generates TPC-H CSV data and then converts it to Parquet using DataFusion.

The script requires Docker to be available because it uses the Docker image ghcr.io/scalytics/tpch-docker to run the TPC-H data generator.
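
Because the script depends on Docker, it can be useful to confirm that the docker executable is present before starting a long generation run. The following is a minimal sketch using only the Python standard library; the helper name docker_available is hypothetical and is not part of tpchgen.py.

```python
import shutil


def docker_available(executable: str = "docker") -> bool:
    """Return True if the given executable is found on PATH.

    Hypothetical helper for illustration; not part of tpchgen.py.
    """
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if docker_available():
        print("docker found on PATH")
    else:
        print("docker not found; install Docker before running tpchgen.py")
```

This only checks that the binary can be found on PATH. It does not verify that the Docker daemon is running or that the image can be pulled.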

Data can be generated as a single Parquet file per table by specifying --partitions 1.

Data is written to a directory named data in the current working directory.

python tpchgen.py generate --scale-factor 1 --partitions 1
python tpchgen.py convert --scale-factor 1 --partitions 1
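
After both commands finish, the output can be inspected by listing the Parquet files that were written. The sketch below assumes only that the files end in .parquet and live somewhere under the data directory; the exact layout inside that directory is not described here, and the helper name find_parquet_files is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def find_parquet_files(data_dir: str = "data") -> list[Path]:
    """Return all *.parquet files found anywhere under data_dir, sorted by path.

    Assumption: generated files use the .parquet extension.
    Hypothetical helper for illustration; not part of tpchgen.py.
    """
    return sorted(Path(data_dir).rglob("*.parquet"))


if __name__ == "__main__":
    for path in find_parquet_files():
        print(path)
```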

Data can be generated as multiple Parquet files per table by specifying a --partitions value greater than one.

python tpchgen.py generate --scale-factor 1000 --partitions 64
python tpchgen.py convert --scale-factor 1000 --partitions 64
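
Both steps take the same --scale-factor and --partitions values, so a small wrapper can keep them in sync. The following is a sketch, not part of the repository: the function names tpchgen_commands and run_tpchgen are hypothetical, it assumes tpchgen.py is in the current working directory, and it uses the running interpreter (sys.executable) in place of the python command shown above.

```python
from __future__ import annotations

import subprocess
import sys


def tpchgen_commands(scale_factor: int, partitions: int) -> list[list[str]]:
    """Build the generate and convert command lines with matching arguments.

    Hypothetical helper; only the subcommands and flags shown in this
    README (generate, convert, --scale-factor, --partitions) are used.
    """
    common = ["--scale-factor", str(scale_factor), "--partitions", str(partitions)]
    return [
        [sys.executable, "tpchgen.py", step, *common]
        for step in ("generate", "convert")
    ]


def run_tpchgen(scale_factor: int, partitions: int) -> None:
    """Run generate, then convert; check=True stops at the first failure."""
    for command in tpchgen_commands(scale_factor, partitions):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands for the single-file example instead of running them.
    for command in tpchgen_commands(1, 1):
        print(" ".join(command[1:]))
```

Calling run_tpchgen(1000, 64) would run the two multi-partition commands above in order, raising subprocess.CalledProcessError if either step exits with a non-zero status.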