Databricks provides tooling for generating TPC-H datasets in a Spark cluster:
https://github.com/databricks/spark-sql-perf
For local development and testing, we provide a Python script (`tpchgen.py`) that generates TPC-H data in CSV format and converts it to Parquet using DataFusion.
Docker must be available, because the script uses the Docker image `ghcr.io/scalytics/tpch-docker` to run the TPC-H data generator.
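For reference, the generation step boils down to running the TPC-H `dbgen` tool inside that container. Below is a minimal sketch of what such an invocation might look like; the flags passed to the container entrypoint (`-vf`, `-s`) are standard `dbgen` flags but their pass-through is an assumption here, so consult `tpchgen.py` for the exact invocation:

```python
import subprocess
from pathlib import Path

def run_dbgen(scale_factor: int, output_dir: Path) -> None:
    """Sketch: run TPC-H dbgen inside the Docker image.

    The entrypoint flags are an assumption; tpchgen.py may invoke the
    container differently.
    """
    output_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "docker", "run", "--rm",
            # Mount the host output directory so the generated .tbl/.csv
            # files persist after the container exits.
            "-v", f"{output_dir.resolve()}:/data",
            "ghcr.io/scalytics/tpch-docker",
            "-vf", "-s", str(scale_factor),
        ],
        check=True,
    )
```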
Data can be generated as a single Parquet file per table by specifying `--partitions 1`. The data will be written to a `data` directory in the current working directory.
```bash
python tpchgen.py generate --scale-factor 1 --partitions 1
python tpchgen.py convert --scale-factor 1 --partitions 1
```
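Once converted, the Parquet files can be queried directly with DataFusion's Python bindings. Here is a minimal sketch, assuming the `datafusion` package is installed and the convert step wrote a single file such as `data/lineitem.parquet` (the exact file name is an assumption about the script's output layout):

```python
from datafusion import SessionContext

# Register the single-file lineitem table; the path is an assumption
# about where `tpchgen.py convert` writes its output.
ctx = SessionContext()
ctx.register_parquet("lineitem", "data/lineitem.parquet")

# Sanity-check the generated data with a simple aggregate.
ctx.sql("SELECT count(*) AS row_count FROM lineitem").show()
```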
Data can be generated as multiple Parquet files per table by specifying a `--partitions` value greater than one.
```bash
python tpchgen.py generate --scale-factor 1000 --partitions 64
python tpchgen.py convert --scale-factor 1000 --partitions 64
```
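With multiple partitions, each table ends up as a set of Parquet files, and DataFusion can register a whole directory as a single logical table. A minimal sketch, assuming the convert step writes partitioned output under a per-table directory such as `data/lineitem/` (the directory layout is an assumption; adjust the path to match your run):

```python
from datafusion import SessionContext

# Register a directory of partitioned Parquet files as one logical table.
# The data/lineitem/ layout is an assumption about the convert step's output.
ctx = SessionContext()
ctx.register_parquet("lineitem", "data/lineitem/")

# DataFusion scans all partition files when executing the query.
ctx.sql(
    "SELECT l_returnflag, count(*) FROM lineitem GROUP BY l_returnflag"
).show()
```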