Databricks provides tooling for generating TPC-DS datasets in a Spark cluster:
https://github.com/databricks/spark-sql-perf
For local development and testing, we provide a Python script to generate TPC-DS CSV data and convert it into Parquet, using DataFusion.
Download the TPC-DS data generator (tpc-ds-tool.zip) from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and place it in this directory.
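Before building the image, it can help to confirm that the archive is actually in place. The following is a minimal sketch, assuming the download keeps the name tpc-ds-tool.zip given above. The function name find_generator_zip is hypothetical and not part of this repository.

```python
from pathlib import Path
import zipfile


def find_generator_zip(directory: str = ".") -> Path:
    """Return the path to tpc-ds-tool.zip in `directory`.

    Raises FileNotFoundError if the file is missing and ValueError if it
    is present but is not a readable zip archive (for example, a partial
    download or a saved HTML page).
    """
    archive = Path(directory) / "tpc-ds-tool.zip"
    if not archive.is_file():
        raise FileNotFoundError(f"{archive} not found; download it from tpc.org first")
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} exists but is not a valid zip archive")
    return archive
```

Calling find_generator_zip() from this directory either returns the archive path or raises an error explaining what is missing.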
Note that the TPC-DS generator no longer compiles with modern gcc versions, so we build and run it inside a Docker container. Build the image:
docker build -t datafusion-benchmarks/tpcdsgen .
Run the Docker container in interactive mode, mounting the local data directory at /data:
docker run -it -v "$(pwd)/data":/data datafusion-benchmarks/tpcdsgen
Use tpctools to generate the data inside the container:
tpctools generate --benchmark tpcds \
  --scale 100 \
  --partitions 12 \
  --generator-path /DSGen-software-code-3.2.0rc1/tools \
  --output /data
Exit the container:
exit
Use tpcdsgen.py to convert the generated data to Parquet.
Run this on the host, not inside the container.
Input and output paths are hard-coded in the script.
python3 tpcdsgen.py convert --scale-factor 100 --partitions 12
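The generate and convert commands above use the same scale factor (100) and partition count (12). It seems reasonable to assume these need to match, since the conversion step reads what the generator wrote. The sketch below derives both command lines from one shared settings object so the values cannot drift apart. The names TpcdsSettings, generate_command, and convert_command are hypothetical helpers, and only flags that appear in the commands above are used.

```python
from dataclasses import dataclass
from typing import List
import shlex


@dataclass(frozen=True)
class TpcdsSettings:
    # Defaults mirror the values used in the commands above.
    scale: int = 100
    partitions: int = 12
    generator_path: str = "/DSGen-software-code-3.2.0rc1/tools"
    output: str = "/data"


def generate_command(s: TpcdsSettings) -> List[str]:
    """Argument list for the generation step (run inside the container)."""
    return [
        "tpctools", "generate",
        "--benchmark", "tpcds",
        "--scale", str(s.scale),
        "--partitions", str(s.partitions),
        "--generator-path", s.generator_path,
        "--output", s.output,
    ]


def convert_command(s: TpcdsSettings) -> List[str]:
    """Argument list for the conversion step (run on the host)."""
    return [
        "python3", "tpcdsgen.py", "convert",
        "--scale-factor", str(s.scale),
        "--partitions", str(s.partitions),
    ]


settings = TpcdsSettings()
# Print both commands as copy-pasteable shell lines; nothing is executed here.
print(shlex.join(generate_command(settings)))
print(shlex.join(convert_command(settings)))
```

To try a smaller dataset, construct the settings with a different scale, for example TpcdsSettings(scale=1), and both printed commands change together.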