TPC-DS

Generating TPC-DS data with Spark

Databricks provides tooling for generating TPC-DS datasets in a Spark cluster.

For local development and testing, we provide a Python script to generate TPC-DS CSV data and convert it into Parquet, using DataFusion.

Download the TPC-DS data generator (tpc-ds-tool.zip) from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and place it in this directory.

Note that the TPC-DS generator no longer compiles with modern GCC versions, so we build and run it in a Docker container. Build the image:

docker build -t datafusion-benchmarks/tpcdsgen .

Run the Docker container in interactive mode.

docker run -it -v `pwd`/data:/data datafusion-benchmarks/tpcdsgen

Inside the container, use tpctools to generate the data:

tpctools generate --benchmark tpcds \
  --scale 100 \
  --partitions 12 \
  --generator-path /DSGen-software-code-3.2.0rc1/tools \
  --output /data

Exit the container:

exit
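
Before converting, it can be useful to sanity-check a raw generator file on the host. The following is a minimal Python sketch and is not part of tpctools or tpcdsgen.py: summarize_dat_file is a hypothetical helper, and it assumes the generator writes pipe-delimited text with a trailing delimiter on every row. It counts rows and collects the distinct field counts, so a file that reports more than one field count is probably truncated or malformed.

```python
import csv


def summarize_dat_file(path, delimiter="|"):
    """Return (row_count, set_of_field_counts) for one delimited text file.

    Assumption: every row ends with a trailing delimiter, so the final
    empty field produced by that delimiter is dropped before counting.
    """
    rows = 0
    widths = set()
    # latin-1 maps every byte to a character, so decoding never fails
    # regardless of the file's real encoding.
    with open(path, newline="", encoding="latin-1") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields:
                continue  # skip blank lines
            if fields[-1] == "":
                fields = fields[:-1]  # drop the field created by the trailing delimiter
            rows += 1
            widths.add(len(fields))
    return rows, widths
```

Calling summarize_dat_file on one generated file (the path depends on how the output is laid out under the data directory) should return a single field count for a well-formed table.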

On the host (not inside the container), use tpcdsgen.py to convert the generated data to Parquet. Paths are hard-coded in the script, so check them before running.

python3 tpcdsgen.py convert --scale-factor 100 --partitions 12
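
Since the conversion step must run on the host, a small wrapper can guard against launching it inside the container by mistake. This is a sketch under stated assumptions, not part of the repository: running_in_docker, build_convert_command, and convert are hypothetical helpers, and the check relies on the /.dockerenv marker file that Docker normally creates inside containers, which is a heuristic rather than a guarantee.

```python
import os
import subprocess
import sys


def running_in_docker(marker="/.dockerenv"):
    """Heuristic check: Docker normally creates this marker file in containers."""
    return os.path.exists(marker)


def build_convert_command(scale_factor, partitions):
    """Build the argument list for the conversion command shown above."""
    return [
        "python3", "tpcdsgen.py", "convert",
        "--scale-factor", str(scale_factor),
        "--partitions", str(partitions),
    ]


def convert(scale_factor, partitions):
    """Run the conversion on the host; refuse to run inside a container."""
    if running_in_docker():
        sys.exit("Refusing to run the conversion inside a container.")
    # check=True raises CalledProcessError if the script exits with an error
    subprocess.run(build_convert_command(scale_factor, partitions), check=True)
```

Calling convert(100, 12) from this directory runs the same command as above, using the same scale factor and partition count that were passed to tpctools.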