TPC-DS

Generating TPC-DS data with Spark

Databricks provides tooling for generating TPC-DS datasets in a Spark cluster:

https://github.com/databricks/spark-sql-perf

Generating TPC-DS data without Spark

For local development and testing, we provide a Python script to generate TPC-DS CSV data and convert it into Parquet, using DataFusion.

Download the TPC-DS data generator (tpc-ds-tool.zip) from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and place it in this directory.
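Since the image is built from this directory, it can help to confirm the archive is in place before building. The following is a minimal sketch using only the Python standard library. The helper name generator_archive_ready is hypothetical, the file name is taken from the download step above, and the assumption that the Docker build needs the archive in this directory is ours.

```python
import zipfile
from pathlib import Path

# File name taken from the download step above; adjust if your download is named differently.
ARCHIVE_NAME = "tpc-ds-tool.zip"


def generator_archive_ready(directory: str = ".") -> bool:
    """Return True if the TPC-DS tool archive exists in `directory` and is a readable zip."""
    archive = Path(directory) / ARCHIVE_NAME
    # is_file() guards against a missing file; is_zipfile() rejects non-zip content
    return archive.is_file() and zipfile.is_zipfile(archive)


if __name__ == "__main__":
    if generator_archive_ready():
        print(f"{ARCHIVE_NAME} found; ready for docker build")
    else:
        print(f"{ARCHIVE_NAME} is missing or is not a valid zip archive")
```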

Note that the TPC-DS generator no longer compiles with modern gcc versions, so we build and run it in a Docker container.

Build Image

docker build -t datafusion-benchmarks/tpcdsgen .

Generate Data

Run the Docker container in interactive mode.

docker run -it -v "$(pwd)/data:/data" datafusion-benchmarks/tpcdsgen

Inside the container, use tpctools to generate the data

tpctools generate --benchmark tpcds \
  --scale 100 \
  --partitions 12 \
  --generator-path /DSGen-software-code-3.2.0rc1/tools \
  --output /data

Exit the container

exit

Convert the CSV data to Parquet

Use tpcdsgen.py to convert the data. Run it on the host, not inside the container.

Note that the input and output paths are hard-coded in the script.

python3 tpcdsgen.py convert --scale-factor 100 --partitions 12
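The generate and convert steps are given the same scale factor and partition count, under slightly different flag names (--scale versus --scale-factor). One way to keep the two invocations in sync is to build both argument lists from a single pair of values, as in the minimal sketch below. The helper name build_commands is hypothetical; the flags and paths are copied from the commands in this README, and the assumption that the two steps should agree is ours rather than something the script is known to enforce.

```python
import shlex
from typing import List, Tuple

# Generator path copied from the tpctools command in this README.
GENERATOR_PATH = "/DSGen-software-code-3.2.0rc1/tools"


def build_commands(scale: int, partitions: int) -> Tuple[List[str], List[str]]:
    """Build the generate (in-container) and convert (host) argument lists."""
    if scale < 1 or partitions < 1:
        raise ValueError("scale and partitions must be positive integers")
    generate = [
        "tpctools", "generate",
        "--benchmark", "tpcds",
        "--scale", str(scale),
        "--partitions", str(partitions),
        "--generator-path", GENERATOR_PATH,
        "--output", "/data",
    ]
    convert = [
        "python3", "tpcdsgen.py", "convert",
        "--scale-factor", str(scale),
        "--partitions", str(partitions),
    ]
    return generate, convert


if __name__ == "__main__":
    generate, convert = build_commands(scale=100, partitions=12)
    # The first command is meant for the container shell, the second for the host.
    print(shlex.join(generate))
    print(shlex.join(convert))
```

Running the sketch only prints the two commands; it does not execute them.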