Parquet format is supported. Here are the steps to generate the testing datasets:
Please refer to the scripts in parquet_dataset directory to generate parquet dataset. Note this script relies on the spark-sql-perf and the tpcds-kit package from Databricks.
In tpcds-datagen-parquet.sh, several parameters should be configured according to the system.
spark_sql_perf_jar=/PATH/TO/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar ... --num-executors 14 --executor-cores 8 --conf spark.sql.shuffle.partitions=224 ...
In tpcds_datagen_parquet.scala, the parameters and dirs should be configured as well.
val scaleFactor = "100" // scaleFactor defines the size of the dataset to generate (in GB). val numPartitions = 200 // how many dsdgen partitions to run - number of input tasks. ... val rootDir = "/PATH/TO/TPCDS_PARQUET_PATH" // root directory of location to create data in. val dbgenDir = "/PATH/TO/TPCDS_DBGEN" // location of dbgen
Currently, Gluten with Velox can support Parquet file format and three compression codec including snappy, gzip, zstd.
We provide the test queries in TPC-DS Queries. We also provide a Scala script in Run TPC-DS directory about how to run TPC-DS queries.