This crate contains benchmarks based on popular public data sets and open source benchmark suites. It makes it easy to run real-world benchmarks for performance and scalability testing, and to compare performance with other Arrow implementations and with other query engines.
Currently, only DataFusion benchmarks exist, but the plan is to add benchmarks for the arrow, flight, and parquet crates as well.
These benchmarks are derived from the TPC-H benchmark.
Data for this benchmark can be generated using the tpch-dbgen command-line tool. Run the following commands to clone the repository and build the source code.
git clone git@github.com:databricks/tpch-dbgen.git
cd tpch-dbgen
make
export TPCH_DATA=$(pwd)
Data can now be generated with the following command. Note that -s 1 means use Scale Factor 1, or ~1 GB of data. This value can be increased to generate larger data sets.
./dbgen -vf -s 1
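Before running any benchmark it can help to confirm that data generation finished. The sketch below assumes dbgen wrote one .tbl file for each of the eight TPC-H tables into a single directory. The helper name check_tpch_tables is hypothetical and is not part of dbgen or of this crate.

```shell
# Check that a directory holds all eight TPC-H tables as non-empty <name>.tbl files.
# check_tpch_tables is a hypothetical helper name used only for this sketch.
check_tpch_tables() {
    dir="$1"
    missing=0
    for table in customer lineitem nation orders part partsupp region supplier; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$table.tbl" ]; then
            echo "missing or empty: $dir/$table.tbl" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (assumes TPCH_DATA was exported as in the build step):
#   check_tpch_tables "$TPCH_DATA" && echo "all tables present"
```

The function returns a non-zero status if any table file is absent or empty, so it can guard a benchmark run in a script.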
The benchmark can then be run (assuming the data created from dbgen is in /mnt/tpch-dbgen) with a command such as:
cargo run --release --bin tpch -- benchmark --iterations 3 --path /mnt/tpch-dbgen --format tbl --query 1 --batch-size 4096
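To run more than one query in sequence, the command can be wrapped in a small loop. This is only a sketch: the function name run_tpch_queries and the CARGO, ITERATIONS, and TPCH_PATH variables are hypothetical conveniences, and which query numbers the tpch binary accepts beyond the one shown here is an assumption to verify against the binary itself.

```shell
# Run the tpch benchmark once per query number given as an argument.
# run_tpch_queries, CARGO, ITERATIONS and TPCH_PATH are hypothetical names for this sketch;
# the cargo flags are the ones used in the command above.
run_tpch_queries() {
    for q in "$@"; do
        echo "== TPC-H query $q =="
        # CARGO can be overridden (for example with "echo") to preview the commands
        "${CARGO:-cargo}" run --release --bin tpch -- benchmark \
            --iterations "${ITERATIONS:-3}" \
            --path "${TPCH_PATH:-/mnt/tpch-dbgen}" \
            --format tbl --query "$q" --batch-size 4096 || return 1
    done
}

# Example: preview the commands without running them
#   CARGO=echo run_tpch_queries 1
```

Stopping at the first failing query (the `|| return 1`) keeps a broken run from being mistaken for a complete one.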
You can enable the simd feature (to use SIMD instructions) and/or either the mimalloc or snmalloc feature (to use the corresponding allocator) by passing them in with --features:
cargo run --release --features "simd mimalloc" --bin tpch -- benchmark --iterations 3 --path /mnt/tpch-dbgen --format tbl --query 1 --batch-size 4096
The benchmark program also supports CSV and Parquet input file formats, and a utility is provided to convert from tbl (generated by the dbgen utility) to CSV and Parquet.
cargo run --release --bin tpch -- convert --input /mnt/tpch-dbgen --output /mnt/tpch-parquet --format parquet
This utility does not yet provide support for changing the number of partitions when performing the conversion. Another option is to use the following Docker image to perform the conversion from tbl files to CSV or Parquet.
docker run -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT
  -h, --help   Show help message

Subcommand: convert-tpch
  -i, --input  <arg>
      --input-format  <arg>
  -o, --output  <arg>
      --output-format  <arg>
  -p, --partitions  <arg>
  -h, --help   Show help message
Note that it is necessary to mount volumes into the Docker container as appropriate so that the file conversion process can access files on the host system.
Here is a full example that assumes that data is stored in the /mnt path on the host system.
docker run -v /mnt:/mnt -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT \
  convert-tpch \
  --input /mnt/tpch/csv \
  --input-format tbl \
  --output /mnt/tpch/parquet \
  --output-format parquet \
  --partitions 64
These benchmarks are based on the New York Taxi and Limousine Commission data set.
cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size 4096
Example output:
Running benchmarks with the following options: Opt { debug: false, iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv", file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 7138 ms
Query 'fare_amt_by_passenger' iteration 1 took 7599 ms
Query 'fare_amt_by_passenger' iteration 2 took 7969 ms
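When comparing runs, it is often more convenient to reduce the per-iteration timings to a single summary line. The sketch below assumes each iteration is reported on its own line in the form "Query '<name>' iteration <n> took <ms> ms", as in the example output. The helper name summarize_iterations is hypothetical, and the output format it prints is not something the benchmark itself produces.

```shell
# Summarise "Query '<name>' iteration <n> took <ms> ms" lines read from stdin.
# summarize_iterations is a hypothetical helper name used only for this sketch.
summarize_iterations() {
    awk '
        /^Query .* iteration [0-9]+ took [0-9]+ ms/ {
            ms = $(NF - 1) + 0          # second-to-last field is the duration in ms
            if (n == 0 || ms < min) min = ms
            if (n == 0 || ms > max) max = ms
            sum += ms
            n++
        }
        END {
            # Print nothing if no timing lines were seen
            if (n > 0) printf "runs=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
        }
    '
}

# Example:
#   cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv \
#       --format csv --batch-size 4096 | summarize_iterations
```

For the three timings in the example output, this prints `runs=3 min=7138 max=7969 mean=7568.7`.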