<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Apache Arrow Rust Benchmarks
This crate contains benchmarks based on popular public data sets and open source benchmark suites. It makes it easy to run real-world benchmarks for performance and scalability testing, and to compare performance with other Arrow implementations as well as with other query engines.

Currently, only DataFusion benchmarks exist; benchmarks for the arrow, flight, and parquet crates are planned.
## Benchmarks derived from TPC-H
These benchmarks are derived from the [TPC-H][1] benchmark.
Data for this benchmark can be generated using the [tpch-dbgen][2] command-line tool. Run the following commands to
clone the repository and build the source code.
```bash
git clone git@github.com:databricks/tpch-dbgen.git
cd tpch-dbgen
make
export TPCH_DATA=$(pwd)
```
Data can now be generated with the following command. Note that `-s 1` requests scale factor 1, which produces roughly 1 GB of data; increase this value to generate larger data sets.
```bash
./dbgen -vf -s 1
```
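If generation succeeds, `dbgen` writes one pipe-delimited `.tbl` file per TPC-H table into the current directory. As a quick checklist, the loop below simply prints the eight expected file names (a sketch only; it does not inspect the data itself):

```shell
# The eight TPC-H tables that dbgen generates, one .tbl file each.
tables="customer lineitem nation orders part partsupp region supplier"
for t in $tables; do
  echo "$t.tbl"
done
```

Running `ls *.tbl` in the `tpch-dbgen` directory after a successful run should show the same eight files.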
The benchmark can then be run (assuming the data created from `dbgen` is in `/mnt/tpch-dbgen`) with a command such as:
```bash
cargo run --release --bin tpch -- benchmark --iterations 3 --path /mnt/tpch-dbgen --format tbl --query 1 --batch-size 4096
```
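To benchmark every query rather than a single one, the invocations can be generated with a simple loop. TPC-H defines 22 queries, though not all of them may be implemented by the `tpch` binary. This sketch only prints the commands so it is safe to run anywhere; remove the `echo` (and keep the data path pointing at your `dbgen` output) to execute them:

```shell
# Print a benchmark command for each of the 22 TPC-H queries.
# Remove the `echo` to actually execute them.
cmds=$(for query in $(seq 1 22); do
  echo "cargo run --release --bin tpch -- benchmark" \
    "--iterations 3 --path /mnt/tpch-dbgen --format tbl" \
    "--query $query --batch-size 4096"
done)
printf '%s\n' "$cmds"
```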
The benchmark program also supports CSV and Parquet input file formats, and a utility is provided to convert `tbl` files (generated by `dbgen`) to CSV and Parquet.
```bash
cargo run --release --bin tpch -- convert --input /mnt/tpch-dbgen --output /mnt/tpch-parquet --format parquet
```
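The same utility can target CSV as well. The loop below prints the conversion command for both supported output formats (again a sketch; remove the `echo` to run the conversions, and adjust the paths to match your setup):

```shell
# Print a conversion command for each supported target format.
convert_cmds=$(for fmt in csv parquet; do
  echo "cargo run --release --bin tpch -- convert" \
    "--input /mnt/tpch-dbgen --output /mnt/tpch-$fmt --format $fmt"
done)
printf '%s\n' "$convert_cmds"
```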
This utility does not yet support changing the number of partitions during the conversion. Alternatively, the following Docker image can be used to convert `tbl` files to CSV or Parquet.
```bash
docker run -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT
```

Running the image without arguments prints its usage information:

```bash
  -h, --help   Show help message

Subcommand: convert-tpch
  -i, --input <arg>
      --input-format <arg>
  -o, --output <arg>
      --output-format <arg>
  -p, --partitions <arg>
  -h, --help   Show help message
```
Note that it is necessary to mount volumes into the Docker container as appropriate so that the conversion process can access files on the host system. Here is a full example that assumes the data is stored under `/mnt` on the host:
```bash
docker run -v /mnt:/mnt -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT \
convert-tpch \
--input /mnt/tpch/csv \
--input-format tbl \
--output /mnt/tpch/parquet \
--output-format parquet \
--partitions 64
```
## NYC Taxi Benchmark
These benchmarks are based on the [New York Taxi and Limousine Commission][3] data set.
```bash
cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size 4096
```
Example output:
```bash
Running benchmarks with the following options: Opt { debug: false, iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv", file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 7138 ms
Query 'fare_amt_by_passenger' iteration 1 took 7599 ms
Query 'fare_amt_by_passenger' iteration 2 took 7969 ms
```
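Batch size can have a noticeable effect on query time, so it is often worth sweeping a few values. The sketch below prints one invocation per batch size (the path and iteration count mirror the example above; remove the `echo` to run them):

```shell
# Print a nyctaxi benchmark command for several batch sizes.
sweep_cmds=$(for bs in 1024 4096 16384; do
  echo "cargo run --release --bin nyctaxi --" \
    "--iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size $bs"
done)
printf '%s\n' "$sweep_cmds"
```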
[1]: http://www.tpc.org/tpch/
[2]: https://github.com/databricks/tpch-dbgen
[3]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page