| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Apache Arrow Rust Benchmarks |
| |
| This crate contains benchmarks based on popular public data sets and open source benchmark suites. They make it easy to |
| run real-world workloads for performance and scalability testing, and to compare performance against other Arrow |
| implementations and other query engines. |
| |
| Currently, only DataFusion benchmarks exist, but the plan is to add benchmarks for the arrow, flight, and parquet |
| crates as well. |
| |
| ## Benchmark derived from TPC-H |
| |
| These benchmarks are derived from the [TPC-H][1] benchmark. |
| |
| Data for this benchmark can be generated using the [tpch-dbgen][2] command-line tool. Run the following commands to |
| clone the repository and build the source code. |
| |
| ```bash |
| git clone https://github.com/databricks/tpch-dbgen.git |
| cd tpch-dbgen |
| make |
| export TPCH_DATA="$(pwd)" |
| ``` |
| |
| Data can now be generated with the following command. The `-s 1` option selects scale factor 1, which produces roughly |
| 1 GB of data. Increase this value to generate larger data sets. |
| |
| ```bash |
| ./dbgen -vf -s 1 |
| ``` |
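
Before running the benchmark, it can help to confirm that `dbgen` produced a `.tbl` file for each of the eight TPC-H tables. The following is a minimal sketch, not part of this crate or of `dbgen`; the function name `missing_tpch_tables` is hypothetical, and it assumes the default `dbgen` behavior of writing one `<table>.tbl` file per table into a single directory.

```shell
# Hypothetical helper: print the name of every TPC-H table whose
# .tbl file is missing or empty in the given directory (default: ".").
missing_tpch_tables() {
  local dir="${1:-.}"
  local table
  for table in customer lineitem nation orders part partsupp region supplier; do
    # -s is true only when the file exists and is non-empty
    [ -s "$dir/$table.tbl" ] || echo "$table"
  done
}
```

For example, `missing_tpch_tables "$TPCH_DATA"` prints nothing when all eight files are present and non-empty, and otherwise lists the tables that still need to be generated.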
| |
| The benchmark can then be run (assuming the data created from `dbgen` is in `/mnt/tpch-dbgen`) with a command such as: |
| |
| ```bash |
| cargo run --release --bin tpch -- benchmark --iterations 3 --path /mnt/tpch-dbgen --format tbl --query 1 --batch-size 4096 |
| ``` |
| |
| The benchmark program also supports CSV and Parquet input files. A utility is provided to convert the `tbl` files |
| generated by the `dbgen` utility to CSV or Parquet. |
| |
| ```bash |
| cargo run --release --bin tpch -- convert --input /mnt/tpch-dbgen --output /mnt/tpch-parquet --format parquet |
| ``` |
| |
| This utility does not yet support changing the number of partitions during conversion. Alternatively, the following |
| Docker image can convert `tbl` files to CSV or Parquet. Running it without arguments prints its usage. |
| |
| ```bash |
| docker run -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT |
|   -h, --help   Show help message |
| |
| Subcommand: convert-tpch |
|   -i, --input  <arg> |
|       --input-format  <arg> |
|   -o, --output  <arg> |
|       --output-format  <arg> |
|   -p, --partitions  <arg> |
|   -h, --help   Show help message |
| ``` |
| |
| Volumes must be mounted into the Docker container so that the file conversion process can access files on the host |
| system. |
| |
| Here is a full example that assumes that data is stored in the `/mnt` path on the host system. |
| |
| ```bash |
| docker run -v /mnt:/mnt -it ballistacompute/spark-benchmarks:0.4.0-SNAPSHOT \ |
|   convert-tpch \ |
|   --input /mnt/tpch/csv \ |
|   --input-format tbl \ |
|   --output /mnt/tpch/parquet \ |
|   --output-format parquet \ |
|   --partitions 64 |
| ``` |
| |
| ## NYC Taxi Benchmark |
| |
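| These benchmarks are based on the [New York Taxi and Limousine Commission][3] data set. Assuming the CSV files are in |
| `/mnt/nyctaxi/csv`, the benchmark can be run with a command such as: |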
| |
| ```bash |
| cargo run --release --bin nyctaxi -- --iterations 3 --path /mnt/nyctaxi/csv --format csv --batch-size 4096 |
| ``` |
| |
| Example output: |
| |
| ```text |
| Running benchmarks with the following options: Opt { debug: false, iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv", file_format: "csv" } |
| Executing 'fare_amt_by_passenger' |
| Query 'fare_amt_by_passenger' iteration 0 took 7138 ms |
| Query 'fare_amt_by_passenger' iteration 1 took 7599 ms |
| Query 'fare_amt_by_passenger' iteration 2 took 7969 ms |
| ``` |
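
Per-iteration timings in output of this shape can be averaged with standard tools. The following is a minimal sketch rather than a feature of the benchmark binary. The function name `summarize_timings` is hypothetical, and it assumes that each timing line keeps the `Query '<name>' iteration <n> took <ms> ms` layout shown above and that query names contain no spaces.

```shell
# Hypothetical helper: read benchmark output on stdin and print the
# mean time per query. Lines that are not timing lines are ignored.
summarize_timings() {
  awk '
    $1 == "Query" && $NF == "ms" {
      name = $2
      gsub(/\047/, "", name)    # strip the surrounding single quotes
      total[name] += $(NF - 1)  # elapsed milliseconds
      count[name]++
    }
    END {
      for (name in total)
        printf "%s mean %.1f ms over %d iterations\n", name, total[name] / count[name], count[name]
    }
  '
}
```

Assuming the timings are written to standard output, the benchmark command can be piped into `summarize_timings`. For the three iterations in the example output, it prints `fare_amt_by_passenger mean 7568.7 ms over 3 iterations`.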
| |
| [1]: http://www.tpc.org/tpch/ |
| [2]: https://github.com/databricks/tpch-dbgen |
| [3]: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |