This crate contains benchmarks based on popular public data sets and open source benchmark suites, to help with performance and scalability testing of DataFusion.
The benchmarks measure changes to DataFusion itself, rather than its performance against other engines. For competitive benchmarking, DataFusion is included in several popular benchmark suites that compare its performance with other engines. For example:
db-benchmark scripts are in db-benchmark

The easiest way to run benchmarks is the bench.sh script. Usage instructions can be found with:
# show usage
cd ./benchmarks/
./bench.sh
You can create / download the data for these benchmarks using the bench.sh script:
# Create / download all datasets
./bench.sh data

# Create / download a specific dataset (TPCH)
./bench.sh data tpch
Data is placed in the data subdirectory.
# Run benchmark for TPC-H dataset
./bench.sh run tpch

# or for TPC-H dataset scale 10
./bench.sh run tpch10

# To run a specific query, for example Q21
./bench.sh run tpch10 21
Generate the data required for the compile profile helper (TPC-H SF=1):
./bench.sh data compile_profile
Run the benchmark across all default Cargo profiles (dev, release, ci, release-nonlto):
./bench.sh run compile_profile
Limit the run to a single profile:
./bench.sh run compile_profile dev
Or specify a subset of profiles:
./bench.sh run compile_profile dev release
You can also invoke the helper directly if you need to customise arguments further:
./benchmarks/compile_profile.py --profiles dev release --data /path/to/tpch_sf1
The benchmark runs with prefer_hash_join == true by default, which enforces the HASH join algorithm. To run the TPCH benchmarks with a join algorithm other than HASH:
PREFER_HASH_JOIN=false ./bench.sh run tpch
Any DataFusion options that are provided as environment variables are also considered by the benchmarks. For example, the following configuration runs the TPCH benchmark with DataFusion configured to not repartition join keys.
DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
You might want to adjust the results location to avoid overwriting previous results. Environment configuration that was picked up by DataFusion is logged at the info level. To verify that DataFusion picked up your configuration, run the benchmarks with RUST_LOG=info or higher.
git checkout main

# Create the data
./benchmarks/bench.sh data

# Gather baseline data for tpch benchmark
./benchmarks/bench.sh run tpch

# Switch to the branch named mybranch and gather data
git checkout mybranch
./benchmarks/bench.sh run tpch

# Compare results in the two branches:
./bench.sh compare main mybranch
This produces results like:
Comparing main and mybranch
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃         main ┃     mybranch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    2520.52ms │    2795.09ms │  1.11x slower │
│ QQuery 2     │     222.37ms │     216.01ms │     no change │
│ QQuery 3     │     248.41ms │     239.07ms │     no change │
│ QQuery 4     │     144.01ms │     129.28ms │ +1.11x faster │
│ QQuery 5     │     339.54ms │     327.53ms │     no change │
│ QQuery 6     │     147.59ms │     138.73ms │ +1.06x faster │
│ QQuery 7     │     605.72ms │     631.23ms │     no change │
│ QQuery 8     │     326.35ms │     372.12ms │  1.14x slower │
│ QQuery 9     │     579.02ms │     634.73ms │  1.10x slower │
│ QQuery 10    │     403.38ms │     420.39ms │     no change │
│ QQuery 11    │     201.94ms │     212.12ms │  1.05x slower │
│ QQuery 12    │     235.94ms │     254.58ms │  1.08x slower │
│ QQuery 13    │     738.40ms │     789.67ms │  1.07x slower │
│ QQuery 14    │     198.73ms │     206.96ms │     no change │
│ QQuery 15    │     183.32ms │     179.53ms │     no change │
│ QQuery 16    │     168.57ms │     186.43ms │  1.11x slower │
│ QQuery 17    │    2032.57ms │    2108.12ms │     no change │
│ QQuery 18    │    1912.80ms │    2134.82ms │  1.12x slower │
│ QQuery 19    │     391.64ms │     368.53ms │ +1.06x faster │
│ QQuery 20    │     648.22ms │     691.41ms │  1.07x slower │
│ QQuery 21    │     866.25ms │    1020.37ms │  1.18x slower │
│ QQuery 22    │     115.94ms │     117.27ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘
--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃         main ┃     mybranch ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    2182.44ms │    2390.39ms │  1.10x slower │
│ QQuery 2     │     181.16ms │     153.94ms │ +1.18x faster │
│ QQuery 3     │      98.89ms │      95.51ms │     no change │
│ QQuery 4     │      61.43ms │      66.15ms │  1.08x slower │
│ QQuery 5     │     260.20ms │     283.65ms │  1.09x slower │
│ QQuery 6     │      24.24ms │      23.39ms │     no change │
│ QQuery 7     │     545.87ms │     653.34ms │  1.20x slower │
│ QQuery 8     │     147.48ms │     136.00ms │ +1.08x faster │
│ QQuery 9     │     371.53ms │     363.61ms │     no change │
│ QQuery 10    │     197.91ms │     190.37ms │     no change │
│ QQuery 11    │     197.91ms │     183.70ms │ +1.08x faster │
│ QQuery 12    │     100.32ms │     103.08ms │     no change │
│ QQuery 13    │     428.02ms │     440.26ms │     no change │
│ QQuery 14    │      38.50ms │      27.11ms │ +1.42x faster │
│ QQuery 15    │     101.15ms │      63.25ms │ +1.60x faster │
│ QQuery 16    │     171.15ms │     142.44ms │ +1.20x faster │
│ QQuery 17    │    1885.05ms │    1953.58ms │     no change │
│ QQuery 18    │    1549.92ms │    1914.06ms │  1.23x slower │
│ QQuery 19    │     106.53ms │     104.28ms │     no change │
│ QQuery 20    │     532.11ms │     610.62ms │  1.15x slower │
│ QQuery 21    │     723.39ms │     823.34ms │  1.14x slower │
│ QQuery 22    │      91.84ms │      89.89ms │     no change │
└──────────────┴──────────────┴──────────────┴───────────────┘
Assuming data is in the data directory, the tpch benchmark can be run with a command like this:
cargo run --release --bin dfbench -- tpch --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
See the help for more details.
You can enable the mimalloc or snmalloc allocator as a feature by passing it in with --features. For example:
cargo run --release --features "mimalloc" --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
The benchmark program also supports CSV and Parquet input file formats and a utility is provided to convert from tbl (generated by the dbgen utility) to CSV and Parquet.
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-parquet --format parquet
Or if you want to verify and run all the queries in the benchmark, you can just run cargo test.
The TPCH tables generated by the dbgen utility are sorted by their first column (the primary key for most tables; the l_orderkey column for the lineitem table).
To preserve this sorted order information during conversion (useful for benchmarking execution on pre-sorted data) include the --sort flag:
cargo run --release --bin tpch -- convert --input ./data --output /mnt/tpch-sorted-parquet --format parquet --sort
Any dfbench execution with the -o <dir> argument will produce a summary JSON file in the specified directory. This file contains a serialized form of all the runs that happened, along with runtime metadata (number of cores, DataFusion version, etc.).
$ git checkout main
# generate an output file in /tmp/output_main
$ mkdir -p /tmp/output_main
$ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data --format parquet -o /tmp/output_main/tpch.json
# generate an output file in /tmp/output_branch
$ mkdir -p /tmp/output_branch
$ git checkout my_branch
$ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path ./data --format parquet -o /tmp/output_branch/tpch.json
# compare the results:
./compare.py /tmp/output_main/tpch.json /tmp/output_branch/tpch.json
This will produce output like:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ Q1           │   16252.56ms │   16031.82ms │     no change │
│ Q2           │    3994.56ms │    4353.75ms │  1.09x slower │
│ Q3           │    5572.06ms │    5620.27ms │     no change │
│ Q4           │    2144.14ms │    2194.67ms │     no change │
│ Q5           │    7796.93ms │    7646.74ms │     no change │
│ Q6           │    4382.32ms │    4327.16ms │     no change │
│ Q7           │   18702.50ms │   19922.74ms │  1.07x slower │
│ Q8           │    7383.74ms │    7616.21ms │     no change │
│ Q9           │   13855.17ms │   14408.42ms │     no change │
│ Q10          │    7446.05ms │    8030.00ms │  1.08x slower │
│ Q11          │    3414.81ms │    3850.34ms │  1.13x slower │
│ Q12          │    3027.16ms │    3085.89ms │     no change │
│ Q13          │   18859.06ms │   18627.02ms │     no change │
│ Q14          │    4157.91ms │    4140.22ms │     no change │
│ Q15          │    5293.05ms │    5369.17ms │     no change │
│ Q16          │    6512.42ms │    3011.58ms │ +2.16x faster │
│ Q17          │   86253.33ms │   76036.06ms │ +1.13x faster │
│ Q18          │   45101.99ms │   49717.76ms │  1.10x slower │
│ Q19          │    7323.15ms │    7409.85ms │     no change │
│ Q20          │   19902.39ms │   20965.94ms │  1.05x slower │
│ Q21          │   22040.06ms │   23184.84ms │  1.05x slower │
│ Q22          │    2011.87ms │    2143.62ms │  1.07x slower │
└──────────────┴──────────────┴──────────────┴───────────────┘
The dfbench program contains subcommands to run the various benchmarks. When benchmarking, it should always be built in release mode using --release.
Full help for each benchmark can be found in the relevant subcommand. For example, to get help for tpch, run:
cargo run --release --bin dfbench -- tpch --help
...
dfbench-tpch 45.0.0
Run the tpch benchmark.

This benchmarks is derived from the [TPC-H][1] version
[2.17.1]. The data and answers are generated using `tpch-gen` from [2].

[1]: http://www.tpc.org/tpch/
[2]: https://github.com/databricks/tpch-dbgen.git,
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

USAGE:
    dfbench tpch [FLAGS] [OPTIONS] --path <path>

FLAGS:
    -d, --debug                 Activate debug mode to see more details
    -S, --disable-statistics    Whether to disable collection of statistics (and cost based optimizations) or not
    -h, --help                  Prints help information
...
The mem_profile program wraps benchmark execution to measure memory usage statistics, such as peak RSS. It runs each benchmark query in a separate subprocess, capturing the child process’s stdout to print structured output.
The subcommands supported by mem_profile are a subset of those in dfbench. Currently supported benchmarks include: Clickbench, H2o, Imdb, SortTpch, and Tpch.
Before running benchmarks, mem_profile automatically compiles the benchmark binary (dfbench) using cargo build. Note that the build profile used for dfbench is not tied to the profile used for running mem_profile itself. We can explicitly specify the desired build profile using the --bench-profile option (e.g. release-nonlto). By prebuilding the binary and running each query in a separate process, we can ensure accurate memory statistics.
Currently, mem_profile only supports mimalloc as the memory allocator, since it relies on mimalloc's API to collect memory statistics.
Because it runs the compiled binary directly from the target directory, make sure your working directory is the top-level datafusion/ directory, where the target/ directory is also located.
The benchmark subcommand (e.g., tpch) and all following arguments are passed directly to dfbench. Be sure to specify --bench-profile before the benchmark subcommand.
Example:
datafusion$ cargo run --profile release-nonlto --bin mem_profile -- --bench-profile release-nonlto tpch --path benchmarks/data/tpch_sf1 --partitions 4 --format parquet
Example Output:
Query    Time (ms)   Peak RSS   Peak Commit   Major Page Faults
----------------------------------------------------------------
1           503.42   283.4 MB        3.0 GB                   0
2           431.09   240.7 MB        3.0 GB                   0
3           594.28   350.1 MB        3.0 GB                   0
4           468.90   462.4 MB        3.0 GB                   0
5           653.58   385.4 MB        3.0 GB                   0
6           296.79   247.3 MB        2.0 GB                   0
7           662.32   652.4 MB        3.0 GB                   0
8           702.48   396.0 MB        3.0 GB                   0
9           774.21   611.5 MB        3.0 GB                   0
10          733.62   397.9 MB        3.0 GB                   0
11          271.71   209.6 MB        3.0 GB                   0
12          512.60   212.5 MB        2.0 GB                   0
13          507.83   381.5 MB        2.0 GB                   0
14          420.89   313.5 MB        3.0 GB                   0
15          539.97   288.0 MB        2.0 GB                   0
16          370.91   229.8 MB        3.0 GB                   0
17          758.33   467.0 MB        2.0 GB                   0
18         1112.32   638.9 MB        3.0 GB                   0
19          712.72   280.9 MB        2.0 GB                   0
20          620.64   402.9 MB        2.9 GB                   0
21          971.63   388.9 MB        2.9 GB                   0
22          404.50   164.8 MB        2.0 GB                   0
When running benchmarks, mem_profile collects several memory-related statistics using the mimalloc API:
Peak RSS (Resident Set Size): The maximum amount of physical memory used by the process. This is a process-level metric collected via OS-specific mechanisms and is not mimalloc-specific.
Peak Commit: The peak amount of memory committed by the allocator (i.e., total virtual memory reserved). This is mimalloc-specific. It gives a more allocator-aware view of memory usage than RSS.
Major Page Faults: The number of major page faults triggered during execution. This metric is obtained from the operating system and is not mimalloc-specific.
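For reference, here is a hedged sketch of reading such counters through mimalloc's process-info API (this assumes the libmimalloc-sys crate, built with its extended feature, exposes mi_process_info; the exact binding used by mem_profile may differ):

// Read process statistics via mimalloc's C API.
// Assumes libmimalloc-sys with the "extended" feature enabled.
use libmimalloc_sys::mi_process_info;

/// Returns (peak_rss_bytes, peak_commit_bytes, major_page_faults)
fn peak_memory_stats() -> (usize, usize, usize) {
    let (mut elapsed_msecs, mut user_msecs, mut system_msecs) = (0, 0, 0);
    let (mut current_rss, mut peak_rss) = (0, 0);
    let (mut current_commit, mut peak_commit, mut page_faults) = (0, 0, 0);
    unsafe {
        mi_process_info(
            &mut elapsed_msecs,
            &mut user_msecs,
            &mut system_msecs,
            &mut current_rss,
            &mut peak_rss,
            &mut current_commit,
            &mut peak_commit,
            &mut page_faults,
        );
    }
    (peak_rss, peak_commit, page_faults)
}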
If you want to create or download the data with Rust as part of running the benchmark, see the next section on adding a benchmark subcommand and add code to create or download data as part of its run function.
If you want to create or download the data with shell commands, in benchmarks/bench.sh, define a new function named data_[your benchmark name] and call that function in the data command case as a subcommand case named for your benchmark. Also call the new function in the data all case.
In benchmarks/bench.sh, define a new function named run_[your benchmark name] following the example of existing run_* functions. Call that function in the run command case as a subcommand case named for your benchmark. Also call the new function in the run all case. Add documentation for your benchmark to the text in the usage function.
In benchmarks/src/bin/dfbench.rs, add a dfbench subcommand for your benchmark by:
- Adding a new variant to the Options enum
- Adding code to the main function to run the benchmark based on the new variant, similar to the other variants
- Adding the new module to the use datafusion_benchmarks::{} statement

In benchmarks/src/lib.rs, declare the new module you imported in dfbench.rs and create the corresponding file(s) for the module's code.
In the module, following the pattern of other existing benchmarks, define a RunOpt struct with:
- A doc comment that will become the --help output for the subcommand
- A run method that the dfbench main function will call
- A --path structopt field that the bench.sh script should use with ${DATA_DIR} to define where the input data should be stored
- An --output structopt field that the bench.sh script should use with "${RESULTS_FILE}" to define where the benchmark's results should be stored

Use the --path structopt field defined on the RunOpt struct to know where to store or look for the data. Generate the data using whatever Rust code you'd like, before the code that will be measuring an operation. A sketch of such a struct is shown below.
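For illustration, here is a minimal, hypothetical sketch of such a struct (field names, attributes, and the exact run signature are assumptions; follow an existing benchmark module for the real pattern):

use std::path::PathBuf;
use structopt::StructOpt;

/// Run the my_bench benchmark (this doc comment becomes the --help text)
#[derive(Debug, StructOpt)]
pub struct RunOpt {
    /// Path to data files (bench.sh passes ${DATA_DIR} here)
    #[structopt(parse(from_os_str), short = "p", long = "path")]
    path: PathBuf,

    /// If present, write results json here (bench.sh passes "${RESULTS_FILE}")
    #[structopt(parse(from_os_str), short = "o", long = "output")]
    output_path: Option<PathBuf>,
}

impl RunOpt {
    /// Called by the dfbench main function for this subcommand
    pub async fn run(self) -> datafusion::error::Result<()> {
        // Create or download data under `self.path` here, before any
        // timing starts, then run and measure the operation itself.
        Ok(())
    }
}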
Your benchmark should create and use an instance of BenchmarkRun defined in benchmarks/src/util/run.rs as follows:
- Call its start_new_case method with a string that will appear in the "Query" column of the compare output.
- Use write_iter to record elapsed times for the behavior you're benchmarking.
- When the benchmark run completes, call BenchmarkRun's maybe_write_json method, giving it the value of the --output structopt field on RunOpt, as sketched below.
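A hedged sketch of that flow (method signatures are approximate; see benchmarks/src/util/run.rs for the actual API):

use std::path::PathBuf;
use std::time::Instant;
use datafusion::error::Result;
use datafusion_benchmarks::util::BenchmarkRun;

async fn run_all_queries(output_path: Option<&PathBuf>) -> Result<()> {
    let mut run = BenchmarkRun::new();
    for query in 1..=22 {
        // This string appears in the "Query" column of compare.py output
        run.start_new_case(&format!("Query {query}"));
        for _ in 0..3 {
            let start = Instant::now();
            // ... execute the operation being measured ...
            let row_count = 0; // rows produced by the measured operation
            run.write_iter(start.elapsed(), row_count);
        }
    }
    // Writes the JSON summary only if --output was provided
    run.maybe_write_json(output_path)?;
    Ok(())
}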
The output of dfbench help includes a description of each benchmark, which is reproduced here for convenience.

The ClickBench[1] benchmarks are widely cited in the industry and focus on grouping / aggregation / filtering. This runner uses the scripts and queries from [2].
Test performance of parquet filter pushdown
The queries are executed on a synthetic dataset generated during the benchmark execution and designed to simulate web server access logs.
Example
dfbench parquet-filter --path ./data --scale-factor 1.0
generates the synthetic dataset at ./data/logs.parquet. The size of the dataset can be controlled through the scale_factor (with the default value of 1.0 generating a ~1GB parquet file).
For each filter, the benchmark runs the query using different ParquetScanOptions settings, as sketched below.
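These scan options correspond to DataFusion session settings; here is a hedged sketch of toggling them outside the benchmark (the configuration key names are taken from DataFusion's configuration documentation; the benchmark's internal ParquetScanOptions struct may wire them differently):

use datafusion::prelude::*;

// Build a SessionContext with the three parquet scan toggles varied
// by the benchmark: filter pushdown, filter reordering, page index.
fn context_with_scan_options(pushdown: bool, reorder: bool, page_index: bool) -> SessionContext {
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.pushdown_filters", pushdown)
        .set_bool("datafusion.execution.parquet.reorder_filters", reorder)
        .set_bool("datafusion.execution.parquet.enable_page_index", page_index);
    SessionContext::new_with_config(config)
}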
Example output:
Running benchmarks with the following options: Opt { debug: false, iterations: 3, partitions: 2, path: "./data", batch_size: 8192, scale_factor: 1.0 }
Generated test dataset with 10699521 rows
Executing with filter 'request_method = Utf8("GET")'
Using scan options ParquetScanOptions { pushdown_filters: false, reorder_predicates: false, enable_page_index: false }
Iteration 0 returned 10699521 rows in 1303 ms
Iteration 1 returned 10699521 rows in 1288 ms
Iteration 2 returned 10699521 rows in 1266 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: true, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1970 ms
Iteration 1 returned 1781686 rows in 2002 ms
Iteration 2 returned 1781686 rows in 1988 ms
Using scan options ParquetScanOptions { pushdown_filters: true, reorder_predicates: false, enable_page_index: true }
Iteration 0 returned 1781686 rows in 1940 ms
Iteration 1 returned 1781686 rows in 1986 ms
Iteration 2 returned 1781686 rows in 1947 ms
...
Test performance of sorting large datasets
This test sorts a synthetic dataset generated during the benchmark execution, designed to simulate sorting web server access logs. Such sorting is often done during data transformation steps.
The tests sort the entire dataset using several different sort orders.
Test performance of end-to-end sort SQL queries. (While the Sort benchmark focuses on a single sort executor, this benchmark tests how sorting is executed across multiple CPU cores by sorting an entire relational table.)
The sort integration benchmark runs whole-table sort queries on the TPC-H lineitem table with different characteristics, for example different numbers of sort keys, different sort key cardinalities, and different numbers of payload columns.
If the TPCH tables have been converted as sorted on their first column (see Sorted Conversion), you can use the --sorted flag to indicate that the input data is pre-sorted, allowing DataFusion to leverage that order during query execution.
Additionally, an optional --limit flag is available for the sort benchmark. When specified, this flag appends a LIMIT n clause to the SQL query, effectively converting the query into a TopK query. Combining the --sorted and --limit options enables benchmarking of TopK queries on pre-sorted inputs.
See sort_tpch.rs for more details.
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
Or run via the bench.sh script:

./bench.sh run sort_tpch
In addition, topk_tpch is available from the bench.sh script:
./bench.sh run topk_tpch
Run Join Order Benchmark (JOB) on IMDB dataset.
The Internet Movie Database (IMDB) dataset contains real-world movie data. Unlike synthetic datasets like TPCH, which assume uniform data distribution and uncorrelated columns, the IMDB dataset includes skewed data and correlated columns (which are common in real-world datasets), making it more suitable for testing query optimizers, particularly for cardinality estimation.
This benchmark is derived from Join Order Benchmark.
See the paper How Good Are Query Optimizers, Really? for more details.
Run the tpch benchmark.
This benchmark is derived from TPC-H version 2.17.1. The data and answers are generated using tpch-gen from the databricks/tpch-dbgen repository.
Run the benchmark for aggregations with limited memory.
When the memory limit is exceeded, the aggregation intermediate results will be spilled to disk, and finally read back with sort-merge.
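As a rough sketch of the mechanism (API names are from recent DataFusion releases and are an assumption here; the benchmark's own setup may differ), a memory-limited context that forces spilling can be built like this:

use std::sync::Arc;
use datafusion::error::Result;
use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::*;

// Cap DataFusion's memory pool at `limit_bytes` so that memory-intensive
// aggregations spill intermediate results to disk.
fn memory_limited_context(limit_bytes: usize) -> Result<SessionContext> {
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(limit_bytes)))
        .build_arc()?;
    Ok(SessionContext::new_with_config_rt(SessionConfig::new(), runtime))
}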
External aggregation benchmarks run several aggregation queries with different memory limits on the TPC-H lineitem table. Queries can be found in external_aggr.rs.
This benchmark is inspired by DuckDB's external aggregation paper, specifically Section VI.
# Under 'benchmarks/' directory
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
Or run via the bench.sh script:

./bench.sh data external_aggr
./bench.sh run external_aggr
The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.
There are three options for generating data for h2o benchmarks: small, medium, and big. The data is generated in the data directory.
./bench.sh data h2o_small
./bench.sh data h2o_medium
./bench.sh data h2o_big
There are three options for running h2o benchmarks: small, medium, and big.
./bench.sh run h2o_small
./bench.sh run h2o_medium
./bench.sh run h2o_big
For example, to run query 1 with the small data generated above:
cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
There are likewise three options for generating data for the h2o join benchmarks: small, medium, and big. The data is generated in the data directory.
Here is an example that generates the small dataset and runs the benchmark. To run another dataset size, adjust the command accordingly.
# Generate small data (4 table files, the largest is 1e7 rows)
./bench.sh data h2o_small_join

# Run the benchmark
./bench.sh run h2o_small_join
To run a specific query with specific join data paths, provide all 4 table files as the data paths.
For example, to run query 1 with the small data generated above:
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
This benchmark extends the h2o benchmark suite to evaluate window function performance. H2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: small, medium, and big.
Here is an example that generates the small dataset and runs the benchmark. To run another dataset size, adjust the command accordingly.
# Generate small data
./bench.sh data h2o_small_window

# Run the benchmark
./bench.sh run h2o_small_window
To run a specific query with specific window data paths, provide all 4 table files as the data paths (the same as the h2o join dataset).
For example, to run query 1 with the small data generated above:
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
This benchmark focuses on the performance of queries with nested loop joins, minimizing other overheads such as scanning data sources or evaluating predicates.
Different queries are included to test nested loop joins under various workloads.
# No need to generate data: this benchmark uses table function `range()` as the data source
./bench.sh run nlj
This benchmark focuses on the performance of queries with hash joins, minimizing other overheads such as scanning data sources or evaluating predicates.
Several queries are included to test hash joins under various workloads.
# No need to generate data: this benchmark uses table function `range()` as the data source
./bench.sh run hj
Test performance of cancelling queries.
Queries in DataFusion should stop executing “quickly” after they are cancelled (the output stream is dropped).
The queries are executed on a synthetic dataset generated during the benchmark execution that is an anonymized version of a real-world data set.
The query is an anonymized version of a real-world query; the test starts the query, then cancels it and reports how long it takes for the runtime to fully exit.
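In DataFusion, cancelling a query amounts to dropping its output stream; here is a hedged sketch of the pattern (the real harness additionally measures how long the runtime takes to shut down):

use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

// Start a query, pull one batch, then cancel it by dropping the stream;
// the remaining work should wind down quickly.
async fn start_then_cancel(ctx: &SessionContext, sql: &str) -> Result<()> {
    let df = ctx.sql(sql).await?;
    let mut stream = df.execute_stream().await?;
    let _first = stream.next().await; // the query is now executing
    drop(stream); // cancellation
    Ok(())
}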
Example output:
Using 7 files found on disk
Starting to load data into in-memory object store
Done loading data into in-memory object store
in main, sleeping
Starting spawned
Creating logical plan...
Creating physical plan...
Executing physical plan...
Getting results...
cancelling thread
done dropping runtime in 83.531417ms