This is an overview of the benchmark workflow and the scripts involved. The workflow is as follows:
1) Create base benchmark schema and load data into these tables.
2) Create extended benchmark schema (different file formats, compression, etc.)
and load data by copying from tables created in 1) using INSERT statements.
3) Run the benchmarks using $IMPALA_HOME/bin/run_benchmark.py
The *.sql scripts that create the extended benchmark schema and load its data are generated
dynamically by the generate_benchmark_statements.rb script. This script reads files that describe
which combinations of data set, file format, and compression algorithm to use, and outputs the
corresponding query files.
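
As a rough illustration of this generation step (written in Python for brevity, not taken from
generate_benchmark_statements.rb), the sketch below turns one test vector into an INSERT statement
that copies data from a base table. The vector line format, the dimension keys file_format and
compression, the table name sample_table, and the target-table naming scheme are all assumptions
made for this example.

```python
from typing import Dict


def parse_vector_line(line: str) -> Dict[str, str]:
    """Parse an assumed 'key: value, key: value' vector line into a dict."""
    pairs = (item.split(":", 1) for item in line.split(","))
    return {key.strip(): value.strip() for key, value in pairs}


def build_insert_statement(vector: Dict[str, str], base_table: str) -> str:
    """Build an INSERT that copies a base table into a format-specific table.

    The target name '<base>_<file_format>_<compression>' is a hypothetical
    naming scheme chosen only for this sketch.
    """
    target = "_".join([base_table, vector["file_format"], vector["compression"]])
    return f"INSERT OVERWRITE TABLE {target} SELECT * FROM {base_table};"


if __name__ == "__main__":
    vector = parse_vector_line("file_format: rcfile, compression: none")
    print(build_insert_statement(vector, "sample_table"))
```

The real script may use a different vector format and different table names; see its comments
for the actual behavior.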
The input to the generate_benchmark_statements.rb script is generated by the
generate_test_vectors.rb script. This script reads the values of each dimension (defined
in benchmark_dimensions.yaml), for example file format = rcfile, sequence file, text, and outputs
a set of test vectors. It outputs both an exhaustive and a reduced set of combinations.
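
To make the difference between the two sets concrete, here is a minimal Python sketch. The
exhaustive set is the full cross product of all dimension values. The reduced set shown here uses
a simple "every value appears at least once" rule, which is an assumption for illustration and
not necessarily the reduction that generate_test_vectors.rb performs. The compression values are
also hypothetical; only the file format values come from the description above.

```python
import itertools
from typing import Dict, List


def exhaustive_vectors(dimensions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return every combination of dimension values (full cross product)."""
    names = list(dimensions)
    combos = itertools.product(*(dimensions[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def reduced_vectors(dimensions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Return a smaller set in which each value of each dimension appears
    at least once. Vector i takes the i-th value of every dimension,
    wrapping around for dimensions with fewer values.
    """
    names = list(dimensions)
    if not names:
        return []
    longest = max(len(dimensions[name]) for name in names)
    return [
        {name: dimensions[name][i % len(dimensions[name])] for name in names}
        for i in range(longest)
    ]


if __name__ == "__main__":
    # Hypothetical dimension values; the real ones live in benchmark_dimensions.yaml.
    dims = {
        "file_format": ["text", "rcfile", "sequence_file"],
        "compression": ["none", "gzip"],
    }
    for vector in exhaustive_vectors(dims):
        print("exhaustive:", vector)
    for vector in reduced_vectors(dims):
        print("reduced:", vector)
```

With three file formats and two compression settings, the exhaustive set has six vectors while
this reduced set has three, which is why a reduced set is useful for quicker benchmark runs.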
Currently, a pre-generated set of vectors is checked in along with the *.sql files, so these
scripts only need to be run when a dimension is added or removed. The checked-in files are
benchmark_*.vector and create-benchmark*-generated.sql.
For more information about these scripts, see the comments within the scripts themselves.