| This directory contains Impala test data sets. The directory layout is structured as follows: |
| |
| datasets/ |
| <data set>/<data set>_schema_template.sql |
| <data set>/<data files SF1>/data files |
| <data set>/<data files SF2>/data files |
| |
| Where SF is the scale factor controlling data size. This allows for scaling the same schema to |
| different sizes based on the target test environment. |
| |
| The schema template SQL files have the following format: |
| |
| The goal is to provide a single place to define a table + data files |
| and have the schema and data load statements generated for each combination of file |
| format, compression, etc. The way this works is by specifying how to create a |
| 'base table'. The base table can be used to generate tables in other file formats |
| by performing the defined INSERT / SELECT INTO statement. Each new table using the |
| file format/compression combination needs to have a unique name, so all the |
| statements are pameterized on table name. |
| The template file is read in by the 'generate_schema_statements.py' script to |
| to generate all the schema for the Imapla benchmark tests. |
| |
| Each table is defined as a new section in the file with the following format: |
| |
| ==== |
| ---- SECTION NAME |
| section contents |
| ... |
| ---- ANOTHER SECTION |
| ... section contents |
| ---- ... more sections... |
| |
| Note that tables are delimited by '====' and that even the first table in the |
| file must include this header line. |
| |
| The supported section names are: |
| |
| DATASET |
| Data set name - Used to group sets of tables together |
| BASE_TABLE_NAME |
| The name of the table within the database |
| CREATE |
| Explicit CREATE statement used to create the table (executed by Impala) |
| CREATE_HIVE |
| Same as the above, but will be executed by Hive instead. If specified, |
| 'CREATE' must not be specified. |
| CREATE_KUDU |
| Customized CREATE TABLE statement used to create the table for Kudu-specific |
| syntax. |
| |
| COLUMNS |
| PARTITION_COLUMNS |
| ROW_FORMAT |
| HBASE_COLUMN_FAMILIES |
| TABLE_PROPERTIES |
| HBASE_REGION_SPLITS |
| If no explicit CREATE statement is provided, a CREATE statement is generated |
| from these sections (see 'build_table_template' function in |
| 'generate-schema-statements.py' for details) |
| |
| ALTER |
| A set of ALTER statements to be executed after the table is created |
| (typically to add partitions, but may also be used for other settings that |
| cannot be specified directly in the CREATE TABLE statement). |
| |
| These statements are ignored for HBase and Kudu tables. |
| |
| LOAD |
| The statement used to load the base (text) form of the table. This is |
| typically a LOAD DATA statement. |
| |
| DEPENDENT_LOAD |
| DEPENDENT_LOAD_KUDU |
| DEPENDENT_LOAD_HIVE |
| DEPENDENT_LOAD_ACID |
| Statements to be executed during the "dependent load" phase. These statements |
| are run after the initial (base table) load is complete. |
| |
| HIVE_MAJOR_VERSION |
| The required major version of Hive for this table. If the major version |
| of Hive at runtime does not exactly match the version specified in this section, |
| the table will be skipped. |
| |
| NOTE: this is not a _minimum_ version -- if HIVE_MAJOR_VERSION specifies '2', |
| the table will _not_ be loaded/created on Hive 3. |