| bad_parquet_data.parquet: |
| Generated with parquet-mr 1.2.5 |
| Contains 3 single-column rows: |
| "parquet" |
| "is" |
| "fun" |
| |
| bad_compressed_dict_page_size.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single string column 'col' with one row ("a"). The compressed_page_size field |
| in the dict page header is modified to 0 to test that it is handled correctly. |
| |
| bad_rle_literal_count.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single bigint column 'c' with the values 1, 3, 7 stored |
| in a single data chunk as dictionary plain. The RLE encoded dictionary |
| indexes are all literals (and not repeated), but the literal count |
| is incorrectly 0 in the file to test that such data corruption is |
| properly handled. |
| |
| bad_rle_repeat_count.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single bigint column 'c' with the value 7 repeated 7 times |
| stored in a single data chunk as dictionary plain. The RLE encoded dictionary |
| indexes are a single repeated run (and not literals), but the repeat count |
| is incorrectly 0 in the file to test that such data corruption is properly |
| handled. |
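| |
| For bad_rle_literal_count.parquet and bad_rle_repeat_count.parquet, the corrupted field |
| is the run header of Parquet's RLE/bit-packed hybrid encoding. A minimal Python sketch |
| of how such a header is decoded, and where a zero count can be rejected, is below. The |
| function names are hypothetical and this is not Impala's actual reader code. |

```python
def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def parse_run_header(buf, pos=0):
    """Classify one run of the RLE/bit-packed hybrid encoding.

    Returns (kind, value_count, next_pos). Raises ValueError for a zero
    count, which is the kind of corruption the two files above carry.
    """
    header, pos = read_uleb128(buf, pos)
    if header & 1:
        kind = "literal"
        count = (header >> 1) * 8  # literal runs are stored in groups of 8
    else:
        kind = "repeated"
        count = header >> 1
    if count == 0:
        raise ValueError(f"corrupt {kind} run: count is 0")
    return kind, count, pos
```

| For example, parse_run_header(bytes([0x0E])) reports a repeated run of 7 values, which is |
| the header one would expect for seven copies of the same dictionary index. |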
| |
| zero_rows_zero_row_groups.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates zero rows and no row groups. |
| |
| zero_rows_one_row_group.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates zero rows but one row group. |
| |
| huge_num_rows.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates 2 * MAX_INT32 rows. |
| The single row group also has the same number of rows in the metadata. |
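| |
| The row count above does not fit in a signed 32-bit integer. A small Python sketch of |
| the arithmetic, assuming MAX_INT32 means 2^31 - 1 (the names below are illustrative): |

```python
import ctypes

MAX_INT32 = 2**31 - 1      # assumption: the largest signed 32-bit value
num_rows = 2 * MAX_INT32   # the row count claimed by the file metadata


def fits_in_int32(n):
    """True if n can be stored in a signed 32-bit integer without loss."""
    return -2**31 <= n <= MAX_INT32


# What a reader would see if it narrowed the 64-bit count to 32 bits.
truncated = ctypes.c_int32(num_rows).value
```

| Here fits_in_int32(num_rows) is False, and the truncated value comes out negative. |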
| |
| repeated_values.parquet: |
| Generated with parquet-mr 1.2.5 |
| Contains 3 single-column rows: |
| "parquet" |
| "parquet" |
| "parquet" |
| |
| multiple_rowgroups.parquet: |
| Generated with parquet-mr 1.2.5 |
| Populated with: |
| hive> set parquet.block.size=500; |
| hive> INSERT INTO TABLE tbl |
| SELECT l_comment FROM tpch.lineitem LIMIT 1000; |
| |
| alltypesagg_hive_13_1.parquet: |
| Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT |
| hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg; |
| |
| bad_column_metadata.parquet: |
| Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT |
| Schema: |
| {"type": "record", |
| "namespace": "org.apache.impala", |
| "name": "bad_column_metadata", |
| "fields": [ |
| {"name": "id", "type": ["null", "long"]}, |
| {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]} |
| ] |
| } |
| Contains 3 row groups, each with ten rows and each array containing ten elements. The |
| first rowgroup column metadata for 'int_array' incorrectly states there are 50 values |
| (instead of 100), and the second rowgroup column metadata for 'id' incorrectly states |
| there are 11 values (instead of 10). The third rowgroup has the correct metadata. |
| |
| data-bzip2.bz2: |
| Generated with bzip2, contains single bzip2 stream |
| Contains 1 column, uncompressed data size < 8M |
| |
| large_bzip2.bz2: |
| Generated with bzip2, contains single bzip2 stream |
| Contains 1 column, uncompressed data size > 8M |
| |
| data-pbzip2.bz2: |
| Generated with pbzip2, contains multiple bzip2 streams |
| Contains 1 column, uncompressed data size < 8M |
| |
| large_pbzip2.bz2: |
| Generated with pbzip2, contains multiple bzip2 streams |
| Contains 1 column, uncompressed data size > 8M |
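| |
| The single-stream and multi-stream files above differ in a way that matters to |
| decompressors: a reader that stops at the first end-of-stream marker sees only part of |
| a multi-stream file. A Python sketch using the standard bz2 module (the sample data and |
| the helper name are made up for illustration): |

```python
import bz2

first = bz2.compress(b"row1\n")
second = bz2.compress(b"row2\n")
multi_stream = first + second  # back-to-back streams, as in a multi-stream file


def decompress_all_streams(data):
    """Decompress every bzip2 stream in data, one decompressor per stream."""
    out = []
    while data:
        decomp = bz2.BZ2Decompressor()
        out.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        data = decomp.unused_data  # bytes that follow the finished stream
    return b"".join(out)


# A single decompressor object stops at the end of the first stream.
single = bz2.BZ2Decompressor()
only_first = single.decompress(multi_stream)
```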
| |
| out_of_range_timestamp.parquet: |
| Generated with a hacked version of Impala's Parquet writer. |
| Contains a single timestamp column with 4 values, 2 of which are out of range |
| and should be read as NULL by Impala: |
| 1399-12-31 00:00:00 (invalid - date too small) |
| 1400-01-01 00:00:00 |
| 9999-12-31 00:00:00 |
| 10000-01-01 00:00:00 (invalid - date too large) |
| |
| table_with_header.csv: |
| Created with a text editor, contains a header line before the data rows. |
| |
| table_with_header_2.csv: |
| Created with a text editor, contains two header lines before the data rows. |
| |
| table_with_header.gz, table_with_header_2.gz: |
| Generated by gzip'ing table_with_header.csv and table_with_header_2.csv. |
| |
| deprecated_statistics.parquet: |
| Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT |
| Contains a copy of the data in functional.alltypessmall with statistics that use the old |
| 'min'/'max' fields. |
| |
| repeated_root_schema.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Created to reproduce IMPALA-4826. Contains a table of 300 rows where the |
| repetition level of the root schema is set to REPEATED. |
| Reproduction steps: |
| 1: Extend HdfsParquetTableWriter::CreateSchema with the following line: |
| file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED); |
| 2: Run test_compute_stats and grab the created Parquet file for |
| alltypes_parquet table. |
| |
| binary_decimal_dictionary.parquet, |
| binary_decimal_no_dictionary.parquet: |
| Generated using parquet-mr and contents verified using parquet-tools-1.9.1. |
| Contains decimals stored as variable sized BYTE_ARRAY with both dictionary |
| and non-dictionary encoding respectively. |
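| |
| As a reference for checking such values by hand: the Parquet format stores a DECIMAL |
| in a BYTE_ARRAY as the big-endian two's complement unscaled value. A Python sketch of |
| the decoding (the helper name is hypothetical; the scale comes from the file schema): |

```python
from decimal import Decimal


def decode_binary_decimal(raw, scale):
    """Turn a big-endian two's complement byte string into a Decimal."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple is exact, even for 38-digit values.
    return Decimal((sign, digits, -scale))
```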
| |
| alltypes_agg_bitpacked_def_levels.parquet: |
| Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead |
| of the standard RLE-encoded levels. See |
| https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This |
| is a single file containing all of the alltypesagg data, which includes a mix of |
| null and non-null values. This is not actually a valid Parquet file because the |
| bit-packed levels are written in the reverse order specified in the Parquet spec |
| for BIT_PACKED. However, this is the order that Impala attempts to read the levels |
| in - see IMPALA-3006. |
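| |
| The two bit orders mentioned above can be illustrated with a short Python sketch. It |
| packs the values 0..7 at bit width 3, which is the example the Parquet format |
| documentation uses. The function names are hypothetical. |

```python
def pack_lsb_first(values, width):
    """Pack values starting at the least significant bit of each byte
    (the order used inside the RLE/bit-packed hybrid encoding)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return acc.to_bytes(nbytes, "little")


def pack_msb_first(values, width):
    """Pack values starting at the most significant bit of each byte
    (the order the spec gives for the deprecated BIT_PACKED encoding)."""
    acc = 0
    for v in values:
        acc = (acc << width) | v
    nbits = len(values) * width
    nbytes = (nbits + 7) // 8
    acc <<= nbytes * 8 - nbits  # pad the final byte with zero bits
    return acc.to_bytes(nbytes, "big")
```

| The same eight values produce different bytes in the two orders, so a reader and a |
| writer that disagree on the order decode garbage levels. |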
| |
| signed_integer_logical_types.parquet: |
| Generated using a utility that uses the Java Parquet API. |
| The file has the following schema: |
| schema { |
| optional int32 id; |
| optional int32 tinyint_col (INT_8); |
| optional int32 smallint_col (INT_16); |
| optional int32 int_col; |
| optional int64 bigint_col; |
| } |
| |
| min_max_is_nan.parquet: |
| Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53 |
| Created to test the read path for a Parquet file with invalid metadata, namely when |
| 'max_value' and 'min_value' are both NaN. Contains 2 single-column rows: |
| NaN |
| 42 |
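| |
| A Python sketch of why NaN min/max statistics are dangerous for row-group filtering. |
| The predicate shape (col > threshold) and the function names are assumptions made for |
| illustration; this is not Impala's filtering code. |

```python
import math


def naive_can_skip(max_value, threshold):
    """Buggy check for `col > threshold`: every comparison with NaN is
    False, so NaN statistics make the row group look skippable."""
    return not (max_value > threshold)


def can_skip_for_greater_than(min_value, max_value, threshold):
    """Safer check: treat NaN statistics as unusable and scan the data."""
    if math.isnan(min_value) or math.isnan(max_value):
        return False
    return max_value <= threshold
```

| With both statistics set to NaN, the naive check would skip the row group for |
| col > 0 and lose the row containing 42; the safer check does not. |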
| |
| bad_codec.parquet: |
| Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the |
| compression codec. The data in the file is the whole of the "alltypestiny" data set, with |
| the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint, |
| int_col int, bigint_col bigint, float_col float, double_col double, |
| date_string_col string, string_col string, timestamp_col timestamp, year int, month int |
| |
| num_values_def_levels_mismatch.parquet: |
| A file with a single boolean column with page metadata reporting 2 values but only def |
| levels for a single literal value. Generated by hacking Impala's parquet writer to |
| increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK |
| (IMPALA-6589). |
| |
| rle_encoded_bool.parquet: |
| Parquet v1 file with RLE encoded boolean column "b" and int column "i". |
| Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows, |
| 139 with value false, and 140 with value true. "i" is always 1 if "b" is True |
| and always 0 if "b" is false. |
| |
| dict_encoding_with_large_bit_width.parquet: |
| Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified |
| Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead |
| to DCHECK errors (IMPALA-7147). |
| |
| decimal_stored_as_int32.parquet: |
| Parquet file generated by Spark 2.3.1 that contains decimals stored as int32. |
| Impala needs to be able to read such values (IMPALA-5542) |
| |
| decimal_stored_as_int64.parquet: |
| Parquet file generated by Spark 2.3.1 that contains decimals stored as int64. |
| Impala needs to be able to read such values (IMPALA-5542) |
| |
| primitive_type_widening.parquet: |
| Parquet file that contains two rows with the following schema: |
| - int32 tinyint_col1 |
| - int32 tinyint_col2 |
| - int32 tinyint_col3 |
| - int32 tinyint_col4 |
| - int32 smallint_col1 |
| - int32 smallint_col2 |
| - int32 smallint_col3 |
| - int32 int_col1 |
| - int32 int_col2 |
| - float float_col |
| It is used to test primitive type widening (IMPALA-6373). |
| |
| corrupt_footer_len_decr.parquet: |
| Parquet file that contains one row of the following schema: |
| - bigint c |
| The footer size is manually modified (using hexedit) to be the original file size minus |
| 1, which causes metadata deserialization in footer parsing to fail and triggers the |
| printing of an error message that used to contain an incorrect file offset. It is used |
| to verify the fix for IMPALA-6442. |
| |
| corrupt_footer_len_incr.parquet: |
| Parquet file that contains one row of the following schema: |
| - bigint c |
| The footer size is manually modified (using hexedit) to be larger than the original file |
| size and cause footer parsing to fail. It's used to test an error message related to |
| IMPALA-6442. |
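| |
| Both corrupt_footer_len files modify the 4-byte little-endian footer length that sits |
| just before the trailing "PAR1" magic. A Python sketch of locating the footer and |
| rejecting a length that cannot fit in the file (hypothetical helper, not Impala code): |

```python
import struct

MAGIC = b"PAR1"


def read_footer_range(data):
    """Return (start, end) byte offsets of the footer metadata in data."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8          # footer ends where the length field starts
    start = end - footer_len
    if start < 4:                # footer would overlap the leading magic
        raise ValueError(
            f"footer length {footer_len} exceeds file size {len(data)}")
    return start, end
```

| A length that is too small but still in bounds passes this check and only fails later, |
| when the metadata bytes are deserialized. |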
| |
| hive_single_value_timestamp.parq: |
| Parquet file written by Hive with the following schema: |
| i int, timestamp d |
| Contains a single row. It is used to test IMPALA-7559 which only occurs when all values |
| in a column chunk are the same timestamp and the file is written with parquet-mr (which |
| is used by Hive). |
| |
| out_of_range_time_of_day.parquet: |
| IMPALA-7595: Parquet file that contains timestamps where the time part is out of the |
| valid range [0..24H). Before the fix, select * returned these values: |
| 1970-01-01 -00:00:00.000000001 (invalid - negative time of day) |
| 1970-01-01 00:00:00 |
| 1970-01-01 23:59:59.999999999 |
| 1970-01-01 24:00:00 (invalid - time of day should be less than a whole day) |
| |
| strings_with_quotes.csv: |
| Various strings with quotes in them to reproduce bugs like IMPALA-7586. |
| |
| int64_timestamps_plain.parq: |
| Parquet file generated with Parquet-mr that contains plain encoded int64 columns with |
| Timestamp logical types. Has the following columns: |
| new_logical_milli_utc, new_logical_milli_local, |
| new_logical_micro_utc, new_logical_micro_local |
| |
| int64_timestamps_dict.parq: |
| Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns |
| with Timestamp logical types. Has the following columns: |
| id, |
| new_logical_milli_utc, new_logical_milli_local, |
| new_logical_micro_utc, new_logical_micro_local |
| |
| int64_timestamps_at_dst_changes.parquet: |
| Parquet file generated with Parquet-mr that contains plain encoded int64 columns with |
| Timestamp logical types. The file contains 3 row groups, and all row groups contain |
| 3 distinct values, so there is a "min", a "max", and a "middle" value. The values were |
| selected in such a way that the UTC->CET conversion changes the order of the values (this |
| is possible during Summer->Winter DST change) and "middle" falls outside the "min".."max" |
| range after conversion. This means that a naive stat filtering implementation could drop |
| "middle" incorrectly. |
| Example (all dates are 2017-10-29): |
| UTC: 00:45:00, 01:00:00, 01:10:00 => |
| CET: 02:45:00, 02:00:00, 02:10:00 |
| Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp |
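| |
| The example above can be replayed with a Python sketch. The UTC->CET rule is hard-coded |
| for 2017-10-29 (an assumption that keeps the sketch free of time zone databases), and |
| the function names are illustrative. |

```python
from datetime import datetime, timedelta

# Assumption: valid for 2017-10-29 only. Clocks went back at 01:00 UTC,
# so the offset drops from +2h (summer time) to +1h (winter time).
DST_END_UTC = datetime(2017, 10, 29, 1, 0, 0)


def utc_to_cet(ts):
    """Convert a naive UTC datetime on 2017-10-29 to local CET/CEST time."""
    offset = timedelta(hours=2) if ts < DST_END_UTC else timedelta(hours=1)
    return ts + offset


def naive_in_range(value, range_min, range_max):
    """Stat filter that assumes conversion preserves ordering."""
    return range_min <= value <= range_max


lo = datetime(2017, 10, 29, 0, 45)
middle = datetime(2017, 10, 29, 1, 0)
hi = datetime(2017, 10, 29, 1, 10)
local_lo, local_middle, local_hi = (utc_to_cet(v) for v in (lo, middle, hi))
```

| In UTC, "middle" lies between "min" and "max". After conversion the range check on the |
| converted min/max rejects the converted "middle", which is the case the file covers. |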
| |
| int64_timestamps_nano.parquet: |
| Parquet file generated with Parquet-mr that contains int64 columns with nanosecond |
| precision. Tested separately from the micro/millisecond columns because of the different |
| valid range. |
| Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp |
| |
| out_of_range_timestamp_hive_211.parquet: |
| Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using |
| the following query: |
| create table alltypes_hive stored as parquet as |
| select * from functional.alltypes |
| union all |
| select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0 |
| |
| out_of_range_timestamp2_hive_211.parquet: |
| Hive-generated file with out-of-range timestamps every second value, to exercise code |
| paths in Parquet scanner for non-repeated runs. Generated with Hive 2.1.1 using |
| the following query: |
| create table hive_invalid_timestamps stored as parquet as |
| select id, |
| case id % 3 |
| when 0 then timestamp_col |
| when 1 then NULL |
| when 2 then cast('1300-01-01 9:9:9' as timestamp) |
| end timestamp_col |
| from functional.alltypes |
| sort by id |
| |
| decimal_rtf_tbl.txt: |
| This was generated using formulas in Google Sheets. The goal was to create various |
| decimal values that cover the 3 storage formats with various precisions and scales. |
| This is a reasonably large table that is used for testing min-max filters |
| with decimal types on Kudu. |
| |
| decimal_rtf_tiny_tbl.txt: |
| Small table with specific decimal values picked from decimal_rtf_tbl.txt so that |
| min-max filter based pruning can be tested with decimal types on Kudu. |
| |
| date_tbl.orc: |
| Small orc table with one DATE column, created by Hive. |
| |
| date_tbl.avro: |
| Small avro table with one DATE column, created by Hive. |
| |
| date_tbl.parquet: |
| Small parquet table with one DATE column, created by Parquet MR. |
| |
| out_of_range_date.parquet: |
| Generated with a hacked version of Impala's Parquet writer. |
| Contains a single DATE column with 9 values, 4 of which are out of range |
| and should be read as NULL by Impala: |
| -0001-12-31 (invalid - date too small) |
| 0000-01-01 (invalid - date too small) |
| 0000-01-02 (invalid - date too small) |
| 1969-12-31 |
| 1970-01-01 |
| 1970-01-02 |
| 9999-12-30 |
| 9999-12-31 |
| 10000-01-01 (invalid - date too large) |
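| |
| A Python sketch of the range check, assuming the DATE values are stored as days since |
| 1970-01-01 and that the valid range is 0001-01-01..9999-12-31. That range happens to |
| match Python's datetime.date, so an overflow stands in for "read as NULL". |

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def days_to_date(days_since_epoch):
    """Convert a day offset from the epoch to a date.

    Returns None when the result falls outside 0001-01-01..9999-12-31
    (assumed here to be the valid range).
    """
    try:
        return EPOCH + timedelta(days=days_since_epoch)
    except OverflowError:
        return None
```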
| |
| hive2_pre_gregorian.parquet: |
| Small parquet table with one DATE column, created by Hive 2.1.1. |
| Used to demonstrate parquet interoperability issues between Hive and Impala for dates |
| before the introduction of Gregorian calendar in 1582-10-15. |
| |
| decimals_1_10.parquet: |
| Contains two decimal columns, one with precision 1, the other with precision 10. |
| I used Hive 2.1.1 with a modified version of Parquet-MR (6901a20) to create tiny, |
| misaligned pages in order to test the value-skipping logic in the Parquet column readers. |
| The modification in Parquet-MR was to set MIN_SLAB_SIZE to 1. You can find the change |
| here: https://github.com/boroknagyz/parquet-mr/tree/tiny_pages |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=5 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table decimals_1_10 (d_1 DECIMAL(1, 0), d_10 DECIMAL(10, 0)) stored as PARQUET |
| insert into decimals_1_10 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), |
| (NULL, 1), (2, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (NULL, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (2, 2), (NULL, 3), (4, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (4, 4), (NULL, 5), |
| (NULL, 1), (NULL, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (NULL, 2), (3, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (NULL, 4), (NULL, 5), |
| (NULL, 1), (2, 2), (NULL, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); |
| |
| nested_decimals.parquet: |
| Contains two columns, one is a decimal column, the other is an array of decimals. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table nested_decimals (d_38 Decimal(38, 0), arr array<Decimal(1, 0)>) stored as parquet; |
| insert into nested_decimals select 1, array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) ) union all |
| select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all |
| select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all |
| select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all |
| select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all |
| |
| select 1, array(cast (1 as decimal(1,0)) ) union all |
| select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all |
| select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all |
| select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all |
| select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all |
| |
| select 1, array(cast (NULL as decimal(1, 0)), NULL, NULL) union all |
| select 2, array(cast (2 as decimal(1,0)), NULL, NULL) union all |
| select 3, array(cast (3 as decimal(1,0)), NULL, cast (3 as decimal(1,0))) union all |
| select 4, array(NULL, cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), NULL) union all |
| select 5, array(NULL, cast (5 as decimal(1,0)), NULL, NULL, cast (5 as decimal(1,0)) ) union all |
| |
| select 6, array(cast (6 as decimal(1,0)), NULL, cast (6 as decimal(1,0)) ) union all |
| select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), NULL ) union all |
| select 8, array(NULL, NULL, cast (8 as decimal(1,0)) ) union all |
| select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)) ) union all |
| select 6, array(NULL, NULL, NULL, cast (6 as decimal(1,0)) ); |
| |
| double_nested_decimals.parquet: |
| Contains two columns, one is a decimal column, the other is an array of arrays of |
| decimals. I used Hive 2.1.1 with a modified Parquet-MR, see description |
| at decimals_1_10.parquet. |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table double_nested_decimals (d_38 Decimal(38, 0), arr array<array<Decimal(1, 0)>>) stored as parquet; |
| insert into double_nested_decimals select 1, array(array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) )) union all |
| select 2, array(array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) )) union all |
| select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) )) union all |
| select 4, array(array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) )) union all |
| select 5, array(array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) )) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))) ) union all |
| select 3, array(array(cast (NULL as decimal(1,0))), array(cast (3 as decimal(1,0))), NULL ) union all |
| select 4, array(NULL, NULL, array(cast (NULL as decimal(1,0)), NULL, NULL, NULL, NULL) ) union all |
| select 5, array(array(NULL, cast (5 as decimal(1,0)), NULL, NULL, NULL) ) union all |
| |
| select 6, array(array(cast (6 as decimal(1,0)), NULL), array(cast (6 as decimal(1,0))) ) union all |
| select 7, array(array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0))), NULL ) union all |
| select 8, array(array(NULL, NULL, cast (8 as decimal(1,0))) ) union all |
| select 7, array(array(cast (7 as decimal(1,0)), cast (NULL as decimal(1,0))), array(cast (7 as decimal(1,0))) ) union all |
| select 6, array(array(NULL, NULL, cast (6 as decimal(1,0))), array(NULL, cast (6 as decimal(1,0))) ); |
| |
| alltypes_tiny_pages.parquet: |
| Created from 'functional.alltypes' with small page sizes. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| I used the following commands to create the file: |
| hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.page.size.row.check.min=7 |
| create table alltypes_tiny_pages stored as parquet as select * from functional_parquet.alltypes |
| |
| alltypes_tiny_pages_plain.parquet: |
| Created from 'functional.alltypes' with small page sizes without dictionary encoding. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| I used the following commands to create the file: |
| hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=7 |
| create table alltypes_tiny_pages_plain stored as parquet as select * from functional_parquet.alltypes |
| |
| parent_table: |
| Created manually. Contains two columns, an INT and a STRING column. Together they form |
| the primary key for the table. This table is used to test primary key and foreign key |
| relationships along with parent_table_2 and child_table. |
| |
| parent_table_2: |
| Created manually. Contains just one INT column, which is also the table's primary key. |
| This table is used to test primary key and foreign key relationships along with |
| parent_table and child_table. |
| |
| child_table: |
| Created manually. Contains four columns. The 'seq' column is the primary key of this |
| table. ('id', 'year') form a foreign key referring to parent_table('id', 'year'), and |
| 'a' is a foreign key referring to parent_table_2's primary key column 'a'. |
| |
| out_of_range_timestamp.orc: |
| Created with Hive. ORC file with a single timestamp column 'ts'. |
| Contains one row (1300-01-01 00:00:00) which is outside Impala's valid time range. |