| bad_parquet_data.parquet: |
| Generated with parquet-mr 1.2.5 |
| Contains 3 single-column rows: |
| "parquet" |
| "is" |
| "fun" |
| |
| bad_compressed_dict_page_size.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single string column 'col' with one row ("a"). The compressed_page_size field |
| in the dictionary page header is modified to 0 to test that it is correctly handled. |
| |
| bad_rle_literal_count.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single bigint column 'c' with the values 1, 3, 7 stored |
| in a single data chunk as dictionary plain. The RLE encoded dictionary |
| indexes are all literals (and not repeated), but the literal count |
| is incorrectly 0 in the file to test that such data corruption is |
| properly handled. |
| |
| bad_rle_repeat_count.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Contains a single bigint column 'c' with the value 7 repeated 7 times |
| stored in a single data chunk as dictionary plain. The RLE encoded dictionary |
| indexes are a single repeated run (and not literals), but the repeat count |
| is incorrectly 0 in the file to test that such data corruption is properly |
| handled. |
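| For context, both files above corrupt the run headers of Parquet's RLE/bit-packed hybrid |
| encoding. A minimal, illustrative Python sketch of how such a run header is read (this is |
| not Impala's implementation; the example bytes below are made up): |
| # Each run starts with a ULEB128 header; the low bit selects the run type. |
| def parse_run_header(buf, pos): |
|     header, shift = 0, 0 |
|     while True: |
|         byte = buf[pos] |
|         pos += 1 |
|         header += (byte & 0x7F) << shift |
|         if (byte & 0x80) == 0: |
|             break |
|         shift += 7 |
|     if header & 1: |
|         # Bit-packed run: (header >> 1) groups of 8 values follow. |
|         return pos, 'bit-packed', (header >> 1) * 8 |
|     # RLE run: (header >> 1) repetitions of a single value follow. |
|     return pos, 'rle', header >> 1 |
| print(parse_run_header(bytes([0x03]), 0))  # bit-packed run announcing 8 values |
| print(parse_run_header(bytes([0x00]), 0))  # corrupt: a run announcing 0 values, as in these files |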
| |
| zero_rows_zero_row_groups.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates zero rows and no row groups. |
| |
| zero_rows_one_row_group.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates zero rows but one row group. |
| |
| huge_num_rows.parquet: |
| Generated by hacking Impala's Parquet writer. |
| The file metadata indicates 2 * MAX_INT32 rows. |
| The single row group also has the same number of rows in the metadata. |
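| The bogus counts can be seen without scanning any data by reading the footer, e.g. with a |
| short pyarrow sketch like this (local file path assumed): |
| import pyarrow.parquet as pq |
| meta = pq.ParquetFile("huge_num_rows.parquet").metadata |
| print("file num_rows:", meta.num_rows)            # reports 2 * MAX_INT32 |
| print("row groups:", meta.num_row_groups) |
| print("row group 0 num_rows:", meta.row_group(0).num_rows) |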
| |
| repeated_values.parquet: |
| Generated with parquet-mr 1.2.5 |
| Contains 3 single-column rows: |
| "parquet" |
| "parquet" |
| "parquet" |
| |
| multiple_rowgroups.parquet: |
| Generated with parquet-mr 1.2.5 |
| Populated with: |
| hive> set parquet.block.size=500; |
| hive> INSERT INTO TABLE tbl |
| SELECT l_comment FROM tpch.lineitem LIMIT 1000; |
| |
| alltypesagg_hive_13_1.parquet: |
| Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT |
| hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg; |
| |
| bad_column_metadata.parquet: |
| Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT |
| Schema: |
| {"type": "record", |
| "namespace": "org.apache.impala", |
| "name": "bad_column_metadata", |
| "fields": [ |
| {"name": "id", "type": ["null", "long"]}, |
| {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]} |
| ] |
| } |
| Contains 3 row groups, each with ten rows and each array containing ten elements. The |
| first rowgroup column metadata for 'int_array' incorrectly states there are 50 values |
| (instead of 100), and the second rowgroup column metadata for 'id' incorrectly states |
| there are 11 values (instead of 10). The third rowgroup has the correct metadata. |
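| The tampered counts can be checked from the footer alone; a pyarrow sketch along these |
| lines (local file path assumed) prints the per-row-group value counts: |
| import pyarrow.parquet as pq |
| meta = pq.ParquetFile("bad_column_metadata.parquet").metadata |
| for rg in range(meta.num_row_groups): |
|     row_group = meta.row_group(rg) |
|     for col in range(row_group.num_columns): |
|         chunk = row_group.column(col) |
|         print(rg, chunk.path_in_schema, chunk.num_values) |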
| |
| data-bzip2.bz2: |
| Generated with bzip2, contains single bzip2 stream |
| Contains 1 column, uncompressed data size < 8M |
| |
| large_bzip2.bz2: |
| Generated with bzip2, contains single bzip2 stream |
| Contains 1 column, uncompressed data size > 8M |
| |
| data-pbzip2.bz2: |
| Generated with pbzip2, contains multiple bzip2 streams |
| Contains 1 column, uncompressed data size < 8M |
| |
| large_pbzip2.bz2: |
| Generated with pbzip2, contains multiple bzip2 streams |
| Contains 1 column, uncompressed data size > 8M |
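| The single-stream vs. multi-stream difference can be verified with Python's standard |
| library; a rough sketch (file names assumed) that counts the concatenated bzip2 streams: |
| import bz2 |
| def count_bz2_streams(path): |
|     with open(path, "rb") as f: |
|         data = f.read() |
|     streams = 0 |
|     while data: |
|         decomp = bz2.BZ2Decompressor() |
|         decomp.decompress(data)      # consume one stream |
|         streams += 1 |
|         data = decomp.unused_data    # bytes of the next stream, if any |
|     return streams |
| print(count_bz2_streams("data-bzip2.bz2"))   # expected: 1 |
| print(count_bz2_streams("data-pbzip2.bz2"))  # expected: > 1 |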
| |
| out_of_range_timestamp.parquet: |
| Generated with a hacked version of Impala parquet writer. |
| Contains a single timestamp column with 4 values, 2 of which are out of range |
| and should be read as NULL by Impala: |
| 1399-12-31 00:00:00 (invalid - date too small) |
| 1400-01-01 00:00:00 |
| 9999-12-31 00:00:00 |
| 10000-01-01 00:00:00 (invalid - date too large) |
| |
| table_with_header.csv: |
| Created with a text editor, contains a header line before the data rows. |
| |
| table_with_header_2.csv: |
| Created with a text editor, contains two header lines before the data rows. |
| |
| table_with_header.gz, table_with_header_2.gz: |
| Generated by gzip'ing table_with_header.csv and table_with_header_2.csv. |
| |
| deprecated_statistics.parquet: |
| Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT. |
| Contains a copy of the data in functional.alltypessmall with statistics that use the old |
| 'min'/'max' fields. |
| |
| repeated_root_schema.parquet: |
| Generated by hacking Impala's Parquet writer. |
| Created to reproduce IMPALA-4826. Contains a table of 300 rows where the |
| repetition level of the root schema is set to REPEATED. |
| Reproduction steps: |
| 1: Extend HdfsParquetTableWriter::CreateSchema with the following line: |
| file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED); |
| 2: Run test_compute_stats and grab the created Parquet file for |
| alltypes_parquet table. |
| |
| binary_decimal_dictionary.parquet, |
| binary_decimal_no_dictionary.parquet: |
| Generated using parquet-mr and contents verified using parquet-tools-1.9.1. |
| Contains decimals stored as variable sized BYTE_ARRAY with both dictionary |
| and non-dictionary encoding respectively. |
| |
| alltypes_agg_bitpacked_def_levels.parquet: |
| Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead |
| of the standard RLE-encoded levels. See |
| https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This |
| is a single file containing all of the alltypesagg data, which includes a mix of |
| null and non-null values. This is not actually a valid Parquet file because the |
| bit-packed levels are written in the reverse order specified in the Parquet spec |
| for BIT_PACKED. However, this is the order that Impala attempts to read the levels |
| in - see IMPALA-3006. |
| |
| signed_integer_logical_types.parquet: |
| Generated using a utility that uses the java Parquet API. |
| The file has the following schema: |
| schema { |
| optional int32 id; |
| optional int32 tinyint_col (INT_8); |
| optional int32 smallint_col (INT_16); |
| optional int32 int_col; |
| optional int64 bigint_col; |
| } |
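| When read back, the INT_8/INT_16 annotations should surface as narrow integer types; a |
| pyarrow sketch (local file path assumed) to inspect both views of the schema: |
| import pyarrow.parquet as pq |
| pf = pq.ParquetFile("signed_integer_logical_types.parquet") |
| print(pf.schema)        # physical schema with the INT_8/INT_16 annotations |
| print(pf.schema_arrow)  # logical view: tinyint_col -> int8, smallint_col -> int16 |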
| |
| min_max_is_nan.parquet: |
| Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53 |
| Created to test the read path for a Parquet file with invalid metadata, namely when |
| 'max_value' and 'min_value' are both NaN. Contains 2 single-column rows: |
| NaN |
| 42 |
| |
| bad_codec.parquet: |
| Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the |
| compression codec. The data in the file is the whole of the "alltypestiny" data set, with |
| the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint, |
| int_col int, bigint_col bigint, float_col float, double_col double, |
| date_string_col string, string_col string, timestamp_col timestamp, year int, month int |
| |
| num_values_def_levels_mismatch.parquet: |
| A file with a single boolean column with page metadata reporting 2 values but only def |
| levels for a single literal value. Generated by hacking Impala's parquet writer to |
| increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK |
| (IMPALA-6589). |
| |
| rle_encoded_bool.parquet: |
| Parquet v1 file with RLE encoded boolean column "b" and int column "i". |
| Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows, |
| 139 with value false, and 140 with value true. "i" is always 1 if "b" is true |
| and always 0 if "b" is false. |
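| The documented contents can be re-checked with a short pyarrow sketch (path assumed): |
| import pyarrow.parquet as pq |
| rows = pq.read_table("rle_encoded_bool.parquet").to_pylist() |
| assert len(rows) == 279 |
| assert sum(1 for r in rows if r["b"]) == 140                 # 140 true values |
| assert all((r["i"] == 1) == bool(r["b"]) for r in rows)      # i is 1 exactly when b is true |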
| |
| dict_encoding_with_large_bit_width.parquet: |
| Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified |
| Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead |
| to DCHECK errors (IMPALA-7147). |
| |
| decimal_stored_as_int32.parquet: |
| Parquet file generated by Spark 2.3.1 that contains decimals stored as int32. |
| Impala needs to be able to read such values (IMPALA-5542) |
| |
| decimal_stored_as_int64.parquet: |
| Parquet file generated by Spark 2.3.1 that contains decimals stored as int64. |
| Impala needs to be able to read such values (IMPALA-5542) |
| |
| decimal_padded_fixed_len_byte_array.parquet: |
| Parquet file generated by a hacked Impala where decimals are encoded as |
| FIXED_LEN_BYTE_ARRAY with an extra byte of padding (IMPALA-2515). The |
| data is the same as functional.decimal_tbl. |
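| For reference, FIXED_LEN_BYTE_ARRAY decimals are big-endian two's-complement unscaled |
| integers, so an extra leading pad byte (0x00, or 0xFF for negative values) does not change |
| the decoded value. An illustrative sketch (not Impala's code; the example bytes are made up): |
| from decimal import Decimal |
| def decode_flba_decimal(raw, scale): |
|     unscaled = int.from_bytes(raw, byteorder="big", signed=True) |
|     return Decimal(unscaled).scaleb(-scale) |
| print(decode_flba_decimal(b"\x30\x39", 2))      # 123.45 |
| print(decode_flba_decimal(b"\x00\x30\x39", 2))  # same value, padded with 0x00 |
| print(decode_flba_decimal(b"\xff\xcf\xc7", 2))  # -123.45, sign-extended padding |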
| |
| decimal_padded_fixed_len_byte_array2.parquet: |
| Parquet file generated by a hacked Impala where decimals are encoded as |
| FIXED_LEN_BYTE_ARRAY with an extra byte of padding (IMPALA-2515). |
| The Impala was hacked to limit dictionaries to 2000 entries, and |
| PARQUET_PAGE_ROW_COUNT_LIMIT was set to 1234. This resulted in |
| a file with two dictionary encoded pages and multiple plain |
| encoded pages. The values are distributed across the full |
| range of DECIMAL(10, 0). The file was created as follows: |
| |
| create table d(d decimal(10, 0)) stored as parquet; |
| set num_nodes=1; |
| set PARQUET_PAGE_ROW_COUNT_LIMIT=1234; |
| insert into d |
| select distinct cast((1000000000 - (o_orderkey * 110503)) % 1000000000 as decimal(9, 0)) |
| from tpch_parquet.orders where o_orderkey < 50000; |
| |
| |
| primitive_type_widening.parquet: |
| Parquet file that contains two rows with the following schema: |
| - int32 tinyint_col1 |
| - int32 tinyint_col2 |
| - int32 tinyint_col3 |
| - int32 tinyint_col4 |
| - int32 smallint_col1 |
| - int32 smallint_col2 |
| - int32 smallint_col3 |
| - int32 int_col1 |
| - int32 int_col2 |
| - float float_col |
| It is used to test primitive type widening (IMPALA-6373). |
| |
| corrupt_footer_len_decr.parquet: |
| Parquet file that contains one row of the following schema: |
| - bigint c |
| The footer size is manually modified (using hexedit) to be the original file size minus |
| 1, to cause metadata deserialization in footer parsing to fail, thus triggering the |
| printing of an error message with an incorrect file offset, to verify that it's fixed by |
| IMPALA-6442. |
| |
| corrupt_footer_len_incr.parquet: |
| Parquet file that contains one row of the following schema: |
| - bigint c |
| The footer size is manually modified (using hexedit) to be larger than the original file |
| size and cause footer parsing to fail. It's used to test an error message related to |
| IMPALA-6442. |
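| Both files rely on the Parquet trailer layout: the thrift footer is followed by a 4-byte |
| little-endian footer length and the trailing 'PAR1' magic. A sketch of how the same |
| corruption could be scripted instead of using hexedit (paths and helper name are assumed): |
| import shutil, struct |
| def corrupt_footer_len(src, dst, new_len): |
|     shutil.copyfile(src, dst) |
|     with open(dst, "r+b") as f: |
|         f.seek(-8, 2)                           # 4-byte length + 4-byte magic |
|         old_len = struct.unpack("<i", f.read(4))[0] |
|         f.seek(-8, 2) |
|         f.write(struct.pack("<i", new_len))     # e.g. file size - 1, or larger than the file |
|     return old_len |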
| |
| hive_single_value_timestamp.parq: |
| Parquet file written by Hive with the following schema: |
| i int, timestamp d |
| Contains a single row. It is used to test IMPALA-7559 which only occurs when all values |
| in a column chunk are the same timestamp and the file is written with parquet-mr (which |
| is used by Hive). |
| |
| out_of_range_time_of_day.parquet: |
| IMPALA-7595: Parquet file that contains timestamps where the time part is out of the |
| valid range [0..24H). Before the fix, select * returned these values: |
| 1970-01-01 -00:00:00.000000001 (invalid - negative time of day) |
| 1970-01-01 00:00:00 |
| 1970-01-01 23:59:59.999999999 |
| 1970-01-01 24:00:00 (invalid - time of day should be less than a whole day) |
| |
| strings_with_quotes.csv: |
| Various strings with quotes in them to reproduce bugs like IMPALA-7586. |
| |
| int64_timestamps_plain.parq: |
| Parquet file generated with Parquet-mr that contains plain encoded int64 columns with |
| Timestamp logical types. Has the following columns: |
| new_logical_milli_utc, new_logical_milli_local, |
| new_logical_micro_utc, new_logical_micro_local |
| |
| int64_timestamps_dict.parq: |
| Parquet file generated with Parquet-mr that contains dictionary encoded int64 columns |
| with Timestamp logical types. Has the following columns: |
| id, |
| new_logical_milli_utc, new_logical_milli_local, |
| new_logical_micro_utc, new_logical_micro_local |
| |
| int64_timestamps_at_dst_changes.parquet: |
| Parquet file generated with Parquet-mr that contains plain encoded int64 columns with |
| Timestamp logical types. The file contains 3 row groups, and all row groups contain |
| 3 distinct values, so there is a "min", a "max", and a "middle" value. The values were |
| selected in such a way that the UTC->CET conversion changes the order of the values (this |
| is possible during Summer->Winter DST change) and "middle" falls outside the "min".."max" |
| range after conversion. This means that a naive stat filtering implementation could drop |
| "middle" incorrectly. |
| Example (all dates are 2017-10-29): |
| UTC: 00:45:00, 01:00:00, 01:10:00 => |
| CET: 02:45:00, 02:00:00, 02:10:00 |
| Columns: rawvalue bigint, rowgroup int, millisutc timestamp, microsutc timestamp |
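| The order inversion can be reproduced with a small zoneinfo sketch (using the |
| 'Europe/Budapest' zone, which observes CET/CEST, as an assumption): |
| from datetime import datetime, timezone |
| from zoneinfo import ZoneInfo |
| cet = ZoneInfo("Europe/Budapest")  # DST ended at 2017-10-29 01:00 UTC |
| for h, m in [(0, 45), (1, 0), (1, 10)]: |
|     ts = datetime(2017, 10, 29, h, m, tzinfo=timezone.utc) |
|     print(ts.time(), "->", ts.astimezone(cet).time()) |
| # 00:45 -> 02:45, 01:00 -> 02:00, 01:10 -> 02:10: after conversion the "middle" value |
| # (02:00) falls outside the converted "min".."max" range, as described above. |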
| |
| int64_timestamps_nano.parquet: |
| Parquet file generated with Parquet-mr that contains int64 columns with nanosecond |
| precision. Tested separately from the micro/millisecond columns because of the different |
| valid range. |
| Columns: rawvalue bigint, nanoutc timestamp, nanononutc timestamp |
| |
| out_of_range_timestamp_hive_211.parquet: |
| Hive-generated file with an out-of-range timestamp. Generated with Hive 2.1.1 using |
| the following query: |
| create table alltypes_hive stored as parquet as |
| select * from functional.alltypes |
| union all |
| select -1, false, 0, 0, 0, 0, 0, 0, '', '', cast('1399-01-01 00:00:00' as timestamp), 0, 0 |
| |
| out_of_range_timestamp2_hive_211.parquet: |
| Hive-generated file with out-of-range timestamps every second value, to exercise code |
| paths in Parquet scanner for non-repeated runs. Generated with Hive 2.1.1 using |
| the following query: |
| create table hive_invalid_timestamps stored as parquet as |
| select id, |
| case id % 3 |
| when 0 then timestamp_col |
| when 1 then NULL |
| when 2 then cast('1300-01-01 9:9:9' as timestamp) |
| end timestamp_col |
| from functional.alltypes |
| sort by id |
| |
| decimal_rtf_tbl.txt: |
| This was generated using formulas in Google Sheets. The goal was to create various |
| decimal values that cover the 3 storage formats with various precision and scale. |
| This is a reasonably large table that is used for testing min-max filters |
| with decimal types on Kudu. |
| |
| decimal_rtf_tiny_tbl.txt: |
| Small table with specific decimal values picked from decimal_rtf_tbl.txt so that |
| min-max filter based pruning can be tested with decimal types on Kudu. |
| |
| date_tbl.orc |
| Small orc table with one DATE column, created by Hive. |
| |
| date_tbl.avro |
| Small avro table with one DATE column, created by Hive. |
| |
| date_tbl.parquet |
| Small parquet table with one DATE column, created by Parquet MR. |
| |
| out_of_range_date.parquet: |
| Generated with a hacked version of Impala parquet writer. |
| Contains a single DATE column with 9 values, 4 of which are out of range |
| and should be read as NULL by Impala: |
| -0001-12-31 (invalid - date too small) |
| 0000-01-01 (invalid - date too small) |
| 0000-01-02 (invalid - date too small) |
| 1969-12-31 |
| 1970-01-01 |
| 1970-01-02 |
| 9999-12-30 |
| 9999-12-31 |
| 10000-01-01 (invalid - date too large) |
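| Assuming Impala's documented DATE range of 0001-01-01..9999-12-31, the validity check can |
| be sketched in terms of days since the epoch (how Parquet stores DATE values): |
| from datetime import date |
| EPOCH_ORDINAL = date(1970, 1, 1).toordinal() |
| MIN_DAYS = date.min.toordinal() - EPOCH_ORDINAL   # 0001-01-01 |
| MAX_DAYS = date.max.toordinal() - EPOCH_ORDINAL   # 9999-12-31 |
| def is_valid_date(days_since_epoch): |
|     return MIN_DAYS <= days_since_epoch <= MAX_DAYS |
| print(is_valid_date(0))             # True: 1970-01-01 |
| print(is_valid_date(MIN_DAYS - 1))  # False: before 0001-01-01 |
| print(is_valid_date(MAX_DAYS + 1))  # False: 10000-01-01 |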
| |
| out_of_range_date.orc: |
| Created using a pre-3.1 Hive version (2.1.1) to contain an out-of-range date value. |
| This takes advantage of an incompatibility between Hive and Impala: dates before |
| 1582-10-15 written by Hive (before 3.1) are interpreted incorrectly by Impala. The values |
| written with Hive: |
| 2019-10-04, 1582-10-15, 0001-01-01, 9999-12-31 |
| This is interpreted by Impala as: |
| 2019-10-04, 1582-10-15, 0000-12-30 (invalid - date too small), 9999-12-31 |
| |
| |
| |
| hive2_pre_gregorian.parquet: |
| Small parquet table with one DATE column, created by Hive 2.1.1. |
| Used to demonstrate parquet interoperability issues between Hive and Impala for dates |
| before the introduction of Gregorian calendar in 1582-10-15. |
| |
| hive2_pre_gregorian.orc: |
| Same as the above but in ORC format instead of Parquet. |
| |
| decimals_1_10.parquet: |
| Contains two decimal columns, one with precision 1, the other with precision 10. |
| I used Hive 2.1.1 with a modified version of Parquet-MR (6901a20) to create tiny, |
| misaligned pages in order to test the value-skipping logic in the Parquet column readers. |
| The modification in Parquet-MR was to set MIN_SLAB_SIZE to 1. You can find the change |
| here: https://github.com/boroknagyz/parquet-mr/tree/tiny_pages |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=5 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table decimals_1_10 (d_1 DECIMAL(1, 0), d_10 DECIMAL(10, 0)) stored as PARQUET |
| insert into decimals_1_10 values (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), |
| (NULL, 1), (2, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (NULL, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (2, 2), (NULL, 3), (4, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (4, 4), (NULL, 5), |
| (NULL, 1), (NULL, 2), (3, 3), (4, 4), (5, 5), |
| (1, 1), (NULL, 2), (3, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (NULL, 4), (NULL, 5), |
| (NULL, 1), (2, 2), (NULL, 3), (NULL, 4), (5, 5), |
| (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); |
| |
| nested_decimals.parquet: |
| Contains two columns, one is a decimal column, the other is an array of decimals. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table nested_decimals (d_38 Decimal(38, 0), arr array<Decimal(1, 0)>) stored as parquet; |
| insert into nested_decimals select 1, array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) ) union all |
| select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all |
| select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all |
| select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all |
| select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all |
| |
| select 1, array(cast (1 as decimal(1,0)) ) union all |
| select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all |
| select 3, array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) ) union all |
| select 4, array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) ) union all |
| select 5, array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) ) union all |
| |
| select 1, array(cast (NULL as decimal(1, 0)), NULL, NULL) union all |
| select 2, array(cast (2 as decimal(1,0)), NULL, NULL) union all |
| select 3, array(cast (3 as decimal(1,0)), NULL, cast (3 as decimal(1,0))) union all |
| select 4, array(NULL, cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), NULL) union all |
| select 5, array(NULL, cast (5 as decimal(1,0)), NULL, NULL, cast (5 as decimal(1,0)) ) union all |
| |
| select 6, array(cast (6 as decimal(1,0)), NULL, cast (6 as decimal(1,0)) ) union all |
| select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), NULL ) union all |
| select 8, array(NULL, NULL, cast (8 as decimal(1,0)) ) union all |
| select 7, array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0)), cast (7 as decimal(1,0)) ) union all |
| select 6, array(NULL, NULL, NULL, cast (6 as decimal(1,0)) ); |
| |
| double_nested_decimals.parquet: |
| Contains two columns, one is a decimal column, the other is an array of arrays of |
| decimals. I used Hive 2.1.1 with a modified Parquet-MR, see description |
| at decimals_1_10.parquet. |
| hive --hiveconf parquet.page.row.count.limit=5 --hiveconf parquet.page.size=16 |
| --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=1 |
| create table double_nested_decimals (d_38 Decimal(38, 0), arr array<array<Decimal(1, 0)>>) stored as parquet; |
| insert into double_nested_decimals select 1, array(array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0)) )) union all |
| select 2, array(array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) )) union all |
| select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0)), cast (3 as decimal(1,0)) )) union all |
| select 4, array(array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0)), cast (4 as decimal(1,0)) )) union all |
| select 5, array(array(cast (5 as decimal(1,0)), cast (5 as decimal(1,0)), cast (5 as decimal(1,0)) )) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))), array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (1 as decimal(1,0))), array(cast (1 as decimal(1,0)), cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (2 as decimal(1,0))) ) union all |
| select 3, array(array(cast (3 as decimal(1,0)), cast (3 as decimal(1,0))), array(cast (3 as decimal(1,0))) ) union all |
| select 4, array(array(cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0)), cast (4 as decimal(1,0))), array(cast (4 as decimal(1,0))) ) union all |
| select 5, array(array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))), array(cast (5 as decimal(1,0))) ) union all |
| |
| select 1, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (1 as decimal(1,0))) ) union all |
| select 2, array(array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))), array(cast (NULL as decimal(1,0))) ) union all |
| select 3, array(array(cast (NULL as decimal(1,0))), array(cast (3 as decimal(1,0))), NULL ) union all |
| select 4, array(NULL, NULL, array(cast (NULL as decimal(1,0)), NULL, NULL, NULL, NULL) ) union all |
| select 5, array(array(NULL, cast (5 as decimal(1,0)), NULL, NULL, NULL) ) union all |
| |
| select 6, array(array(cast (6 as decimal(1,0)), NULL), array(cast (6 as decimal(1,0))) ) union all |
| select 7, array(array(cast (7 as decimal(1,0)), cast (7 as decimal(1,0))), NULL ) union all |
| select 8, array(array(NULL, NULL, cast (8 as decimal(1,0))) ) union all |
| select 7, array(array(cast (7 as decimal(1,0)), cast (NULL as decimal(1,0))), array(cast (7 as decimal(1,0))) ) union all |
| select 6, array(array(NULL, NULL, cast (6 as decimal(1,0))), array(NULL, cast (6 as decimal(1,0))) ); |
| |
| alltypes_tiny_pages.parquet: |
| Created from 'functional.alltypes' with small page sizes. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| I used the following commands to create the file: |
| hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.page.size.row.check.min=7 |
| create table alltypes_tiny_pages stored as parquet as select * from functional_parquet.alltypes |
| |
| alltypes_tiny_pages_plain.parquet: |
| Created from 'functional.alltypes' with small page sizes without dictionary encoding. |
| I used Hive 2.1.1 with a modified Parquet-MR, see description at decimals_1_10.parquet. |
| I used the following commands to create the file: |
| hive --hiveconf parquet.page.row.count.limit=90 --hiveconf parquet.page.size=90 --hiveconf parquet.enable.dictionary=false --hiveconf parquet.page.size.row.check.min=7 |
| create table alltypes_tiny_pages_plain stored as parquet as select * from functional_parquet.alltypes |
| |
| parent_table: |
| Created manually. Contains two columns, an INT and a STRING column. Together they form |
| the primary key for the table. This table is used to test primary key and foreign key |
| relationships along with parent_table_2 and child_table. |
| |
| parent_table_2: |
| Created manually. Contains just one int column, which is also the table's primary key. |
| This table is used to test primary key and foreign key relationships along with |
| parent_table and child_table. |
| |
| child_table: |
| Created manually. Contains four columns. The 'seq' column is the primary key of this |
| table. ('id', 'year') form a foreign key referring to parent_table('id', 'year'), and 'a' |
| is a foreign key referring to parent_table_2's primary key column 'a'. |
| |
| out_of_range_timestamp.orc: |
| Created with Hive. ORC file with a single timestamp column 'ts'. |
| Contains one row (1300-01-01 00:00:00) which is outside Impala's valid time range. |
| |
| corrupt_schema.orc: |
| ORC file from IMPALA-9277, generated by fuzz test. The file contains malformed metadata. |
| |
| corrupt_root_type.orc: |
| ORC file for IMPALA-9249, generated by fuzz test. The root type of the schema is not |
| struct, which used to hit a DCHECK. |
| |
| hll_sketches_from_hive.parquet: |
| This file contains a table that has some string columns to store serialized Apache |
| DataSketches HLL sketches created by Hive. Each column contains a sketch for a |
| specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE, |
| STRING, CHAR and VARCHAR. Has an additional column for NULL values. |
| |
| hll_sketches_from_impala.parquet: |
| This holds the same sketches as hll_sketches_from_hive.parquet but these sketches were |
| created by Impala instead of Hive. |
| |
| cpc_sketches_from_hive.parquet: |
| This file contains a table that has some string columns to store serialized Apache |
| DataSketches CPC sketches created by Hive. Each column contains a sketch for a |
| specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE, |
| STRING, CHAR and VARCHAR. Has an additional column for NULL values. |
| |
| cpc_sketches_from_impala.parquet: |
| This holds the same sketches as cpc_sketches_from_hive.parquet but these sketches were |
| created by Impala instead of Hive. |
| |
| theta_sketches_from_hive.parquet: |
| This file contains a table that has some string columns to store serialized Apache |
| DataSketches Theta sketches created by Hive. Each column contains a sketch for a |
| specific data type. Covers the following types: TINYINT, INT, BIGINT, FLOAT, DOUBLE, |
| STRING, CHAR and VARCHAR. Has an additional column for NULL values. |
| |
| theta_sketches_from_impala.parquet: |
| This holds the same sketches as theta_sketches_from_hive.parquet but these sketches were |
| created by Impala instead of Hive. |
| |
| kll_sketches_from_hive.parquet: |
| This file contains a table that has some string columns to store serialized Apache |
| DataSketches KLL sketches created by Hive. Each column is for a different purpose: |
| - 'f': Float with distinct values. |
| - 'repetitions': Float with some repetition in the values. |
| - 'some_nulls': Float values and some NULLs. |
| - 'all_nulls': All values are NULLs. |
| - 'some_nans': Floats with some NaN values. |
| - 'all_nans': All values are NaNs. |
| |
| kll_sketches_from_impala.parquet: |
| This holds the same sketches as kll_sketches_from_hive.parquet but these sketches were |
| created by Impala instead of Hive. |
| |
| iceberg: |
| IMPALA-9741: Support querying Iceberg table by impala |
| The data was generated with spark-shell (version 2.4.x); there are two tables' data in |
| testdata/data/iceberg_test. An Iceberg table location contains |
| two directories: 'metadata', which contains the table metadata managed |
| by Iceberg, and 'data', which contains the data files. |
| |
| iceberg_test/hadoop_catalog/ice/complextypestbl_iceberg_orc: |
| Iceberg table generated by Hive 3.1 + Iceberg 0.11. Originally it was a HiveCatalog |
| table, so I've renamed the metadata JSON files and added a version-hint.text file. |
| I've also edited the metadata JSON and AVRO files to remove 'hdfs://localhost:20500', |
| and updated the file paths. Now it can be used as a HadoopCatalog table. |
| |
| hudi_parquet: |
| IMPALA-8778: Support read Apache Hudi tables |
| Hudi Parquet is a special format of Parquet files managed by Apache Hudi |
| (hudi.incubator.apache.org) to provide ACID transactions. |
| In order to provide snapshot isolation between writers and queries, |
| Hudi writes a newer version of an existing Parquet file |
| whenever an update arrives for that file. |
| Hudi stores the indexing information and the version information in the file name. |
| For example: |
| `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet` |
| `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0` is the bloom index hash of this file |
| `20200112194517` is the timestamp of this version |
| If a record in this file is updated, Hudi writes a new file with the |
| same indexing hash but a newer version that depends on the time of writing: |
| `ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-38-282_20200112194529.parquet` |
| If the Impala table is refreshed after this file is written, Impala will |
| only query the file with the latest version. |
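| A rough sketch of splitting such a Hudi base file name into its parts (the middle token is |
| Hudi's write token; the helper name here is illustrative): |
| def parse_hudi_file_name(name): |
|     stem = name.rsplit(".", 1)[0]                 # drop ".parquet" |
|     file_id, write_token, commit_time = stem.split("_") |
|     return file_id, write_token, commit_time |
| print(parse_hudi_file_name( |
|     "ca51fa17-681b-4497-85b7-4f68e7a63ee7-0_1-5-10_20200112194517.parquet")) |
| # ('ca51fa17-681b-4497-85b7-4f68e7a63ee7-0', '1-5-10', '20200112194517') |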
| |
| streaming.orc: |
| ORC file generated by Hive Streaming Ingestion. I used a slightly altered version of |
| TestStreaming.testNoBuckets() from Hive 3.1 to generate this file. It contains |
| values coming from two transactions. The file has two stripes (one per transaction). |
| |
| alltypes_non_acid.orc: |
| Non-acid ORC file generated by Hive 3.1 with the following command: |
| CREATE TABLE alltypes_clone STORED AS ORC AS SELECT * FROM functional.alltypes. |
| It's used as an original file in ACID tests. |
| |
| dateless_timestamps.parq: |
| Parquet file generated before the removal of dateless timestamp support. |
| Created as: CREATE TABLE timestamps_pq(t TIMESTAMP) STORED AS parquet; |
| It contains the following lines: |
| 1996-04-22 10:00:00.432100000 |
| 1996-04-22 10:00:00.432100000 |
| 1996-04-22 10:00:00 |
| 1996-04-22 10:00:00 |
| 1996-04-22 00:00:00 |
| 10:00:00.432100000 |
| 10:00:00 |
| |
| full_acid_schema_but_no_acid_version.orc |
| IMPALA-10115: Impala should check file schema as well to check full ACIDv2 files |
| Generated by query-based compaction by Hive 3.1. The file has a full ACIDv2 schema, but |
| doesn't have the file user metadata 'hive.acid.version'. |
| |
| alltypes_empty_pages.parquet |
| Parquet file that contains empty data pages. Needed to test IMPALA-9952. |
| Generated by a modified Impala (git hash e038db44 (between 3.4 and 4.0)). I modified |
| HdfsParquetTableWriter::ShouldStartNewPage() to randomly start a new page: |
| int64_t r = random(); if (r % 2 + r % 3 + r % 5 == 0) return true; |
| Also modified HdfsParquetTableWriter::NewPage() to randomly insert empty pages: |
| if (r ... ) pages_.push_back(DataPage()); |
| |
| alltypes_invalid_pages.parquet |
| Parquet file that contains invalid data pages. Needed to test IMPALA-9952. |
| Generated by a modified Impala (git hash e038db44 (between 3.4 and 4.0)). I modified |
| HdfsParquetTableWriter::ShouldStartNewPage() to randomly start a new page: |
| int64_t r = random(); if (r % 2 + r % 3 + r % 5 == 0) return true; |
| Also modified HdfsParquetTableWriter::BaseColumnWriter::Flush to randomly invalidate |
| the offset index: |
| if (r ... ) location.offset = -1; |
| |
| customer_multiblock_page_index.parquet |
| Parquet file that contains multiple blocks in a single file. Needed to test IMPALA-10310. |
| In order to generate this file, execute the following statements in Beeline |
| (Beeline version 3.1.3000.7.3.1.0-160 by Apache Hive): |
| 1. SET parquet.block.size=8192; // use a small block size |
| 2. SET parquet.page.row.count.limit=10; // a small page row count generates multiple pages |
| 3. CREATE TABLE customer_multiblock_page_index_6 |
| STORED AS PARQUET |
| TBLPROPERTIES('parquet.compression'='SNAPPY') |
| AS SELECT * FROM tpcds.customer |
| WHERE c_current_cdemo_sk IS NOT NULL |
| ORDER BY c_current_cdemo_sk, c_customer_sk |
| LIMIT 2000; |
| The generated file contains multiple blocks, with multiple pages per block. |
| |
| customer_nested_multiblock_multipage.parquet |
| Parquet file that contains multiple row groups and multiple pages, and stores nested |
| data. |
| Used Hive (version 3.1.3000.7.2.16.0-233) to generate Parquet file: |
| 1. SET parquet.block.size=8192; |
| 2. SET parquet.page.row.count.limit=20; |
| 3. CREATE TABLE customer_nested_multiblock_multipage |
| LIKE tpch_nested_parquet.customer STORED AS PARQUET; |
| 4. INSERT INTO customer_nested_multiblock_multipage |
| SELECT * FROM tpch_nested_parquet.customer ORDER BY c_custkey LIMIT 300; |
| |
| IMPALA-10361: Use field id to resolve columns for Iceberg tables |
| The data was generated with spark-shell (version 2.4.x); the table data is in |
| testdata/data/iceberg_test/hadoop_catalog/iceberg_resolution_test. This table |
| was generated with HadoopCatalog and the Parquet file format. We use this table to |
| test field id resolution for complex types. |
| |
| alltypes_tiny_rle_dictionary.parquet: |
| Tiny file using the RLE_DICTIONARY encoding. |
| Started impala with -write_new_parquet_dictionary_encodings=true |
| set max_fs_writers=1; |
| create table att stored as parquet as |
| select * from functional_parquet.alltypestiny; |
| |
| timestamp_with_local_timezone.orc: |
| ORC file that contains a column with type 'timestamp with local timezone'. |
| Generated by Spark/Iceberg. |
| |
| parquet-bloom-filtering.parquet: |
| Generated by hacking |
| https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestBloomFiltering.java |
| (65b95fb72be8f5a8a193a6f7bc4560fdcd742fc7). |
| The schema was completely changed to allow us to test types supported in Parquet Bloom |
| filters. |
| |
| ComplexTypesTbl/arrays.orc and arrays.parq |
| These tables hold 3 columns: an int ID and two arrays, one of ints and one of strings. |
| The purpose of introducing these tables is to give more test coverage for zipping |
| unnests. There are rows where the 2 arrays are of the same length, rows where one of them |
| is longer than the other, plus there are NULL and empty arrays as well. |
| |
| binary_decimal_precision_and_scale_widening.parquet |
| Parquet file written with schema (decimal(9,2), decimal(18,2), decimal(38,2)). The rows |
| inside the file are carefully chosen so that they don't cause an overflow when being read |
| by an Impala table with a higher precision/scale. |
| |
| iceberg_test/hadoop_catalog/ice/airports_parquet: |
| Regular Parquet table converted to Iceberg, which means that the data file doesn't contain |
| field ids. |
| |
| iceberg_test/hadoop_catalog/ice/airports_orc: |
| Regular ORC table converted to Iceberg, which means that the data file doesn't contain |
| field ids. |
| |
| too_many_def_levels.parquet: |
| Written by Impala 2.5. The Parquet pages have more encoded def levels than num_values. |
| |
| partition_col_in_parquet.parquet: |
| Written by Impala 4.0. Parquet file with INT and DATE column. Values in the DATE columns |
| are identical. There's only a single value per page in the Parquet file (written by |
| setting query option 'parquet_page_row_count_limit' to 1). |
| |
| no_scale.parquet |
| Generated by code injection that removed the scale from written Parquet files: |
| Status HdfsParquetTableWriter::WriteFileFooter() { |
| + file_metadata_.schema[1].__set_scale(1); |
| + file_metadata_.schema[1].__isset.scale = false; |
| + file_metadata_.schema[1].logicalType.DECIMAL.scale = 1; |
| + file_metadata_.schema[1].logicalType.__isset.DECIMAL = false; |
| + file_metadata_.schema[1].__isset.logicalType = false; |
| create table my_decimal_tbl (d1 decimal(4,2)) stored as parquet; |
| insert into my_decimal_tbl values (cast(0 as decimal(4,2))); |
| |
| iceberg_test/hadoop_catalog/ice/alltypes_part: |
| iceberg_test/hadoop_catalog/ice/alltypes_part_orc: |
| Generated by Hive 3.1 + Iceberg 0.11. Then the JSON and AVRO files were manually edited |
| to make these tables correspond to an Iceberg table in a HadoopCatalog instead of |
| HiveCatalog. |
| alltypes_part has PARQUET data files, alltypes_part_orc has ORC data files. They have |
| identity partitions with all the supported data types. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution: |
| iceberg_test/hadoop_catalog/ice/iceberg_legacy_partition_schema_evolution_orc: |
| Generated by Hive 3.1 + Iceberg 0.11. Then the JSON and AVRO files were manually edited |
| to make these tables correspond to an Iceberg table in a HadoopCatalog instead of |
| HiveCatalog. |
| iceberg_legacy_partition_schema_evolution has PARQUET data files, |
| iceberg_legacy_partition_schema_evolution_orc has ORC data files. |
| The tables have had the following schema changes since table migration: |
| * Partition INT column to BIGINT |
| * Partition FLOAT column to DOUBLE |
| * Partition DECIMAL(5,3) column to DECIMAL(8,3) |
| * Non-partition column has been moved to end of the schema |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_timestamp_part: |
| Written by Hive, contains Iceberg TIMESTAMP type and the table is partitioned by HOUR(ts). |
| create table iceberg_timestamp_part (i int, ts timestamp) partitioned by spec (hour(ts)) stored by iceberg; |
| insert into iceberg_timestamp_part values (-2, '1969-01-01 01:00:00'), (-1, '1969-01-01 01:15:00'), (1, '2021-10-31 02:15:00'), (2, '2021-01-10 12:00:00'), (3, '2022-04-11 00:04:00'), (4, '2022-04-11 12:04:55'); |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_timestamptz_part: |
| Written by Hive, contains Iceberg TIMESTAMPTZ type and the table is partitioned by |
| HOUR(ts). The local timezone was 'Europe/Budapest'; |
| create table iceberg_timestamptz_part (i int, ts timestamp with local time zone) partitioned by spec (hour(ts)) stored by iceberg; |
| insert into iceberg_timestamptz_part values (-2, '1969-01-01 01:00:00'), (-1, '1969-01-01 01:15:00'), (0, '2021-10-31 00:15:00 UTC'), (1, '2021-10-31 01:15:00 UTC'), (2, '2021-01-10 12:00:00'), (3, '2022-04-11 00:04:00'), (4, '2022-04-11 12:04:55'); |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_uppercase_col: |
| Generated by Impala, then modified the metadata.json file to contain uppercase characters. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_positional: |
| Generated by Spark 3.2 + Iceberg 0.13. Then the JSON and AVRO files were manually edited |
| to make these tables correspond to an Iceberg table in a HadoopCatalog instead of |
| HiveCatalog. |
| The table has a positional delete file. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_equality: |
| Since Hive/Spark are unable to write equality delete files, Flink was used to create |
| an equality delete file by overwriting an existing unique row with the following |
| statements: |
| CREATE TABLE `ssb`.`ssb_default`.`iceberg_v2_delete_equality` ( |
| `id` BIGINT UNIQUE COMMENT 'unique id', |
| `data` STRING NOT NULL, |
| PRIMARY KEY(`id`) NOT ENFORCED |
| ) with ('format-version'='2', |
| 'write.upsert.enabled'='true', |
| 'connector' = 'iceberg', |
| 'catalog-database' = 'test_db', |
| 'catalog-type' = 'hive', |
| 'catalog-name' = 'iceberg_hive_catalog', |
| 'hive-conf-dir' = '/etc/hive/conf', |
| 'engine.hive.enabled' = 'true' |
| ); |
| insert into iceberg_v2_delete_equality values (1, 'test_1_base'); |
| insert into iceberg_v2_delete_equality values (2, 'test_2_base'); |
| insert into iceberg_v2_delete_equality values (2, 'test_2_updated'); |
| This table was created with HDFS absolute paths, which were replaced with the script |
| specified in `iceberg_test/hadoop_catalog/ice/iceberg_v2_no_deletes`. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_equality_nulls: |
| This table has an equality delete file that contains NULL value. Created by a hacked |
| Impala where IcebergCatalogOpExecutor is changed to write equality delete metadata |
| instead of position delete metadata when running a DELETE FROM statement. In a second |
| step the underlying delete file was replaced by another parquet file with the desired |
| content. |
| The content: |
| 1: insert into functional_parquet.iceberg_v2_delete_equality_nulls values (1, "str1"), (null, "str2"), (3, "str3"); |
| 2: EQ-delete file for the first column with values: (null), (3) |
| 3: insert into functional_parquet.iceberg_v2_delete_equality_nulls values (4, "str4"), (null, "str5"); |
| As a result, 2 values (including the row with null) will be dropped from the first data |
| file, while there is another data file containing a null value that has a greater data |
| sequence number than the delete file. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_both_eq_and_pos: |
| This table is created by Flink with 2 columns as primary key. Some data and equality |
| delete files were added by Flink, and then Impala was used for dropping a row by writing |
| a positional delete file. |
| Steps: |
| 1-Flink: |
| create table ice.iceberg_v2_delete_both_eq_and_pos |
| (i int, s string, d date, primary key (i, d) not enforced) |
| with ('format-version'='2', 'write.upsert.enabled'='true'); |
| 2: Flink: |
| insert into ice.iceberg_v2_delete_both_eq_and_pos values |
| (1, 'str1', to_date('2023-12-13')), |
| (2, 'str2', to_date('2023-12-13')); |
| 3-Flink: |
| insert into ice.iceberg_v2_delete_both_eq_and_pos values |
| (3, 'str3', to_date('2023-12-23')), |
| (2, 'str2_updated', to_date('2023-12-13')); |
| 4-Impala: delete from functional_parquet.iceberg_v2_delete_both_eq_and_pos where i = 1; |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_equality_partitioned: |
| Flink is used for creating this test table. The statements executed are the following: |
| 1: Create the table with the partition column part of the primary key. This is enforced |
| by Flink: |
| create table ice.iceberg_v2_delete_equality_partitioned |
| (i int, s string, d date, primary key (d, s) not enforced) |
| partitioned by (d) |
| with ('format-version'='2', 'write.upsert.enabled'='true'); |
| 2: Populate one partition. |
| insert into ice.iceberg_v2_delete_equality_partitioned partition (d='2023-12-24') values |
| (1, 'str1'), (2, 'str2'), (3, 'str3'); |
| 3: Populate another partition |
| insert into ice.iceberg_v2_delete_equality_partitioned partition (d='2023-12-25') values |
| (1, 'str1'), (2, 'str2'); |
| 4: Update one row in the first partition and add a new row. |
| insert into ice.iceberg_v2_delete_equality_partitioned partition (d='2023-12-24') values |
| (333333, 'str3'), (4, 'str4'); |
| 5: Update one row in the second partition. |
| insert into ice.iceberg_v2_delete_equality_partitioned partition (d='2023-12-25') values |
| (222, 'str2'); |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_equality_partition_evolution: |
| Flink is used to create and populate this simple table to have an equality delete file. |
| Impala is used for doing some partition evolution on this table. |
| 1-Flink: |
| create table ice.iceberg_v2_delete_equality_partition_evolution |
| (i int, s string, d date, primary key (d, s) not enforced) |
| partitioned by (d) |
| with ('format-version'='2', 'write.upsert.enabled'='true'); |
| 2-Flink: |
| insert into ice.iceberg_v2_delete_equality_partition_evolution |
| partition (d='2023-12-24') values (1, 'str1'), (2, 'str2'); |
| 3-Flink: |
| insert into ice.iceberg_v2_delete_equality_partition_evolution |
| partition (d='2023-12-24') values (111, 'str1'); |
| 4-Impala: |
| alter table functional_parquet.iceberg_v2_delete_equality_partition_evolution |
| set partition spec (d, i); |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_equality_multi_eq_ids: |
| Used Impala and Nifi to create a table that has equality delete files with different |
| equality field ID lists. Steps: |
| 1-Impala: |
| create table functional_parquet.iceberg_v2_delete_equality_multi_eq_ids |
| (i int not null, s string not null, primary key(i)) |
| STORED AS ICEBERG |
| TBLPROPERTIES('iceberg.catalog'='hadoop.catalog', |
| 'iceberg.catalog_location'='/test-warehouse/iceberg_test/hadoop_catalog', |
| 'iceberg.table_identifier'='ice.iceberg_v2_delete_equality_multi_eq_ids', |
| 'format-version'='2'); |
| 2-Nifi: Insert 3 rows in one data file: |
| (1, "str1"), (2, "str2"), (3, "str3") |
| 3-Nifi: Update a row using column 'i' as PK: |
| (3, "str3_updated") |
| 4: Manually edited 'identifier-field-ids' from [1] to [2] |
| 5-Nifi: In one step insert new rows and update existing ones using column 's' as PK: |
| Insert (4, "str4"), (5, "str5") |
| Update (2222, "str2"), (3333, "str3_updated") |
| 6: Manually edited 'identifier-field-ids' from [2] to [1,2] |
| 7: Update rows using columns [i,s] as PK: |
| (4, "str4") -> (4, "str4_updated"), (3333, "str3_updated") -> (33, "str3_updated_twice") |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_delete_pos_and_multi_eq_ids: |
| Used Flink and Impala to create a table that has both positional and equality delete |
| files where some of the equality deletes have different equality field IDs. |
| 1-Flink: |
| create table hadoop_catalog.ice.iceberg_v2_delete_pos_and_multi_eq_ids |
| (i int not null, s string not null, d date not null, primary key(i,s) not enforced) |
| with ('format-version'='2', 'write.upsert.enabled'='true'); |
| 2-Flink: |
| insert into hadoop_catalog.ice.iceberg_v2_delete_pos_and_multi_eq_ids values |
| (1, 'str1', to_date('2024-01-23')), |
| (2, 'str2', to_date('2024-01-24')), |
| (3, 'str3', to_date('2024-01-25')); |
| 3-Flink: |
| insert into hadoop_catalog.ice.iceberg_v2_delete_pos_and_multi_eq_ids values |
| (1, 'str1', to_date('2020-12-01')), |
| (4, 'str4', to_date('2024-01-26')); |
| 4-Impala: |
| delete from functional_parquet.iceberg_v2_delete_pos_and_multi_eq_ids where s = 'str2'; |
| 5: Manually edited 'identifier-field-ids' from [1,2] to [3,2]. |
| 6: Restarted Flink to forget the table metadata. |
| 7-Flink: |
| insert into hadoop_catalog.ice.iceberg_v2_delete_pos_and_multi_eq_ids values |
| (333333, 'str3', to_date('2024-01-25')), |
| (5, 'str5', to_date('2024-01-27')); |
| |
| iceberg_test/iceberg_migrated_alter_test |
| Generated and migrated by Hive |
| CREATE TABLE iceberg_migrated_alter_test (int_col int, string_col string, double_col double) stored as parquet; |
| insert into table iceberg_migrated_alter_test values (0, "A", 0.5), (1, "B", 1.5), (2, "C", 2.5); |
| ALTER TABLE iceberg_migrated_alter_test SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); |
| Then extracted from HDFS and modified so it can be loaded as an external Hadoop table. |
| |
| iceberg_test/iceberg_migrated_alter_test_orc |
| Generated and migrated by Hive |
| CREATE TABLE iceberg_migrated_alter_test_orc (int_col int, string_col string, double_col double) stored as orc; |
| insert into table iceberg_migrated_alter_test_orc values (0, "A", 0.5), (1, "B", 1.5), (2, "C", 2.5); |
| ALTER TABLE iceberg_migrated_alter_test_orc SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); |
| Then extracted from HDFS and modified so it can be loaded as an external Hadoop table. |
| |
| iceberg_test/iceberg_migrated_complex_test |
| Generated and migrated by Hive |
| CREATE TABLE iceberg_migrated_complex_test (struct_1_col struct<int_array_col: array<int>, string_col: string, bool_int_map_col: map<boolean, int>>, int_bigint_map_col map<int, bigint>, struct_2_col struct<struct_3_col: struct<float_col: float, string_double_map_col: map<string, double>, bigint_array_col: array<bigint>>, int_int_map_col: map<int, int>>) stored as parquet; |
| insert into table iceberg_migrated_complex_test values (named_struct("int_array_col", array(0), "string_col", "A", "bool_int_map_col", map(True, 1 )), map(2,CAST(3 as bigint)), named_struct("struct_3_col", named_struct("float_col", cast(0.5 as float), "string_double_map_col", map("B", cast(1.5 as double)), "bigint_array_col", array(cast(4 as bigint))), "int_int_map_col", map(5,6))); |
| ALTER TABLE iceberg_migrated_complex_test SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); |
| Then extracted from HDFS and modified so it can be loaded as an external Hadoop table. |
| |
| iceberg_test/iceberg_migrated_complex_test_orc |
| Generated and migrated by Hive |
| CREATE TABLE iceberg_migrated_complex_test_orc (struct_1_col struct<int_array_col: array<int>, string_col: string, bool_int_map_col: map<boolean, int>>, int_bigint_map_col map<int, bigint>, struct_2_col struct<struct_3_col: struct<float_col: float, string_double_map_col: map<string, double>, bigint_array_col: array<bigint>>, int_int_map_col: map<int, int>>) stored as orc; |
| insert into table iceberg_migrated_complex_test_orc values (named_struct("int_array_col", array(0), "string_col", "A", "bool_int_map_col", map(True, 1 )), map(2,CAST(3 as bigint)), named_struct("struct_3_col", named_struct("float_col", cast(0.5 as float), "string_double_map_col", map("B", cast(1.5 as double)), "bigint_array_col", array(cast(4 as bigint))), "int_int_map_col", map(5,6))); |
| ALTER TABLE iceberg_migrated_complex_test_orc SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'); |
| Then extracted from HDFS and modified so it can be loaded as an external Hadoop table. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_v2_no_deletes: |
| Created by Hive 3.1.3000.2022.0.10.0-49 r5cd8759d0df2ecbfb788b7f4ee0edce6022ee459 |
| create table iceberg_v2_no_deletes (i int, s string) |
| stored by iceberg |
| location '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_no_deletes' |
| tblproperties ('format-version'='2'); |
| insert into iceberg_v2_no_deletes values (1, 'x'), (2, 'y'), (3, 'z'); |
| (setting the table location is important because it makes the conversion easier later). |
| Then saved the contents from HDFS to local ${IMPALA_HOME}/testdata/data/iceberg_test/hadoop_catalog/ice |
| And converted the HiveCatalog metadata to HadoopCatalog metadata via the following scripts: |
| convert_to_iceberg.sh: |
| #!/bin/bash |
| |
| i=0 |
| for f in *.json; do |
| i=$((i+1)) |
| sed -i 's|hdfs://localhost:20500/test-warehouse/|/test-warehouse/|' $f |
| mv $f v${i}.metadata.json |
| done |
| echo ${i} > version-hint.text |
| |
| for f in *.avro; do |
| avro_iceberg_convert.sh $f 'hdfs://localhost:20500/test-warehouse/' '/test-warehouse/' |
| mv ${f}_mod $f |
| done |
| |
| rm *.avro_json |
| |
| avro_iceberg_convert.sh: |
| #!/bin/bash |
| |
| # Usage: avro_iceberg_convert.sh <source avro> <search string> <replace string> |
| # Example: |
| SOURCE_FILE=$1 |
| SEARCH_STR=$2 |
| REPLACE_STR=$3 |
| TMP_JSON=$1_json |
| DST_AVRO=$1_mod |
| AVRO_TOOLS="/path/to/avro-tools-1.11.0.jar" |
| |
| if [ ! -f "$AVRO_TOOLS" ]; then |
| echo "Can't find $AVRO_TOOLS." |
| exit |
| fi |
| if [ ! -f $SOURCE_FILE ]; then |
| echo "Can't find source file: $SOURCE_FILE!" |
| exit |
| fi |
| # Transform avro to json: |
| java -jar $AVRO_TOOLS tojson --pretty $SOURCE_FILE > $TMP_JSON |
| # Replace search string with replace string |
| sed --in-place "s|$SEARCH_STR|$REPLACE_STR|g" $TMP_JSON |
| # Convert the file back to avro |
| SCHEMA=`java -jar $AVRO_TOOLS getschema $SOURCE_FILE` |
| java -jar $AVRO_TOOLS fromjson $TMP_JSON --schema "$SCHEMA" > $DST_AVRO |
| |
| These updates modify the manifest files, so their length probably changes. Snapshot files |
| store the length of the manifest files, so that needs to be changed as well. If a snapshot |
| only stores one manifest: |
| |
| avro_iceberg_convert.sh FILE "\"manifest_length.*$" "\"manifest_length\" : SIZE," |
| mv FILE_mod FILE |
| |
| If a snapshot has multiple manifest files, then you need to manually change the previously |
| generated *_json files and transform them back using Avro tools (the last step of |
| avro_iceberg_convert.sh), or use testdata/bin/rewrite-iceberg-metadata.py. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_with_key_metadata: |
| Created by the following steps: |
| - saved the HDFS directory of 'iceberg_v2_no_deletes' to local |
| ${IMPALA_HOME}/testdata/data/iceberg_test/hadoop_catalog/ice |
| - converted the avro manifest file to json |
| - manually replaced the 'null' value for "key_metadata" with "{"bytes" : |
| "binary_key_metadata"}" |
| - converted the modified json file back to avro. |
| - adjusted the length of manifest file in the avro snapshot file |
| |
| The commands for converting the avro file to json and back are listed under |
| 'iceberg_v2_no_deletes' in the script avro_iceberg_convert.sh. Adjusting the length is |
| described after the script. |
| |
| iceberg_v2_partitioned_position_deletes: |
| iceberg_v2_partitioned_position_deletes_orc: |
| Created similarly to iceberg_v2_no_deletes. |
| Hive> create table iceberg_v2_partitioned_position_deletes ( |
| id int, `user` string, action string, event_time timestamp) |
| partitioned by spec (action) |
| STORED BY ICEBERG |
| location '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_partitioned_position_deletes' |
| tblproperties ('format-version'='2', 'write.format.default'='parquet'); |
| Impala> insert into iceberg_v2_partitioned_position_deletes select * from functional_parquet.iceberg_partitioned; |
| Hive> delete from iceberg_v2_partitioned_position_deletes where id % 2 = 1; |
| ORC table is similarly created. |
| |
| |
| iceberg_v2_positional_delete_all_rows: |
| iceberg_v2_positional_delete_all_rows_orc: |
| Created similarly to iceberg_v2_no_deletes. |
| Hive> create table iceberg_v2_positional_delete_all_rows_orc ( |
| i int, s string) |
| stored by iceberg |
| location '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_delete_all_rows_orc' |
| tblproperties ('format-version'='2', 'write.format.default'='orc'); |
| Hive> insert into iceberg_v2_positional_delete_all_rows_orc values (1, 'x'), (2, 'y'), (3, 'z'); |
| Hive> delete from iceberg_v2_positional_delete_all_rows_orc; |
| |
| iceberg_v2_positional_not_all_data_files_have_delete_files: |
| iceberg_v2_positional_not_all_data_files_have_delete_files_orc: |
| Created similarly to iceberg_v2_no_deletes. |
| create table iceberg_v2_positional_not_all_data_files_have_delete_files_orc (i int, s string) |
| stored by iceberg |
| location '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_not_all_data_files_have_delete_files_orc' |
| tblproperties ('format-version'='2', 'write.format.default'='orc'); |
| insert into iceberg_v2_positional_not_all_data_files_have_delete_files_orc values (1,'a'), (2,'b'), (3,'c'); |
| insert into iceberg_v2_positional_not_all_data_files_have_delete_files_orc values (4,'d'), (5,'e'), (6,'f'); |
| insert into iceberg_v2_positional_not_all_data_files_have_delete_files_orc values (7,'g'), (8,'h'), (9,'i'); |
| update iceberg_v2_positional_not_all_data_files_have_delete_files_orc set s='X' where i = 5; |
| delete from iceberg_v2_positional_not_all_data_files_have_delete_files_orc where i > 6; |
| |
| iceberg_v2_positional_update_all_rows: |
| Created similarly to iceberg_v2_no_deletes |
| create table iceberg_v2_positional_update_all_rows (i int, s string) |
| stored by iceberg |
| location '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_v2_positional_update_all_rows' |
| tblproperties ('format-version'='2'); |
| insert into iceberg_v2_positional_update_all_rows values (1,'a'), (2,'b'), (3,'c') |
| update iceberg_v2_positional_update_all_rows set s = upper(s); |
| |
| iceberg_with_puffin_stats: |
| Created similarly to iceberg_v2_no_deletes. |
| With Impala: |
| create table iceberg_with_puffin_stats(i INT, d DECIMAL(9, 0)) stored as iceberg; |
| insert into iceberg_with_puffin_stats values (1, 1), (2, 2); |
| With Trino: |
| use iceberg.default; |
| analyze iceberg_with_puffin_stats; |
| And then converted the table with 'convert_to_iceberg.sh' and 'avro_iceberg_convert.sh' as |
| described under the 'iceberg_v2_no_deletes' section. |
| |
| iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations*: |
| - 'iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations' |
| - 'iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data' |
| - 'iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data01' |
| - 'iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data02' |
| Generated with the Iceberg Java API version 0.13.2 (documented at |
| https://iceberg.apache.org/docs/latest/api/); a sketch of the property switching used below |
| is shown after the steps. |
| Step 1, create the Iceberg table 'iceberg_multiple_storage_locations' whose location is |
| 'iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations': |
| 'col_name','data_type' |
| col_int,int |
| col_bigint,bigint |
| col_float,float |
| col_double,double |
| col_string,string |
| col_timestamp,timestamp |
| col_date,date |
| Partition spec ('col_name','transform_type'): |
| col_int,IDENTITY |
| Step 2, set the table property 'write.data.path' to '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data' and insert 3 records: |
| 0,12345678900,3.1400001049,2.7182,'a',1970-01-01 00:00:00,1974-02-09 |
| 0,12345678901,3.1400001049,2.71821,'b',1970-01-01 00:00:00,1974-02-09 |
| 1,12345678902,3.1400001049,2.71822,'c',1970-01-01 00:00:00,1974-02-09 |
| Step 3, update the table property 'write.data.path' to '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data01' and insert 3 records: |
| 1,12345678900,3.1400001049,2.7182,'a',1970-01-01 00:00:00,1974-02-09 |
| 1,12345678901,3.1400001049,2.71821,'b',1970-01-01 00:00:00,1974-02-09 |
| 2,12345678902,3.1400001049,2.71822,'c',1970-01-01 00:00:00,1974-02-09 |
| Step 4, update the table property 'write.data.path' to '/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_multiple_storage_locations_data02' and insert 3 records: |
| 2,12345678900,3.1400001049,2.7182,'a',1970-01-01 00:00:00,1974-02-09 |
| 2,12345678901,3.1400001049,2.71821,'b',1970-01-01 00:00:00,1974-02-09 |
| 0,12345678902,3.1400001049,2.71822,'c',1970-01-01 00:00:00,1974-02-09 |
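| |
| The 'write.data.path' switching in steps 2-4 can be done with the Iceberg Java API roughly |
| as in the sketch below. This is only an illustration: the class name is made up, the catalog |
| address mirrors the local test cluster used elsewhere in this file, and the inserts |
| themselves (done through the API's data writers) are omitted. |
| |
| import org.apache.hadoop.conf.Configuration; |
| import org.apache.iceberg.Table; |
| import org.apache.iceberg.catalog.TableIdentifier; |
| import org.apache.iceberg.hadoop.HadoopCatalog; |
| |
| public class SwitchWriteDataPath { |
|   public static void main(String[] args) { |
|     // Assumed warehouse location of the hadoop catalog used by these tables. |
|     HadoopCatalog catalog = new HadoopCatalog(new Configuration(), |
|         "hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog"); |
|     Table table = catalog.loadTable( |
|         TableIdentifier.of("ice", "iceberg_multiple_storage_locations")); |
|     // Point newly written data files at a different directory before the next insert. |
|     table.updateProperties() |
|         .set("write.data.path", "/test-warehouse/iceberg_test/hadoop_catalog/ice/" |
|             + "iceberg_multiple_storage_locations_data01") |
|         .commit(); |
|     // Rows appended after this commit land under ..._data01; repeating the call with |
|     // ..._data02 before the last insert yields the third data directory. |
|   } |
| } |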
| |
| iceberg_test/iceberg_migrated_alter_test_orc: |
| Generated by Hive |
| create table iceberg_mixed_file_format_test (i int, s string, d double) stored by iceberg; |
| insert into iceberg_mixed_file_format_test values (1, "A", 0.5); |
| alter table iceberg_mixed_file_format_test set tblproperties("write.format.default"="orc"); |
| insert into iceberg_mixed_file_format_test values (2, "B", 1.5); |
| alter table iceberg_mixed_file_format_test set tblproperties("write.format.default"="parquet"); |
| insert into iceberg_mixed_file_format_test values (3, "C", 2.5); |
| alter table iceberg_mixed_file_format_test set tblproperties("write.format.default"="orc"); |
| insert into iceberg_mixed_file_format_test values (4, "D", 3.5); |
| Converted similarly to iceberg_v2_no_deletes |
| |
| create_table_like_parquet_test.parquet: |
| Generated by Hive |
| create table iceberg_create_table_like_parquet_test (col_int int, col_float float, col_double double, col_string string, col_struct struct<col_int:int, col_float:float>, col_array array<string>, col_map map<string,array<int>>) stored as parquet; |
| insert into iceberg_create_table_like_parquet_test values (0, 1.0, 2.0, "3", named_struct("col_int", 4, "col_float", cast(5.0 as float)), array("6","7","8"), map("A", array(11,12), "B", array(21,22))); |
| |
| iceberg_lineitem_multiblock: |
| Generated by Hive, see testdata/LineItemMultiBlock/README.dox for more details |
| set parquet.block.size=4086; |
| create table functional_parquet.iceberg_lineitem_multiblock like tpch.lineitem stored by iceberg location 'hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_lineitem_multiblock' tblproperties('format-version'='2'); |
| insert into functional_parquet.iceberg_lineitem_multiblock select * from tpch.lineitem limit 20000; |
| Delete by Impala |
| delete from functional_parquet.iceberg_lineitem_multiblock where l_linenumber%5=0; |
| Then saved the contents from HDFS to the local |
| ${IMPALA_HOME}/testdata/data/iceberg_test/hadoop_catalog/ice directory, converted the |
| HiveCatalog metadata to HadoopCatalog metadata via the scripts described under |
| 'iceberg_v2_no_deletes', and rewrote the metadata to the correct lengths with |
| testdata/bin/rewrite-iceberg-metadata.py "" testdata/data/iceberg_test/hadoop_catalog/ice/iceberg_lineitem_multiblock/metadata/ |
| |
| iceberg_spark_compaction_with_dangling_delete: |
| 1) Create an Iceberg table with Impala and insert some rows. |
| create table functional_parquet.iceberg_spark_compaction_with_dangling_delete (id int, j bigint) |
| STORED AS ICEBERG |
| TBLPROPERTIES('iceberg.catalog'='hadoop.catalog', |
| 'iceberg.catalog_location'='/test-warehouse/iceberg_test/hadoop_catalog', |
| 'iceberg.table_identifier'='ice.iceberg_spark_compaction_with_dangling_delete', |
| 'format-version'='2'); |
| insert into functional_parquet.iceberg_spark_compaction_with_dangling_delete values |
| (1, 10), (2, 20), (3, 30), (4, 40), (5, 50); |
| 2) Update one field of a row with Impala. This adds a new data file and a new delete file to the table. |
| update functional_parquet.iceberg_spark_compaction_with_dangling_delete set j = -100 where id = 4; |
| 3) Delete with Impala the same row that was updated in step 2). This adds another delete file. |
| delete from functional_parquet.iceberg_spark_compaction_with_dangling_delete where id = 4; |
| 4) Run compaction on the table with Spark. |
| spark.sql(s"CALL hadoop_catalog.system.rewrite_data_files(table => 'ice.iceberg_spark_compaction_with_dangling_delete', options => map('min-input-files','2') )") |
| |
| iceberg_v2_equality_delete_schema_evolution: |
| 1: Create and populate an Iceberg table with primary keys with Impala: |
| create table functional_parquet.iceberg_v2_equality_delete_schema_evolution |
| (i int not null, d date not null, s string, primary key(i, d) not enforced) |
| PARTITIONED BY SPEC (d) |
| STORED AS ICEBERG |
| TBLPROPERTIES('iceberg.catalog'='hadoop.catalog', |
| 'iceberg.catalog_location'='/test-warehouse/iceberg_test/hadoop_catalog', |
| 'iceberg.table_identifier'='ice.iceberg_v2_equality_delete_schema_evolution', |
| 'format-version'='2'); |
| insert into functional_parquet.iceberg_v2_equality_delete_schema_evolution values |
| (1, "2024-03-20", "str1"), |
| (2, "2024-03-20", "str2"), |
| (3, "2024-03-21", "str3"), |
| (4, "2024-03-21", "str4"), |
| (5, "2024-03-22", "str5"); |
| 2: Delete the rows where i=2 and i=3 with NiFi (this writes equality delete files). |
| 3: Do some schema evolution on the table with Impala: |
| alter table functional_parquet.iceberg_v2_equality_delete_schema_evolution change s str string; |
| alter table functional_parquet.iceberg_v2_equality_delete_schema_evolution add column j int; |
| 4: Update the row where i=4 with NiFi to the following: |
| (44, 2024-03-21, "str4", 4444) |
| |
| iceberg_v2_null_delete_record: |
| 1) Created the table via Impala and added some records to it. |
| CREATE TABLE iceberg_v2_null_delete_record(i INT, j INT) |
| STORED BY ICEBERG; |
| INSERT INTO iceberg_v2_null_delete_record VALUES (1,1), (2,2), (3,3), (4,4); |
| INSERT INTO iceberg_v2_null_delete_record VALUES (1,1), (2,2), (3,3), (4,4); |
| |
| (We need at least 2 data files to use DIRECTED mode in KrpcDataStreamSender) |
| |
| 2) Created the following temporary table: |
| CREATE TABLE iceberg_v2_null_delete_record_pos_delete (file_path STRING, pos BIGINT) |
| STORED BY ICEBERG; |
| |
| Manually rewrote the metadata JSON file of this table so that the schema elements have the |
| following field ids (the schema appears in two places in the metadata and both occurrences |
| had to be modified): |
| |
| file_path : 2147483546 |
| pos : 2147483545 |
| |
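| These are the field ids Iceberg reserves for the file_path and pos columns of position |
| delete files. Purely as an illustration of the change (the actual edit was done by hand), |
| a Jackson-based rewrite could look like the hypothetical sketch below, with a placeholder |
| metadata file name: |
| |
| import java.io.File; |
| import com.fasterxml.jackson.databind.JsonNode; |
| import com.fasterxml.jackson.databind.ObjectMapper; |
| import com.fasterxml.jackson.databind.node.ObjectNode; |
| |
| public class RewritePosDeleteFieldIds { |
|   public static void main(String[] args) throws Exception { |
|     File metadata = new File("v2.metadata.json");  // placeholder metadata file |
|     ObjectMapper mapper = new ObjectMapper(); |
|     JsonNode root = mapper.readTree(metadata); |
|     // The schema occurs in more than one place; patch every field list that is found. |
|     for (JsonNode schema : root.findParents("fields")) { |
|       for (JsonNode field : schema.get("fields")) { |
|         ObjectNode f = (ObjectNode) field; |
|         if (f.path("name").asText().equals("file_path")) f.put("id", 2147483546); |
|         if (f.path("name").asText().equals("pos")) f.put("id", 2147483545); |
|       } |
|     } |
|     mapper.writerWithDefaultPrettyPrinter().writeValue(metadata, root); |
|   } |
| } |
| |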
| 3) Inserted data files into iceberg_v2_null_delete_record_pos_delete: |
| |
| INSERT INTO iceberg_v2_null_delete_record_pos_delete VALUES |
| (NULL, 0); |
| |
| INSERT INTO iceberg_v2_null_delete_record_pos_delete VALUES |
| ('<data file path of iceberg_v2_null_delete_record>', 0), (NULL, 3); |
| |
| INSERT INTO iceberg_v2_null_delete_record_pos_delete VALUES |
| (NULL, 2), ('<data file path of iceberg_v2_null_delete_record>', 0); |
| |
| INSERT INTO iceberg_v2_null_delete_record_pos_delete VALUES |
| (NULL, 0), (NULL, 1), (NULL, 2); |
| |
| The written Parquet files have the schema of position delete files (with the |
| correct Iceberg field ids) |
| |
| 4) Copied iceberg_v2_null_delete_record to the local filesystem and applied |
| the following modifications: |
| |
| * added the Parquet files from iceberg_v2_null_delete_record_pos_delete to |
| the /data directory |
| * manually edited the metadata JSON, and the manifest and manifest list files to |
| register the delete files in the table |
| |
| arrays_big.parq: |
| Generated with RandomNestedDataGenerator.java from the following schema: |
| { |
| "fields": [ |
| { |
| "name": "int_col", |
| "type": "int" |
| }, |
| { |
| "name": "string_col", |
| "type": "string" |
| }, |
| { |
| "name": "int_array", |
| "type": [ |
| "null", |
| { |
| "type": "array", |
| "items": ["int", "null"] |
| } |
| ] |
| }, |
| { |
| "name": "double_map", |
| "type": [ |
| "null", |
| { |
| "type": "map", |
| "values": ["double", "null"] |
| } |
| ] |
| }, |
| { |
| "name": "string_array", |
| "type": [ |
| "null", |
| { |
| "type": "array", |
| "items": ["string", "null"] |
| } |
| ] |
| }, |
| { |
| "name" : "mixed", |
| "type" : { |
| "type" : "map", |
| "values" : [ |
| "null", |
| { |
| "type" : "array", |
| "items" : [ |
| "null", |
| { |
| "type": "map", |
| "values": [ |
| "null", |
| { |
| "name": "struct_in_mixed", |
| "type": "record", |
| "fields": [ |
| { |
| "name": "string_member", |
| "type": ["string", "null"] |
| }, |
| { |
| "name": "int_member", |
| "type": ["int", "null"] |
| } |
| ] |
| } |
| ] |
| } |
| ] |
| } |
| ] |
| } |
| } |
| ], |
| "name": "table_0", |
| "namespace": "org.apache.impala", |
| "type": "record" |
| } |
| The following command was used: |
| mvn -f "${IMPALA_HOME}/java/datagenerator/pom.xml" exec:java \ |
| -Dexec.mainClass="org.apache.impala.datagenerator.RandomNestedDataGenerator" \ |
| -Dexec.args="${INPUT_TBL_SCHEMA} 1500000 20 15 ${OUTPUT_FILE}" |
| |
| empty_present_stream.orc: |
| Generated by the ORC C++ library using the following code. Column s1 contains no null values |
| (which yields the empty PRESENT stream the file is named after), while column s2 is null in |
| every row: |
| |
| size_t num = 500; |
| WriterOptions options; |
| options.setRowIndexStride(100); |
| auto stream = writeLocalFile("empty_present_stream.orc"); |
| std::unique_ptr<Type> type(Type::buildTypeFromString( |
| "struct<s1:struct<id:int>,s2:struct<id:int>>")); |
| |
| std::unique_ptr<Writer> writer = createWriter(*type, stream.get(), options); |
| |
| std::unique_ptr<ColumnVectorBatch> batch = writer->createRowBatch(num); |
| StructVectorBatch* structBatch = |
| dynamic_cast<StructVectorBatch*>(batch.get()); |
| StructVectorBatch* structBatch2 = |
| dynamic_cast<StructVectorBatch*>(structBatch->fields[0]); |
| LongVectorBatch* intBatch = |
| dynamic_cast<LongVectorBatch*>(structBatch2->fields[0]); |
| |
| StructVectorBatch* structBatch3 = |
| dynamic_cast<StructVectorBatch*>(structBatch->fields[1]); |
| LongVectorBatch* intBatch2 = |
| dynamic_cast<LongVectorBatch*>(structBatch3->fields[0]); |
| |
| structBatch->numElements = num; |
| structBatch2->numElements = num; |
| |
| structBatch3->numElements = num; |
| structBatch3->hasNulls = true; |
| |
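| // Every row of s1.id gets a non-null value, while s2 and its child s2.id are |
| // marked null in every row. |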
| for (size_t i = 0; i < num; ++i) { |
| intBatch->data.data()[i] = i; |
| intBatch->notNull[i] = 1; |
| |
| intBatch2->notNull[i] = 0; |
| intBatch2->hasNulls = true; |
| |
| structBatch3->notNull[i] = 0; |
| } |
| intBatch->hasNulls = false; |
| |
| writer->add(*batch); |
| writer->close(); |
| |
| invalid_binary_data.txt: |
| Hand-written file where BINARY values are not Base64 encoded. |