bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"

bad_compressed_dict_page_size.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single string column 'col' with one row ("a"). The compressed_page_size field
in the dictionary page header is modified to 0 to test that it is correctly handled.

bad_rle_literal_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the values 1, 3, 7 stored
in a single data chunk with plain dictionary encoding. The RLE-encoded dictionary
indexes are all literals (not a repeated run), but the literal count
in the file is incorrectly 0, to test that such data corruption is
properly handled.

bad_rle_repeat_count.parquet:
Generated by hacking Impala's Parquet writer.
Contains a single bigint column 'c' with the value 7 repeated 7 times,
stored in a single data chunk with plain dictionary encoding. The RLE-encoded
dictionary indexes are a single repeated run (not literals), but the repeat count
in the file is incorrectly 0, to test that such data corruption is
properly handled.

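The two hacked files above corrupt the run headers of Parquet's RLE/bit-packed hybrid
encoding. As an illustrative sketch (not Impala's actual reader), each run starts with a
ULEB128-encoded header whose low bit selects the run type and whose remaining bits carry
the count, so a zero count in either branch describes a run with no values:

```python
def parse_run_header(header: int) -> tuple[str, int]:
    """Interpret a decoded RLE/bit-packed hybrid run header.

    The low bit selects the run type; the remaining bits carry the count.
    A count of 0 (as in the hacked files above) describes an empty run.
    """
    if header & 1:
        # Bit-packed run: the count is the number of 8-value groups.
        return ("bit-packed", (header >> 1) * 8)
    # Repeated (RLE) run: the count is the number of repeated values.
    return ("rle", header >> 1)

# A repeated run of 7 values (as in bad_rle_repeat_count) has header 7 << 1 == 14.
assert parse_run_header(14) == ("rle", 7)
# The corrupt files instead carry a header whose count bits are 0.
assert parse_run_header(0) == ("rle", 0)
assert parse_run_header(1) == ("bit-packed", 0)
```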
zero_rows_zero_row_groups.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows and no row groups.

zero_rows_one_row_group.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates zero rows but one row group.

huge_num_rows.parquet:
Generated by hacking Impala's Parquet writer.
The file metadata indicates 2 * MAX_INT32 rows.
The single row group also has the same number of rows in the metadata.

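For context, the num_rows metadata field is an int64 in Thrift; the hacked value
2 * MAX_INT32 is chosen because it cannot fit in a signed 32-bit integer, so a reader
that narrows the count would overflow. A quick sanity check of the arithmetic (plain
Python, nothing Impala-specific):

```python
MAX_INT32 = 2**31 - 1  # 2147483647, the largest signed 32-bit value

num_rows = 2 * MAX_INT32  # the row count claimed by the file metadata

# The claimed count exceeds the signed 32-bit range...
assert num_rows == 4294967294
assert num_rows > MAX_INT32
# ...but fits comfortably in the int64 metadata field.
assert num_rows < 2**63 - 1
```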
repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
      SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
{"type": "record",
 "namespace": "org.apache.impala",
 "name": "bad_column_metadata",
 "fields": [
   {"name": "id", "type": ["null", "long"]},
   {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
 ]
}
Contains 3 row groups, each with ten rows and each array containing ten elements. The
first row group's column metadata for 'int_array' incorrectly states there are 50 values
(instead of 100), and the second row group's column metadata for 'id' incorrectly states
there are 11 values (instead of 10). The third row group has the correct metadata.

data-bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M

out_of_range_timestamp.parquet:
Generated with a hacked version of Impala's Parquet writer.
Contains a single timestamp column with 4 values, 2 of which are out of range
and should be read as NULL by Impala:
1399-12-31 00:00:00 (invalid - date too small)
1400-01-01 00:00:00
9999-12-31 00:00:00
10000-01-01 00:00:00 (invalid - date too large)

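A minimal sketch of the range check implied above, assuming the supported range is
exactly [1400-01-01, 9999-12-31] as the description states (the function name is
illustrative, not Impala's):

```python
from datetime import date

# Supported date range per the description above.
MIN_DATE = date(1400, 1, 1)
MAX_DATE = date(9999, 12, 31)

def is_readable_timestamp_date(d: date) -> bool:
    """Return True if the date part of a timestamp is in the readable range."""
    return MIN_DATE <= d <= MAX_DATE

assert not is_readable_timestamp_date(date(1399, 12, 31))  # read as NULL
assert is_readable_timestamp_date(date(1400, 1, 1))
assert is_readable_timestamp_date(date(9999, 12, 31))
# 10000-01-01 cannot even be constructed as a Python date (year > 9999),
# which mirrors the "date too large" case.
```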
table_with_header.csv:
Created with a text editor, contains a header line before the data rows.

table_with_header_2.csv:
Created with a text editor, contains two header lines before the data rows.

table_with_header.gz, table_with_header_2.gz:
Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.

deprecated_statistics.parquet:
Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
Contains a copy of the data in functional.alltypessmall with statistics that use the old
'min'/'max' fields.

repeated_root_schema.parquet:
Generated by hacking Impala's Parquet writer.
Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
repetition level of the root schema is set to REPEATED.
Reproduction steps:
1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
   file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
2: Run test_compute_stats and grab the created Parquet file for the
   alltypes_parquet table.

binary_decimal_dictionary.parquet,
binary_decimal_no_dictionary.parquet:
Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
Contain decimals stored as variable-sized BYTE_ARRAY values, with dictionary
and non-dictionary encoding respectively.

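Per the Parquet format, a decimal stored as BYTE_ARRAY holds the unscaled value as
big-endian two's complement; a sketch of the decode (the scale would come from the
column's decimal metadata, and the function name is illustrative):

```python
from decimal import Decimal

def decode_byte_array_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet BYTE_ARRAY decimal: big-endian two's complement."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# 0x04D2 is 1234; with scale 2 this is 12.34.
assert decode_byte_array_decimal(b"\x04\xd2", 2) == Decimal("12.34")
# Negative values come from the sign bit of the leading byte.
assert decode_byte_array_decimal(b"\xfb\x2e", 2) == Decimal("-12.34")
```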
alltypes_agg_bitpacked_def_levels.parquet:
Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
of the standard RLE-encoded levels. See
https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
is a single file containing all of the alltypesagg data, which includes a mix of
null and non-null values. This is not actually a valid Parquet file because the
bit-packed levels are written in the reverse of the order specified in the Parquet
spec for BIT_PACKED. However, this is the order that Impala attempts to read the
levels in - see IMPALA-3006.

signed_integer_logical_types.parquet:
Generated using a utility that uses the Java Parquet API.
The file has the following schema:
schema {
  optional int32 id;
  optional int32 tinyint_col (INT_8);
  optional int32 smallint_col (INT_16);
  optional int32 int_col;
  optional int64 bigint_col;
}

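The INT_8 and INT_16 annotations narrow the int32 physical type to a smaller logical
range; a hedged sketch of the range check a reader could apply (the table and function
names are illustrative):

```python
# Value ranges implied by Parquet's signed-integer logical type annotations.
LOGICAL_RANGES = {
    "INT_8": (-2**7, 2**7 - 1),     # tinyint_col
    "INT_16": (-2**15, 2**15 - 1),  # smallint_col
    "INT_32": (-2**31, 2**31 - 1),  # unannotated int32 columns
}

def fits_logical_type(value: int, annotation: str) -> bool:
    """Check that a physically-int32 value fits its annotated logical range."""
    lo, hi = LOGICAL_RANGES[annotation]
    return lo <= value <= hi

assert fits_logical_type(127, "INT_8")
assert not fits_logical_type(128, "INT_8")
assert fits_logical_type(-32768, "INT_16")
```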
min_max_is_nan.parquet:
Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
Created to test the read path for a Parquet file with invalid metadata, namely when
'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
NaN
42

bad_codec.parquet:
Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
compression codec. The data in the file is the whole of the "alltypestiny" data set, with
the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
int_col int, bigint_col bigint, float_col float, double_col double,
date_string_col string, string_col string, timestamp_col timestamp, year int, month int

num_values_def_levels_mismatch.parquet:
A file with a single boolean column with page metadata reporting 2 values but only def
levels for a single literal value. Generated by hacking Impala's Parquet writer to
increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
(IMPALA-6589).

rle_encoded_bool.parquet:
Parquet v1 file with RLE-encoded boolean column "b" and int column "i".
Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows,
139 with value false, and 140 with value true. "i" is always 1 if "b" is true
and always 0 if "b" is false.

dict_encoding_with_large_bit_width.parquet:
Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified
Impala to use 9-bit dictionary indices for encoding. Reading this file used to lead
to DCHECK errors (IMPALA-7147).

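For context, a dictionary of N entries needs only ceil(log2(N)) index bits, and a
TINYINT column can have at most 256 distinct values, so no legitimate bit width for
this file would exceed 8; a sketch of the minimal-width computation:

```python
def min_dict_index_bit_width(dict_size: int) -> int:
    """Smallest bit width able to address every dictionary entry."""
    if dict_size <= 1:
        return 0  # a single entry needs no index bits at all
    return (dict_size - 1).bit_length()

# A TINYINT column has at most 256 distinct values, so 8 bits always suffice;
# the 9-bit indices in this file exceed any legitimate width.
assert min_dict_index_bit_width(256) == 8
assert min_dict_index_bit_width(33) == 6
assert min_dict_index_bit_width(2) == 1
```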
decimal_stored_as_int32.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int32.
Impala needs to be able to read such values (IMPALA-5542).

decimal_stored_as_int64.parquet:
Parquet file generated by Spark 2.3.1 that contains decimals stored as int64.
Impala needs to be able to read such values (IMPALA-5542).

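For both files, the INT32/INT64 storage carries the decimal's unscaled value directly,
so decoding is just an application of the column's scale; a small sketch (the function
name is illustrative):

```python
from decimal import Decimal

def decode_int_decimal(unscaled: int, scale: int) -> Decimal:
    """Decimals stored as INT32/INT64 carry the unscaled value directly."""
    return Decimal(unscaled).scaleb(-scale)

assert decode_int_decimal(12345, 2) == Decimal("123.45")
assert decode_int_decimal(-7, 3) == Decimal("-0.007")
```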
primitive_type_widening.parquet:
Parquet file that contains two rows with the following schema:
- int32 tinyint_col1
- int32 tinyint_col2
- int32 tinyint_col3
- int32 tinyint_col4
- int32 smallint_col1
- int32 smallint_col2
- int32 smallint_col3
- int32 int_col1
- int32 int_col2
- float float_col
It is used to test primitive type widening (IMPALA-6373).