 bad_parquet_data.parquet:
 Generated with parquet-mr 1.2.5
 Contains 3 single-column rows:
 "parquet"
 "is"
 "fun"

 bad_compressed_dict_page_size.parquet:
 Generated by hacking Impala's Parquet writer.
 Contains a single string column 'col' with one row ("a"). The compressed_page_size field
 in the dict page header is modified to 0 to test whether it is handled correctly.

 bad_rle_literal_count.parquet:
 Generated by hacking Impala's Parquet writer.
 Contains a single bigint column 'c' with the values 1, 3, 7 stored
 in a single data chunk as dictionary plain. The RLE encoded dictionary
 indexes are all literals (and not repeated), but the literal count
 is incorrectly 0 in the file to test that such data corruption is
 properly handled.

 bad_rle_repeat_count.parquet:
 Generated by hacking Impala's Parquet writer.
 Contains a single bigint column 'c' with the value 7 repeated 7 times
 stored in a single data chunk as dictionary plain. The RLE encoded dictionary
 indexes are a single repeated run (and not literals), but the repeat count
 is incorrectly 0 in the file to test that such data corruption is properly
 handled.
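
 The two RLE files above corrupt the run headers of Parquet's RLE/bit-packed hybrid
 encoding. A minimal Python sketch of a header check follows, assuming the layout
 described in the Parquet format spec: a ULEB128 varint whose low bit selects either a
 bit-packed "literal" run (counted in groups of 8 values) or a repeated run. The
 function names are hypothetical and this is not Impala's implementation.

```python
def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last varint byte
            return result, pos
        shift += 7


def parse_run_header(buf, pos=0):
    """Return (kind, number of values, next position) for one run header."""
    header, pos = read_uleb128(buf, pos)
    count = header >> 1
    if header & 1:
        kind = "literal"
        num_values = count * 8  # bit-packed runs come in groups of 8 values
    else:
        kind = "repeated"
        num_values = count
    if num_values == 0:
        # A zero count makes no progress through the data, so treat it as corruption.
        raise ValueError(f"corrupt {kind} run: count is 0")
    return kind, num_values, pos
```

 Under these assumptions a header byte of 0x0E decodes as a repeated run of 7 values,
 while 0x00 and 0x01 are rejected as zero-length runs.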

 zero_rows_zero_row_groups.parquet:
 Generated by hacking Impala's Parquet writer.
 The file metadata indicates zero rows and no row groups.

 zero_rows_one_row_group.parquet:
 Generated by hacking Impala's Parquet writer.
 The file metadata indicates zero rows but one row group.

 huge_num_rows.parquet:
 Generated by hacking Impala's Parquet writer.
 The file metadata indicates 2 * MAX_INT32 rows.
 The single row group also has the same number of rows in the metadata.
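
 The row count above is chosen so that it does not fit in a signed 32-bit integer. A
 small Python sketch of the arithmetic, assuming a reader that truncates the 64-bit
 count to 32 bits (to_int32 is a hypothetical helper, not an Impala function):

```python
MAX_INT32 = 2**31 - 1
num_rows = 2 * MAX_INT32  # the value stored in the file metadata


def to_int32(n):
    """Simulate truncating an integer to a signed 32-bit value."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n


print(num_rows)            # 4294967294
print(to_int32(num_rows))  # -2
```

 A reader that keeps the count in a 64-bit integer sees the full value; one that
 narrows it ends up with a negative row count.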

 repeated_values.parquet:
 Generated with parquet-mr 1.2.5
 Contains 3 single-column rows:
 "parquet"
 "parquet"
 "parquet"

 multiple_rowgroups.parquet:
 Generated with parquet-mr 1.2.5
 Populated with:
 hive> set parquet.block.size=500;
 hive> INSERT INTO TABLE tbl
       SELECT l_comment FROM tpch.lineitem LIMIT 1000;

 alltypesagg_hive_13_1.parquet:
 Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
 hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

 bad_column_metadata.parquet:
 Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
 Schema:
  {"type": "record",
   "namespace": "org.apache.impala",
   "name": "bad_column_metadata",
   "fields": [
       {"name": "id", "type": ["null", "long"]},
       {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
   ]
  }
 Contains 3 row groups, each with ten rows and each array containing ten elements. The
 first rowgroup column metadata for 'int_array' incorrectly states there are 50 values
 (instead of 100), and the second rowgroup column metadata for 'id' incorrectly states
 there are 11 values (instead of 10). The third rowgroup has the correct metadata.

 data-bzip2.bz2:
 Generated with bzip2, contains single bzip2 stream
 Contains 1 column, uncompressed data size < 8M

 large_bzip2.bz2:
 Generated with bzip2, contains single bzip2 stream
 Contains 1 column, uncompressed data size > 8M

 data-pbzip2.bz2:
 Generated with pbzip2, contains multiple bzip2 streams
 Contains 1 column, uncompressed data size < 8M

 large_pbzip2.bz2:
 Generated with pbzip2, contains multiple bzip2 streams
 Contains 1 column, uncompressed data size > 8M
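
 The difference between the bzip2 and pbzip2 files is the number of streams. A hedged
 Python sketch using the standard bz2 module shows the distinction on made-up data:
 a multi-stream file is a concatenation of independently compressed chunks, and a
 decompressor that stops at the first end-of-stream marker sees only the first chunk.
 The sample data and the count_streams helper are assumptions for illustration.

```python
import bz2

chunks = [b"1\n2\n", b"3\n4\n"]  # hypothetical single-column data

# One stream over all the data, as plain bzip2 produces.
single_stream = bz2.compress(b"".join(chunks))
# Several concatenated streams, similar in shape to pbzip2 output.
multi_stream = b"".join(bz2.compress(c) for c in chunks)


def count_streams(data):
    """Count the bzip2 streams in data by decoding them one at a time."""
    count = 0
    while data:
        decompressor = bz2.BZ2Decompressor()
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated bzip2 stream")
        data = decompressor.unused_data  # bytes after the end of this stream
        count += 1
    return count
```

 Here count_streams(single_stream) is 1 and count_streams(multi_stream) is 2, while
 bz2.decompress returns the same bytes for both inputs.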

 out_of_range_timestamp.parquet:
 Generated with a hacked version of Impala's Parquet writer.
 Contains a single timestamp column with 4 values, 2 of which are out of range
 and should be read as NULL by Impala:
    1399-12-31 00:00:00 (invalid - date too small)
    1400-01-01 00:00:00
    9999-12-31 00:00:00
   10000-01-01 00:00:00 (invalid - date too large)
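
 The expected result for the four values above can be sketched in Python. Dates are
 compared as (year, month, day) tuples because Python's datetime cannot represent the
 year 10000. The bounds are taken from the list above; read_date is a hypothetical
 helper, not Impala code.

```python
MIN_DATE = (1400, 1, 1)
MAX_DATE = (9999, 12, 31)


def read_date(year, month, day):
    """Return the date tuple, or None (NULL) when it is out of range."""
    date = (year, month, day)
    return date if MIN_DATE <= date <= MAX_DATE else None


values = [(1399, 12, 31), (1400, 1, 1), (9999, 12, 31), (10000, 1, 1)]
results = [read_date(*v) for v in values]
print(results)  # [None, (1400, 1, 1), (9999, 12, 31), None]
```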

 table_with_header.csv:
 Created with a text editor, contains a header line before the data rows.

 table_with_header_2.csv:
 Created with a text editor, contains two header lines before the data rows.

 table_with_header.gz, table_with_header_2.gz:
 Generated by gzip'ing table_with_header.csv and table_with_header_2.csv.
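
 A reader of these files has to decompress them and then skip the header lines. A
 minimal Python sketch with the standard gzip and csv modules follows. The sample
 bytes are made up and do not reflect the actual contents of the test files, and
 read_rows is a hypothetical helper.

```python
import csv
import gzip
import io


def read_rows(gz_bytes, header_lines):
    """Decompress gzip'ed CSV bytes and drop the leading header lines."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", newline="") as f:
        rows = list(csv.reader(f))
    return rows[header_lines:]


# Hypothetical file with two header lines before the data rows.
sample = gzip.compress(b"c1,c2\nint,string\n1,a\n2,b\n")
print(read_rows(sample, 2))  # [['1', 'a'], ['2', 'b']]
```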

 deprecated_statistics.parquet:
 Generated with the Hive shell, which uses parquet-mr version 1.5.0-cdh5.12.0-SNAPSHOT
 Contains a copy of the data in functional.alltypessmall with statistics that use the old
 'min'/'max' fields.

 repeated_root_schema.parquet:
 Generated by hacking Impala's Parquet writer.
 Created to reproduce IMPALA-4826. Contains a table of 300 rows where the
 repetition level of the root schema is set to REPEATED.
 Reproduction steps:
 1: Extend HdfsParquetTableWriter::CreateSchema with the following line:
    file_metadata_.schema[0].__set_repetition_type(FieldRepetitionType::REPEATED);
 2: Run test_compute_stats and grab the created Parquet file for
    alltypes_parquet table.

 binary_decimal_dictionary.parquet,
 binary_decimal_no_dictionary.parquet:
 Generated using parquet-mr and contents verified using parquet-tools-1.9.1.
 Contains decimals stored as variable-sized BYTE_ARRAY, with dictionary
 and non-dictionary encoding, respectively.

 alltypes_agg_bitpacked_def_levels.parquet:
 Generated by hacking Impala's Parquet writer to write out bitpacked def levels instead
 of the standard RLE-encoded levels. See
 https://github.com/timarmstrong/incubator-impala/tree/hack-bit-packed-levels. This
 is a single file containing all of the alltypesagg data, which includes a mix of
 null and non-null values. This is not actually a valid Parquet file because the
 bit-packed levels are written in the reverse of the order specified in the Parquet spec
 for BIT_PACKED. However, this is the order that Impala attempts to read the levels
 in - see IMPALA-3006.
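
 The two bit orders involved can be illustrated with the 3-bit example from the
 Parquet encoding spec (the values 0 through 7). The Python sketch below packs the
 same values both ways. The function names are hypothetical, and the sketch only shows
 the difference between the orders; it does not reproduce the bytes in the test file.

```python
def pack_lsb_first(values, bit_width):
    """Pack values starting at the least significant bit of each byte
    (the order used by the RLE/bit-packed hybrid encoding)."""
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * bit_width)
    num_bytes = (len(values) * bit_width + 7) // 8
    return acc.to_bytes(num_bytes, "little")


def pack_msb_first(values, bit_width):
    """Pack values starting at the most significant bit of each byte
    (the order the spec gives for the deprecated BIT_PACKED encoding)."""
    acc = 0
    for v in values:
        acc = (acc << bit_width) | v
    total_bits = len(values) * bit_width
    padding = (-total_bits) % 8  # pad the final byte with zero bits
    acc <<= padding
    return acc.to_bytes((total_bits + padding) // 8, "big")


levels = list(range(8))
print(pack_lsb_first(levels, 3).hex())  # 88c6fa
print(pack_msb_first(levels, 3).hex())  # 053977
```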

 signed_integer_logical_types.parquet:
 Generated using a utility that uses the Java Parquet API.
 The file has the following schema:
   schema {
     optional int32 id;
     optional int32 tinyint_col (INT_8);
     optional int32 smallint_col (INT_16);
     optional int32 int_col;
     optional int64 bigint_col;
   }

 min_max_is_nan.parquet:
 Generated by Impala's Parquet writer before the fix for IMPALA-6527. Git hash: 3a049a53
 Created to test the read path for a Parquet file with invalid metadata, namely when
 'max_value' and 'min_value' are both NaN. Contains 2 single-column rows:
 NaN
 42
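
 NaN statistics are invalid because every ordered comparison with NaN is false. The
 Python sketch below, with hypothetical function names, shows how a naive min/max
 filter would wrongly rule out the value 42 for this file, and one possible guard that
 ignores such statistics. It is an illustration, not Impala's read path.

```python
import math

nan = float("nan")


def may_contain_naive(value, min_value, max_value):
    """Range check that trusts the statistics blindly."""
    return min_value <= value <= max_value


def may_contain_guarded(value, min_value, max_value):
    """Range check that treats NaN statistics as missing."""
    if math.isnan(min_value) or math.isnan(max_value):
        return True  # statistics are unusable, so the data must be scanned
    return min_value <= value <= max_value


print(may_contain_naive(42.0, nan, nan))    # False, although 42 is in the file
print(may_contain_guarded(42.0, nan, nan))  # True
```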

 bad_codec.parquet:
 Generated by Impala's Parquet writer, hacked to use the invalid enum value 5000 for the
 compression codec. The data in the file is the whole of the "alltypestiny" data set, with
 the same columns: id int, bool_col boolean, tinyint_col tinyint, smallint_col smallint,
 int_col int, bigint_col bigint, float_col float, double_col double,
 date_string_col string, string_col string, timestamp_col timestamp, year int, month int

 num_values_def_levels_mismatch.parquet:
 A file with a single boolean column with page metadata reporting 2 values but only def
 levels for a single literal value. Generated by hacking Impala's Parquet writer to
 increment page.header.data_page_header.num_values. This caused Impala to hit a DCHECK
 (IMPALA-6589).

 rle_encoded_bool.parquet:
 Parquet v1 file with RLE encoded boolean column "b" and int column "i".
 Created for IMPALA-6324, generated with modified parquet-mr. Contains 279 rows,
 139 with value false, and 140 with value true. "i" is always 1 if "b" is true
 and always 0 if "b" is false.

 dict_encoding_with_large_bit_width.parquet:
 Parquet file with a single TINYINT column "i" with 33 rows. Created by a modified
 Impala to use 9 bit dictionary indices for encoding. Reading this file used to lead
 to DCHECK errors (IMPALA-7147).
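
 For context, a TINYINT column has at most 256 distinct values, so 8 bits are enough
 to index any dictionary for it, and 9-bit indices are wider than a writer would
 normally need. A small Python sketch of the minimum index width (index_bit_width is a
 hypothetical helper):

```python
def index_bit_width(num_dict_entries):
    """Smallest bit width that can represent indices 0..num_dict_entries-1."""
    if num_dict_entries < 1:
        raise ValueError("dictionary must have at least one entry")
    return (num_dict_entries - 1).bit_length()


print(index_bit_width(33))   # 6, enough for 33 distinct values
print(index_bit_width(256))  # 8, the most a TINYINT dictionary can need
```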