
 bad_parquet_data.parquet:
 Generated with parquet-mr 1.2.5
 Contains 3 single-column rows:
 "parquet"
 "is"
 "fun"

 bad_rle_literal_count.parquet:
 Generated by hacking Impala's Parquet writer.
 Contains a single bigint column 'c' with the values 1, 3, 7 stored
 in a single data chunk as dictionary plain. The RLE encoded dictionary
 indexes are all literals (and not repeated), but the literal count
 is incorrectly 0 in the file to test that such data corruption is
 properly handled.

 bad_rle_repeat_count.parquet:
 Generated by hacking Impala's Parquet writer.
 Contains a single bigint column 'c' with the value 7 repeated 7 times
 stored in a single data chunk as dictionary plain. The RLE encoded dictionary
 indexes are a single repeated run (and not literals), but the repeat count
 is incorrectly 0 in the file to test that such data corruption is properly
 handled.
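
 The two bad_rle_* files above can be understood through the run header of
 Parquet's RLE/bit-packed hybrid encoding: a varint whose low bit selects the
 run type and whose remaining bits give the count. The Python sketch below is
 illustrative only and is not Impala's decoder; the function names are
 hypothetical, and the header bytes in the usage note are made up rather than
 read from the test files. It rejects any run whose count is 0.

```python
def read_uleb128(buf, pos=0):
    """Decode an unsigned LEB128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def parse_run_header(buf, pos=0):
    """Return (kind, count, next_pos) for one run header.

    Raises ValueError when the count is 0, because such a run can never
    produce a value and indicates corrupt data.
    """
    header, pos = read_uleb128(buf, pos)
    if header & 1:
        kind = "literal"
        # Literal (bit-packed) runs are stored in groups of 8 values.
        count = (header >> 1) * 8
    else:
        kind = "repeated"
        count = header >> 1
    if count == 0:
        raise ValueError(f"corrupt RLE data: {kind} run with count 0")
    return kind, count, pos
```

 For example, parse_run_header(bytes([0x0E])) reports a repeated run of 7
 values, while a header byte of 0x00 (repeated, count 0) or 0x01 (literal,
 count 0) raises ValueError.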

 zero_rows_zero_row_groups.parquet:
 Generated by hacking Impala's Parquet writer.
 The file metadata indicates zero rows and no row groups.

 zero_rows_one_row_group.parquet:
 Generated by hacking Impala's Parquet writer.
 The file metadata indicates zero rows but one row group.

 repeated_values.parquet:
 Generated with parquet-mr 1.2.5
 Contains 3 single-column rows:
 "parquet"
 "parquet"
 "parquet"

 multiple_rowgroups.parquet:
 Generated with parquet-mr 1.2.5
 Populated with:
 hive> set parquet.block.size=500;
 hive> INSERT INTO TABLE tbl
       SELECT l_comment FROM tpch.lineitem LIMIT 1000;

 alltypesagg_hive_13_1.parquet:
 Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
 hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

 bad_column_metadata.parquet:
 Generated with a hacked version of parquet-mr 1.8.2-SNAPSHOT
 Schema:
  {"type": "record",
   "namespace": "org.apache.impala",
   "name": "bad_column_metadata",
   "fields": [
       {"name": "id", "type": ["null", "long"]},
       {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
   ]
  }
 Contains 3 row groups, each with ten rows and each array containing ten elements. The
 column metadata for 'int_array' in the first row group incorrectly states there are 50
 values (instead of 100), and the column metadata for 'id' in the second row group
 incorrectly states there are 11 values (instead of 10). The third row group has the
 correct metadata.
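
 The expected counts follow from the layout: ten rows give 10 'id' values per
 row group, and ten arrays of ten elements give 100 'int_array' values. The
 Python sketch below spells out that arithmetic and flags the two bad
 entries. It is a hypothetical check, not how Impala validates metadata, and
 it assumes the columns not called out above report correct counts.

```python
ROWS_PER_GROUP = 10
ELEMENTS_PER_ARRAY = 10

# Value counts each row group should report, derived from the description.
EXPECTED = {
    "id": ROWS_PER_GROUP,
    "int_array": ROWS_PER_GROUP * ELEMENTS_PER_ARRAY,
}

# Value counts the column metadata claims, one dict per row group.
# Entries not mentioned in the description are assumed to be correct.
CLAIMED = [
    {"id": 10, "int_array": 50},
    {"id": 11, "int_array": 100},
    {"id": 10, "int_array": 100},
]


def find_mismatches(expected, claimed_groups):
    """Return (row_group_index, column, expected, claimed) for each bad count."""
    mismatches = []
    for index, claimed in enumerate(claimed_groups):
        for column, want in expected.items():
            got = claimed.get(column)
            if got != want:
                mismatches.append((index, column, want, got))
    return mismatches
```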

 data-bzip2.bz2:
 Generated with bzip2, contains single bzip2 stream
 Contains 1 column, uncompressed data size < 8M

 large_bzip2.bz2:
 Generated with bzip2, contains single bzip2 stream
 Contains 1 column, uncompressed data size > 8M

 data-pbzip2.bz2:
 Generated with pbzip2, contains multiple bzip2 streams
 Contains 1 column, uncompressed data size < 8M

 large_pbzip2.bz2:
 Generated with pbzip2, contains multiple bzip2 streams
 Contains 1 column, uncompressed data size > 8M
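
 The single-stream versus multi-stream distinction in the four .bz2 files
 above can be checked with Python's standard bz2 module. The sketch below is
 a hypothetical helper and not part of the test suite: it counts the
 concatenated streams in a byte string by starting a new BZ2Decompressor
 each time one reaches end-of-stream. It decompresses everything in memory,
 which is acceptable for a quick check.

```python
import bz2


def count_bz2_streams(data):
    """Count the concatenated bzip2 streams in `data`."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        # Bytes after the end of this stream belong to the next one.
        data = decomp.unused_data
    return streams
```

 A file written by bzip2 would be expected to give 1, and one written by
 pbzip2 a count greater than 1 once the input spans several blocks.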

 out_of_range_timestamp.parquet:
 Generated with a hacked version of Impala's Parquet writer.
 Contains a single timestamp column with 4 values, 2 of which are out of range
 and should be read as NULL by Impala:
    1399-12-31 00:00:00 (invalid - date too small)
    1400-01-01 00:00:00
    9999-12-31 00:00:00
   10000-01-01 00:00:00 (invalid - date too large)
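
 The four values above bracket the valid range on both sides. The Python
 sketch below is illustrative only and is not Impala's validation code; the
 bounds are inferred from the listed values. It compares (year, month, day)
 tuples because Python's datetime cannot represent the year 10000.

```python
# Bounds inferred from the values listed above (an assumption of this sketch).
MIN_DATE = (1400, 1, 1)
MAX_DATE = (9999, 12, 31)


def in_supported_range(year, month, day):
    """Return True if the date lies within [MIN_DATE, MAX_DATE]."""
    return MIN_DATE <= (year, month, day) <= MAX_DATE


# The four dates stored in the file, in order.
FILE_DATES = [(1399, 12, 31), (1400, 1, 1), (9999, 12, 31), (10000, 1, 1)]

# Out-of-range values are expected to be read as NULL (None here).
READ_BACK = [d if in_supported_range(*d) else None for d in FILE_DATES]
```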