
Test data files for Parquet compatibility and regression testing

| File | Description |
|---|---|
| delta_byte_array.parquet | string columns with DELTA_BYTE_ARRAY encoding. See delta_byte_array.md for details. |
| delta_length_byte_array.parquet | string columns with DELTA_LENGTH_BYTE_ARRAY encoding. |
| delta_binary_packed.parquet | INT32 and INT64 columns with DELTA_BINARY_PACKED encoding. See delta_binary_packed.md for details. |
| delta_encoding_required_column.parquet | required INT32 and STRING columns with delta encoding. See delta_encoding_required_column.md for details. |
| delta_encoding_optional_column.parquet | optional INT64 and STRING columns with delta encoding. See delta_encoding_optional_column.md for details. |
| nested_structs.rust.parquet | Used to test that the Rust Arrow reader can look up the correct field from a nested struct. See ARROW-11452. |
| data_index_bloom_encoding_stats.parquet | optional STRING column. Contains optional metadata: bloom filters, column index, offset index and encoding stats. |
| null_list.parquet | an empty list. Generated from the JSON {"emptylist":[]} to test correct read/write behaviour for this base case. |
| alltypes_tiny_pages.parquet | small page sizes with dictionary encoding and a page index, from Impala. |
| alltypes_tiny_pages_plain.parquet | small page sizes with plain encoding and a page index, from Impala. |
| rle_boolean_encoding.parquet | optional boolean columns with RLE encoding. |
| fixed_length_byte_array.parquet | optional FIXED_LENGTH_BYTE_ARRAY column with page index. See fixed_length_byte_array.md for details. |
| datapage_v1-uncompressed-checksum.parquet | uncompressed INT32 columns in v1 data pages with a matching CRC. |
| datapage_v1-snappy-compressed-checksum.parquet | Snappy-compressed INT32 columns in v1 data pages with a matching CRC. |
| datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC. |

TODO: Document what each file is in the table above.

Encrypted Files

Test files with the .parquet.encrypted suffix are encrypted using Parquet Modular Encryption.

A detailed description of the Parquet Modular Encryption specification can be found here:

 https://github.com/apache/parquet-format/blob/encryption/Encryption.md
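
As a quick sanity check before attempting decryption, a file's footer mode can be guessed from its magic bytes. The sketch below assumes the magic values given in the Parquet format and encryption specifications ("PAR1" for regular and plaintext-footer files, "PARE" for files with an encrypted footer); the function names are hypothetical and not part of any Parquet library.

```python
from pathlib import Path

# Magic values assumed from the Parquet format / encryption specifications.
MAGIC_PLAINTEXT_FOOTER = b"PAR1"  # regular files and plaintext-footer encrypted files
MAGIC_ENCRYPTED_FOOTER = b"PARE"  # files whose footer is itself encrypted


def classify_magic(head: bytes, tail: bytes) -> str:
    """Classify a file from its first and last four bytes."""
    if head == tail == MAGIC_PLAINTEXT_FOOTER:
        return "plaintext footer"
    if head == tail == MAGIC_ENCRYPTED_FOOTER:
        return "encrypted footer"
    return "not a Parquet file"


def classify_file(path: str) -> str:
    """Read the leading and trailing magic bytes of the file at `path`."""
    data = Path(path).read_bytes()
    return classify_magic(data[:4], data[-4:])
```

Under these assumptions, `classify_file` would be expected to report "plaintext footer" for encrypt_columns_plaintext_footer.parquet.encrypted and "encrypted footer" for the encrypt_columns_and_footer*.parquet.encrypted files, but this has not been verified against the files here.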

The following keys and key ids (when using key_retriever) were used to encrypt the encrypted columns and footer in all the encrypted files:

  • Encrypted/Signed Footer:
    • key: {0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5}
    • key_id: “kf”
  • Encrypted column named double_field:
    • key: {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,0}
    • key_id: “kc1”
  • Encrypted column named float_field:
    • key: {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,1}
    • key_id: “kc2”

The following files are encrypted with AAD prefix “tester”:

  1. encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted
  2. encrypt_columns_and_footer_aad.parquet.encrypted
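
The key material and AAD prefix listed above can be collected into a small lookup that plays the role of a key retriever. This is only a sketch: the names `KEY_DIGITS`, `retrieve_key`, and the two encoding helpers are hypothetical, and the listing does not say whether each key entry is a raw byte value or an ASCII digit character, so both readings are provided and the ASCII one is chosen as an assumed default. Check the tests referenced below before relying on either.

```python
# Key digits exactly as listed above, indexed by key_id.
KEY_DIGITS = {
    "kf": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5],   # footer
    "kc1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 0],  # double_field
    "kc2": [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 1],  # float_field
}

# AAD prefix for the two files listed above; ASCII encoding is an assumption.
AAD_PREFIX = b"tester"


def digits_as_ascii(digits):
    """Interpret each entry as an ASCII digit character ('0'..'9')."""
    return "".join(str(d) for d in digits).encode("ascii")


def digits_as_raw(digits):
    """Interpret each entry as a raw byte value (0x00..0x09)."""
    return bytes(digits)


def retrieve_key(key_id, encode=digits_as_ascii):
    """Return the 16-byte key for `key_id`, using the chosen interpretation."""
    key = encode(KEY_DIGITS[key_id])
    if len(key) != 16:
        raise ValueError(f"key {key_id!r} is {len(key)} bytes, expected 16")
    return key
```

Either interpretation yields 16-byte keys, so the length check alone cannot tell them apart; a failed decryption with one encoding is a hint to try the other.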

Sample code that reads and checks these files can be found in the following tests:

cpp/src/parquet/encryption-read-configurations-test.cc
cpp/src/parquet/test-encryption-util.h

Checksum Files

The schema for the datapage_v1-*-checksum.parquet test files is:

message m {
    required int32 a;
    required int32 b;
} 

The detailed structure for these files is as follows:

  • data/datapage_v1-uncompressed-checksum.parquet:

    [ Column "a" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
    [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
    
  • data/datapage_v1-snappy-compressed-checksum.parquet:

    [ Column "a" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
    [ Column "b" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
    
  • data/datapage_v1-corrupt-checksum.parquet:

    [ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
    [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]