<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing,
~ software distributed under the License is distributed on an
~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
~ KIND, either express or implied. See the License for the
~ specific language governing permissions and limitations
~ under the License.
-->
# Test data files for Parquet compatibility and regression testing
| File | Description |
|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| delta_byte_array.parquet | string columns with DELTA_BYTE_ARRAY encoding. See [delta_byte_array.md](delta_byte_array.md) for details. |
| delta_length_byte_array.parquet | string columns with DELTA_LENGTH_BYTE_ARRAY encoding. |
| delta_binary_packed.parquet | INT32 and INT64 columns with DELTA_BINARY_PACKED encoding. See [delta_binary_packed.md](delta_binary_packed.md) for details. |
| delta_encoding_required_column.parquet | required INT32 and STRING columns with delta encoding. See [delta_encoding_required_column.md](delta_encoding_required_column.md) for details. |
| delta_encoding_optional_column.parquet | optional INT64 and STRING columns with delta encoding. See [delta_encoding_optional_column.md](delta_encoding_optional_column.md) for details. |
| nested_structs.rust.parquet                   | Used to test that the Rust Arrow reader can look up the correct field from a nested struct. See [ARROW-11452](https://issues.apache.org/jira/browse/ARROW-11452)   |
| data_index_bloom_encoding_stats.parquet | optional STRING column. Contains optional metadata: bloom filters, column index, offset index and encoding stats. |
| data_index_bloom_encoding_with_length.parquet | Same as `data_index_bloom_encoding_stats.parquet` but has `bloom_filter_length` populated in the ColumnMetaData |
| null_list.parquet                             | an empty list. Generated from the JSON `{"emptylist":[]}` to test correct read/write behaviour of this base case.                                                  |
| alltypes_tiny_pages.parquet | small page sizes with dictionary encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
| alltypes_tiny_pages_plain.parquet             | small page sizes with plain encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet).        |
| rle_boolean_encoding.parquet                  | optional boolean columns with RLE encoding                                                                                                                          |
| fixed_length_byte_array.parquet | optional FIXED_LENGTH_BYTE_ARRAY column with page index. See [fixed_length_byte_array.md](fixed_length_byte_array.md) for details. |
| int32_with_null_pages.parquet | optional INT32 column with random null pages. See [int32_with_null_pages.md](int32_with_null_pages.md) for details. |
| datapage_v1-uncompressed-checksum.parquet | uncompressed INT32 columns in v1 data pages with a matching CRC |
| datapage_v1-snappy-compressed-checksum.parquet | snappy-compressed INT32 columns in v1 data pages with a matching CRC                                                                                               |
| datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
| overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages |
| bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
| bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
| nan_in_stats.parquet                          | statistics contain NaN in the max, written by PyArrow 0.8.0. See the note below on "NaN in stats".                                                                 |
| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
| large_string_map.brotli.parquet | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
| float16_nonzeros_and_nans.parquet | Float16 (logical type) column with NaNs and nonzero finite min/max values |
| float16_zeros_and_nans.parquet                | Float16 (logical type) column with NaNs and zeros as min/max values. See [note](#float16-files) below                                                              |
| concatenated_gzip_members.parquet | 513 UINT64 numbers compressed using 2 concatenated gzip members in a single data page |
| byte_stream_split.zstd.parquet | Standard normals with `BYTE_STREAM_SPLIT` encoding. See [note](#byte-stream-split) below |
| incorrect_map_schema.parquet | Contains a Map schema without explicitly required keys, produced by Presto. See [note](#incorrect-map-schema) |
TODO: Document what each file is in the table above.
## Encrypted Files
Test files with the `.parquet.encrypted` suffix are encrypted using Parquet Modular Encryption.
A detailed description of the Parquet Modular Encryption specification can be found here:
```
https://github.com/apache/parquet-format/blob/encryption/Encryption.md
```
The following are the keys and key ids (when using the `key_retriever`) used to encrypt
the encrypted columns and footer in all the encrypted files:
* Encrypted/Signed Footer:
  * key:    {0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5}
  * key_id: "kf"
* Encrypted column named `double_field` (including column and offset index):
  * key:    {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,0}
  * key_id: "kc1"
* Encrypted column named `float_field` (including column and offset index):
  * key:    {1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,1}
  * key_id: "kc2"
The following files are encrypted with the AAD prefix "tester":
1. `encrypt_columns_and_footer_disable_aad_storage.parquet.encrypted`
2. `encrypt_columns_and_footer_aad.parquet.encrypted`
A sample that reads and checks these files can be found in the following tests
in Parquet C++:
```
cpp/src/parquet/encryption/read-configurations-test.cc
cpp/src/parquet/encryption/test-encryption-util.h
```
The `external_key_material_java.parquet.encrypted` file was encrypted using parquet-mr with
external key material enabled, so the key material is found in the
`_KEY_MATERIAL_FOR_external_key_material_java.parquet.encrypted.json` file.
This data was written using the `org.apache.parquet.crypto.keytools.mocks.InMemoryKMS` KMS client,
which is compatible with the `TestOnlyInServerWrapKms` KMS client used in C++ tests.
## Checksum Files
The schema for the `datapage_v1-*-checksum.parquet` test files is:
```
message m {
  required int32 a;
  required int32 b;
}
```
The detailed structure for these files is as follows:
* `data/datapage_v1-uncompressed-checksum.parquet`:
```
[ Column "a" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
[ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
```
* `data/datapage_v1-snappy-compressed-checksum.parquet`:
```
[ Column "a" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
[ Column "b" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
```
* `data/datapage_v1-corrupt-checksum.parquet`:
```
[ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
[ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
```
The structure of the `*-dict-*-checksum.parquet` test files is as follows:
* `data/rle-dict-snappy-checksum.parquet`:
```
[ Column "long_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
[ Column "binary_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
```
* `data/plain-dict-uncompressed-checksum.parquet`:
```
[ Column "long_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
[ Column "binary_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
```
* `data/rle-dict-uncompressed-corrupt-checksum.parquet`:
```
[ Column "long_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
[ Column "binary_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
```
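These files can be used to exercise a reader's optional page-level CRC verification. A minimal sketch, assuming pyarrow >= 13 (which exposes `page_checksum_verification`) and paths relative to this directory:
```python
import pyarrow.parquet as pq

# With verification off (the default), the corrupt file still reads:
# its pages are uncompressed and otherwise well-formed.
pq.read_table("datapage_v1-corrupt-checksum.parquet")

# With verification on, the pages with mismatching CRCs should fail.
try:
    pq.read_table("datapage_v1-corrupt-checksum.parquet",
                  page_checksum_verification=True)
except Exception as exc:
    print("CRC mismatch detected:", exc)
```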
## Bloom Filter Files
Bloom filter examples have been generated by parquet-mr.
They are not Parquet files but only contain the bloom filter header and payload.
For each of `bloom_filter.bin` and `bloom_filter.xxhash.bin`, the bloom filter
was generated by inserting the strings "hello", "parquet", "bloom", "filter".
`bloom_filter.bin` uses the original Murmur3-based bloom filter format as of
https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15e4a58698.
`bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
## NaN in stats
Prior to version 1.4.0, the C++ Parquet writer wrote NaN values into the min and
max statistics (fixed in [PARQUET-1225](https://issues.apache.org/jira/browse/PARQUET-1225)).
It has since been updated to ignore NaN values when calculating
statistics, but for backwards compatibility the following rules were established
(in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
> For backwards compatibility when reading files:
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.
> * When looking for NaN values, min and max should be ignored.
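Translated into code, a reader might sanitize decoded float statistics along these lines (a hedged sketch; `usable_min_max` is a hypothetical helper, not an API from any Parquet library):
```python
import math

def usable_min_max(stat_min, stat_max):
    """Apply the backwards-compatibility rules above to float min/max stats."""
    # If the min or max is a NaN, it should be ignored.
    if stat_min is not None and math.isnan(stat_min):
        stat_min = None
    if stat_max is not None and math.isnan(stat_max):
        stat_max = None
    # If the min is +0, the row group may contain -0 values as well;
    # if the max is -0, it may contain +0 values. Widen accordingly.
    if stat_min == 0.0 and math.copysign(1.0, stat_min) > 0:
        stat_min = -0.0
    if stat_max == 0.0 and math.copysign(1.0, stat_max) < 0:
        stat_max = 0.0
    # Per the last rule, the sanitized bounds say nothing about whether
    # NaN values are present in the row group.
    return stat_min, stat_max
```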
The file `nan_in_stats.parquet` was generated with:
```python
import pyarrow as pa # version 0.8.0
import pyarrow.parquet as pq
from numpy import NaN
tab = pa.Table.from_arrays(
    [pa.array([1.0, NaN])],
    names="x"
)
pq.write_table(tab, "nan_in_stats.parquet")
metadata = pq.read_metadata("nan_in_stats.parquet")
metadata.row_group(0).column(0)
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
# file_offset: 88
# file_path:
# type: DOUBLE
# num_values: 2
# path_in_schema: x
# is_stats_set: True
# statistics:
# <pyarrow._parquet.RowGroupStatistics object at 0x7f28539e5738>
# has_min_max: True
# min: 1
# max: nan
# null_count: 0
# distinct_count: 0
# num_values: 2
# physical_type: DOUBLE
# compression: 1
# encodings: <map object at 0x7f28539eb4e0>
# has_dictionary_page: True
# dictionary_page_offset: 4
# data_page_offset: 36
# index_page_offset: 0
# total_compressed_size: 84
# total_uncompressed_size: 80
```
## Large string map
The file `large_string_map.brotli.parquet` was generated with:
```python
import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
arr = pa.chunked_array([arr, arr])
tab = pa.table({ "arr": arr })
pq.write_table(tab, "test.parquet", compression='BROTLI')
```
It is meant to exercise reading of structured data where each value
is smaller than 2GB but the combined uncompressed column chunk size
is greater than 2GB.
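A minimal read-back sketch (assumes several GiB of free memory, since the column chunk decompresses to over 2GB; the exact chunking is reader-dependent):
```python
import pyarrow.parquet as pq

tab = pq.read_table("large_string_map.brotli.parquet")
assert tab.num_rows == 2
# A conforming reader splits the >2GB column chunk across multiple Arrow
# chunks instead of overflowing 32-bit string offsets.
print(tab["arr"].num_chunks)
```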
## Float16 Files
The files `float16_zeros_and_nans.parquet` and `float16_nonzeros_and_nans.parquet`
are meant to exercise a variety of test cases regarding `Float16` columns (which
are represented as 2-byte `FixedLenByteArray`s), including:
* Basic binary representations of standard values, +/- zeros, and NaN
* Comparisons between finite values
* Exclusion of NaNs from statistics min/max
* Normalizing min/max values when only zeros are present (i.e. `min` is always -0 and `max` is always +0)
The aforementioned files were generated with:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
t1 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(0.0),
               np.float16(np.NaN)], type=pa.float16())],
    names="x")
t2 = pa.Table.from_arrays(
    [pa.array([None,
               np.float16(1.0),
               np.float16(-2.0),
               np.float16(np.NaN),
               np.float16(0.0),
               np.float16(-1.0),
               np.float16(-0.0),
               np.float16(2.0)],
              type=pa.float16())],
    names="x")
pq.write_table(t1, "float16_zeros_and_nans.parquet", compression='none')
pq.write_table(t2, "float16_nonzeros_and_nans.parquet", compression='none')
m1 = pq.read_metadata("float16_zeros_and_nans.parquet")
m2 = pq.read_metadata("float16_nonzeros_and_nans.parquet")
print(m1.row_group(0).column(0))
print(m2.row_group(0).column(0))
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f79e9a3d850>
# file_offset: 68
# file_path:
# physical_type: FIXED_LEN_BYTE_ARRAY
# num_values: 3
# path_in_schema: x
# is_stats_set: True
# statistics:
# <pyarrow._parquet.Statistics object at 0x7f79e9a3d940>
# has_min_max: True
# min: b'\x00\x80'
# max: b'\x00\x00'
# null_count: 1
# distinct_count: None
# num_values: 2
# physical_type: FIXED_LEN_BYTE_ARRAY
# logical_type: Float16
# converted_type (legacy): NONE
# compression: UNCOMPRESSED
# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
# has_dictionary_page: True
# dictionary_page_offset: 4
# data_page_offset: 22
# total_compressed_size: 64
# total_uncompressed_size: 64
# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f79ea003c40>
# file_offset: 80
# file_path:
# physical_type: FIXED_LEN_BYTE_ARRAY
# num_values: 8
# path_in_schema: x
# is_stats_set: True
# statistics:
# <pyarrow._parquet.Statistics object at 0x7f79e9a3d8a0>
# has_min_max: True
# min: b'\x00\xc0'
# max: b'\x00@'
# null_count: 1
# distinct_count: None
# num_values: 7
# physical_type: FIXED_LEN_BYTE_ARRAY
# logical_type: Float16
# converted_type (legacy): NONE
# compression: UNCOMPRESSED
# encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
# has_dictionary_page: True
# dictionary_page_offset: 4
# data_page_offset: 32
# total_compressed_size: 76
# total_uncompressed_size: 76
```
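The raw 2-byte min/max statistics in the dumps above can be decoded with Python's native half-precision support to confirm the normalization; a quick sketch, assuming pyarrow surfaces `FIXED_LEN_BYTE_ARRAY` statistics as raw bytes as shown above:
```python
import math
import struct
import pyarrow.parquet as pq

meta = pq.read_metadata("float16_zeros_and_nans.parquet")
stats = meta.row_group(0).column(0).statistics
lo, = struct.unpack("<e", stats.min)  # b'\x00\x80' decodes to -0.0
hi, = struct.unpack("<e", stats.max)  # b'\x00\x00' decodes to +0.0
assert lo == 0.0 and math.copysign(1.0, lo) < 0  # min normalized to -0
assert hi == 0.0 and math.copysign(1.0, hi) > 0  # max normalized to +0
```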
## Byte Stream Split
### FLOAT and DOUBLE data
`byte_stream_split.zstd.parquet` is generated by pyarrow 14.0.2 using the following code:
```python
import pyarrow as pa
from pyarrow import parquet as pq
import numpy as np
np.random.seed(0)
table = pa.Table.from_pydict({
    'f32': np.random.normal(size=300).astype(np.float32),
    'f64': np.random.normal(size=300).astype(np.float64),
})
pq.write_table(
    table,
    'byte_stream_split.parquet',
    version='2.6',
    compression='zstd',
    compression_level=22,
    column_encoding='BYTE_STREAM_SPLIT',
    use_dictionary=False,
)
```
This is a practical case where `BYTE_STREAM_SPLIT` encoding obtains a smaller file size than `PLAIN` or dictionary.
Since the distributions are random normals centered at 0, each byte has nontrivial behavior.
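To observe the size difference, the same table can be written twice under identical zstd settings, once per encoding, and the resulting files compared (a sketch; the output file names are made up here, and exact sizes depend on the zstd build):
```python
import os
import numpy as np
import pyarrow as pa
from pyarrow import parquet as pq

np.random.seed(0)
table = pa.Table.from_pydict({
    'f32': np.random.normal(size=300).astype(np.float32),
    'f64': np.random.normal(size=300).astype(np.float64),
})
for enc in ('BYTE_STREAM_SPLIT', 'PLAIN'):
    pq.write_table(table, f'{enc.lower()}.parquet', version='2.6',
                   compression='zstd', compression_level=22,
                   column_encoding=enc, use_dictionary=False)
    print(enc, os.path.getsize(f'{enc.lower()}.parquet'))
```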
### Additional types
`byte_stream_split_extended.gzip.parquet` is generated by pyarrow 16.0.0.
It contains 7 pairs of columns, each in two variants containing the same
values: one `PLAIN`-encoded and one `BYTE_STREAM_SPLIT`-encoded:
```
Version: 2.6
Created By: parquet-cpp-arrow version 16.0.0-SNAPSHOT
Total rows: 200
Number of RowGroups: 1
Number of Real Columns: 14
Number of Columns: 14
Number of Selected Columns: 14
Column 0: float16_plain (FIXED_LEN_BYTE_ARRAY(2) / Float16)
Column 1: float16_byte_stream_split (FIXED_LEN_BYTE_ARRAY(2) / Float16)
Column 2: float_plain (FLOAT)
Column 3: float_byte_stream_split (FLOAT)
Column 4: double_plain (DOUBLE)
Column 5: double_byte_stream_split (DOUBLE)
Column 6: int32_plain (INT32)
Column 7: int32_byte_stream_split (INT32)
Column 8: int64_plain (INT64)
Column 9: int64_byte_stream_split (INT64)
Column 10: flba5_plain (FIXED_LEN_BYTE_ARRAY(5))
Column 11: flba5_byte_stream_split (FIXED_LEN_BYTE_ARRAY(5))
Column 12: decimal_plain (FIXED_LEN_BYTE_ARRAY(4) / Decimal(precision=7, scale=3) / DECIMAL(7,3))
Column 13: decimal_byte_stream_split (FIXED_LEN_BYTE_ARRAY(4) / Decimal(precision=7, scale=3) / DECIMAL(7,3))
```
To check conformance of a `BYTE_STREAM_SPLIT` decoder, read each
`BYTE_STREAM_SPLIT`-encoded column and compare the decoded values against
the values from the corresponding `PLAIN`-encoded column. The values should
be equal.
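A minimal sketch of that check with pyarrow, relying on the column naming convention shown above:
```python
import pyarrow.parquet as pq

table = pq.read_table("byte_stream_split_extended.gzip.parquet")
for name in table.column_names:
    if name.endswith("_plain"):
        twin = name.replace("_plain", "_byte_stream_split")
        # Decoded BYTE_STREAM_SPLIT values must equal their PLAIN twins.
        assert table[name].equals(table[twin]), f"{name} != {twin}"
```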
## Incorrect Map Schema
A number of producers, such as Presto/Trino/Athena, have been creating files with schemas
where the Map key fields are marked as optional rather than required.
This is not spec-compliant, yet appears in a number of existing data files in the wild.
This issue has been fixed in:
- [Trino v386+](https://github.com/trinodb/trino/commit/3247bd2e64d7422bd13e805cd67cfca3fa8ba520)
- [Presto v0.274+](https://github.com/prestodb/presto/commit/842b46972c11534a7729d0a18e3abc5347922d1a)
These problematic files can be recreated for testing [arrow-rs #5630](https://github.com/apache/arrow-rs/pull/5630)
with the relevant Presto/Trino CLI, or with the AWS Athena console:
```sql
CREATE TABLE my_catalog.my_table_name WITH (format = 'Parquet') AS (
  SELECT MAP(
    ARRAY['name', 'parent'],
    ARRAY[
      'report',
      'another'
    ]
  ) my_map
)
```
The schema in the created file is:
```
message hive_schema {
  OPTIONAL group my_map (MAP) {
    REPEATED group key_value (MAP_KEY_VALUE) {
      OPTIONAL BYTE_ARRAY key (STRING);
      OPTIONAL BYTE_ARRAY value (STRING);
    }
  }
}
```
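To see how a given reader tolerates the non-compliant schema, it is enough to load the file and inspect the result; a quick sketch with pyarrow (whether the map keys surface as nullable is reader-dependent):
```python
import pyarrow.parquet as pq

t = pq.read_table("incorrect_map_schema.parquet")
print(t.schema)  # the map key field comes from an OPTIONAL Parquet field
```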