Add test files for dictionary page CRC
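
For context, here is a minimal sketch of the check these files are meant to exercise. It assumes the page CRC is a standard CRC-32 (the GZip polynomial) over the page bytes as stored, after any compression and excluding the page header, and that the header carries it as a signed 32-bit integer. The helper names `page_crc_matches` and `to_signed_i32` and the sample bytes are hypothetical; they are not part of any Parquet library, and a real reader would take the bytes and the `crc` value from the file itself.

```python
import zlib


def page_crc_matches(page_bytes: bytes, header_crc: int) -> bool:
    """Return True if the CRC-32 of the stored page bytes equals the header value.

    header_crc is assumed to be the signed 32-bit integer read from the page
    header, so both sides are masked to unsigned 32 bits before comparing.
    """
    computed = zlib.crc32(page_bytes) & 0xFFFFFFFF
    return computed == (header_crc & 0xFFFFFFFF)


def to_signed_i32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value


# Hypothetical stand-in for the bytes of a dictionary page as stored on disk.
dict_page = b"\x01\x00\x00\x00\x02\x00\x00\x00"
good_crc = to_signed_i32(zlib.crc32(dict_page))

print(page_crc_matches(dict_page, good_crc))      # header CRC matches the bytes
print(page_crc_matches(dict_page, good_crc ^ 1))  # header CRC with one bit flipped
```

Under these assumptions, a reader with checksum verification enabled would accept every page in the two matching files and reject the dictionary pages of the corrupt file.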
diff --git a/data/README.md b/data/README.md
index dd25ade..638f0d1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -41,6 +41,9 @@
| bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
| bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
| nan_in_stats.parquet | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats". |
+| rle-dict-snappy-checksum.parquet | Snappy-compressed, dictionary-encoded INT32 and STRING columns in format v2 with matching CRCs on the dictionary and data pages |
+| plain-dict-uncompressed-checksum.parquet | uncompressed, dictionary-encoded INT32 and STRING columns in format v1 with matching CRCs on the dictionary and data pages |
+| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed, dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC on the dictionary pages |
TODO: Document what each file is in the table above.
@@ -111,6 +114,24 @@
[ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
```
+The schema for the `*-dict-*-checksum.parquet` test files is:
+* `data/rle-dict-snappy-checksum.parquet`:
+ ```
+ [ Column "long_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+ [ Column "binary_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+ ```
+
+* `data/plain-dict-uncompressed-checksum.parquet`:
+ ```
+ [ Column "long_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+ [ Column "binary_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+ ```
+
+* `data/rle-dict-uncompressed-corrupt-checksum.parquet`:
+ ```
+ [ Column "long_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+ [ Column "binary_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+ ```
## Bloom Filter Files
Bloom filter examples have been generated by parquet-mr.
diff --git a/data/plain-dict-uncompressed-checksum.parquet b/data/plain-dict-uncompressed-checksum.parquet
new file mode 100644
index 0000000..f49f1c4
--- /dev/null
+++ b/data/plain-dict-uncompressed-checksum.parquet
Binary files differ
diff --git a/data/rle-dict-snappy-checksum.parquet b/data/rle-dict-snappy-checksum.parquet
new file mode 100644
index 0000000..4c183d8
--- /dev/null
+++ b/data/rle-dict-snappy-checksum.parquet
Binary files differ
diff --git a/data/rle-dict-uncompressed-corrupt-checksum.parquet b/data/rle-dict-uncompressed-corrupt-checksum.parquet
new file mode 100644
index 0000000..20e23aa
--- /dev/null
+++ b/data/rle-dict-uncompressed-corrupt-checksum.parquet
Binary files differ