ARROW-17904: [Parquet] Add data files with checksums on datapage v1 (#29)
diff --git a/data/README.md b/data/README.md
index 34d60ec..398a88c 100644
--- a/data/README.md
+++ b/data/README.md
@@ -32,6 +32,9 @@
| alltypes_tiny_pages.parquet | small page sizes with dictionary encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
| alltypes_tiny_pages_plain.parquet | small page sizes with plain encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
| rle_boolean_encoding.parquet | optional boolean columns with RLE encoding |
+| datapage_v1-uncompressed-checksum.parquet | uncompressed INT32 columns in v1 data pages with a matching CRC |
+| datapage_v1-snappy-compressed-checksum.parquet | Snappy-compressed INT32 columns in v1 data pages with a matching CRC |
+| datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
TODO: Document what each file is in the table above.
@@ -65,3 +68,33 @@
cpp/src/parquet/encryption-read-configurations-test.cc
cpp/src/parquet/test-encryption-util.h
```
+
+## Checksum Files
+
+The schema for the `datapage_v1-*-checksum.parquet` test files is:
+```
+message m {
+ required int32 a;
+ required int32 b;
+}
+```
+
+The detailed structure for these files is as follows:
+
+* `data/datapage_v1-uncompressed-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ ```
+
+* `data/datapage_v1-snappy-compressed-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+ ```
+
+* `data/datapage_v1-corrupt-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
+ ```
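
A reader consuming these files would typically recompute each page's checksum and compare it with the value stored in the page header. The sketch below illustrates that comparison only. It assumes the Parquet format's convention that the CRC is a standard CRC-32 (the gzip polynomial) over the page bytes as written to disk, excluding the page header, and that the header stores it as a signed 32-bit integer. The names `page_crc_matches` and `to_signed_i32` are hypothetical helpers, and the payload is synthetic rather than read from the files above.

```python
import zlib


def to_signed_i32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer,
    mimicking how the checksum would appear in a Thrift i32 field."""
    return value - (1 << 32) if value >= (1 << 31) else value


def page_crc_matches(page_bytes: bytes, header_crc: int) -> bool:
    """Return True if the CRC-32 of the on-disk page bytes equals the stored CRC.

    page_bytes: the page body exactly as stored (compressed if the column
                chunk is compressed), without the page header.
    header_crc: the checksum from the page header, signed or unsigned.
    """
    computed = zlib.crc32(page_bytes) & 0xFFFFFFFF
    # Mask the stored value so signed and unsigned forms compare equal.
    return computed == (header_crc & 0xFFFFFFFF)


# Synthetic stand-in for a page body (not taken from the test files).
payload = bytes(range(16))
good_crc = to_signed_i32(zlib.crc32(payload))
bad_crc = to_signed_i32(zlib.crc32(payload) ^ 0x1)  # flip one bit to simulate corruption

print(page_crc_matches(payload, good_crc))  # matching checksum
print(page_crc_matches(payload, bad_crc))   # mismatching checksum
```

Under these assumptions, every page in the uncompressed and Snappy files should pass such a check, while in `datapage_v1-corrupt-checksum.parquet` page 0 of column "a" and page 1 of column "b" should fail it.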
diff --git a/data/datapage_v1-corrupt-checksum.parquet b/data/datapage_v1-corrupt-checksum.parquet
new file mode 100644
index 0000000..d832edc
--- /dev/null
+++ b/data/datapage_v1-corrupt-checksum.parquet
Binary files differ
diff --git a/data/datapage_v1-snappy-compressed-checksum.parquet b/data/datapage_v1-snappy-compressed-checksum.parquet
new file mode 100644
index 0000000..8fe2c86
--- /dev/null
+++ b/data/datapage_v1-snappy-compressed-checksum.parquet
Binary files differ
diff --git a/data/datapage_v1-uncompressed-checksum.parquet b/data/datapage_v1-uncompressed-checksum.parquet
new file mode 100644
index 0000000..78044f0
--- /dev/null
+++ b/data/datapage_v1-uncompressed-checksum.parquet
Binary files differ