Add large_string_map data file (#38)

* add chunked_string_map data file

* use BROTLI compression for greater space saving

* add description

* correct arrow type name

* rename file as suggested by reviewers

* update readme as suggested

* rename in docs as well

* Make wording more precise, remove Arrow vocabulary

* Add description of how the file was generated

* Add link to paragraph

---------

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
diff --git a/data/README.md b/data/README.md
index 638f0d1..27c381a 100644
--- a/data/README.md
+++ b/data/README.md
@@ -44,6 +44,7 @@
 | rle-dict-snappy-checksum.parquet                 | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
 | plain-dict-uncompressed-checksum.parquet         | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
 | rle-dict-uncompressed-corrupt-checksum.parquet   | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet                  | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
 
 TODO: Document what each file is in the table above.
 
@@ -202,3 +203,21 @@
 #   total_compressed_size: 84
 #   total_uncompressed_size: 80
 ```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type=pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({"arr": arr})
+
+pq.write_table(tab, "test.parquet", compression="BROTLI")
+```
+
+It is meant to exercise reading of structured data where each value
+(here a 1 GiB string key) is smaller than 2GB but the combined
+uncompressed column chunk size is greater than 2GB.
diff --git a/data/large_string_map.brotli.parquet b/data/large_string_map.brotli.parquet
new file mode 100644
index 0000000..fc5c8b2
--- /dev/null
+++ b/data/large_string_map.brotli.parquet
Binary files differ