Add large_string_map data file (#38)

* add chunked_string_map data file
* use BROTLI compression for greater space saving
* add description
* correct arrow type name
* rename file as suggested by reviewers
* update readme as suggested
* rename in docs as well
* Make wording more precise, remove Arrow vocabulary
* Add description of how the file was generated
* Add link to paragraph

---------

Co-authored-by: Antoine Pitrou <pitrou@free.fr>

commit: d79a0101d90dfa3bbb10337626f57a3e8c4b5363
author: Arthur Passos <arthur.ti@outlook.com> Wed Jun 21 14:01:14 2023 -0300
committer: GitHub <noreply@github.com> Wed Jun 21 19:01:14 2023 +0200
tree: c8369c0ed6bc7c1b655e2e5b5b8963c0dc22b7c7
parent: b2e7cc755159196e3a068c8594f7acbaecfdaaac
diff --git a/data/README.md b/data/README.md
index 638f0d1..27c381a 100644
--- a/data/README.md
+++ b/data/README.md

@@ -44,6 +44,7 @@
 | rle-dict-snappy-checksum.parquet                 | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
 | plain-dict-uncompressed-checksum.parquet         | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
 | rle-dict-uncompressed-corrupt-checksum.parquet   | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
+| large_string_map.brotli.parquet       | MAP(STRING, INT32) with a string column chunk of more than 2GB. See [note](#large-string-map) below |
 
 TODO: Document what each file is in the table above.
 
@@ -202,3 +203,21 @@
 #   total_compressed_size: 84
 #   total_uncompressed_size: 80
 ```
+
+## Large string map
+
+The file `large_string_map.brotli.parquet` was generated with:
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+arr = pa.array([[("a" * 2**30, 1)]], type = pa.map_(pa.string(), pa.int32()))
+arr = pa.chunked_array([arr, arr])
+tab = pa.table({ "arr": arr })
+
+pq.write_table(tab, "test.parquet", compression='BROTLI')
+```
+
+It is meant to exercise reading of structured data where each value
+is smaller than 2GB but the combined uncompressed column chunk size
+is greater than 2GB.

diff --git a/data/large_string_map.brotli.parquet b/data/large_string_map.brotli.parquet
new file mode 100644
index 0000000..fc5c8b2
--- /dev/null
+++ b/data/large_string_map.brotli.parquet
Binary files differ
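
Note on the sizes involved: the README paragraph added in this commit says each value is smaller than 2GB while the combined uncompressed column chunk is larger than 2GB. The arithmetic behind that can be sketched as follows. This is a rough check, not part of the commit: it counts only the raw key bytes from the generation script, ignores any Parquet encoding overhead, and assumes that "2GB" refers to the signed 32-bit limit (the name `INT32_MAX` is introduced here for illustration).

```python
# Rough size check for the data produced by the generation script in the
# README hunk of this commit. Only raw key bytes are counted; Parquet
# encoding overhead is ignored (assumption).

INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold

key_bytes = 2**30       # each map key is "a" * 2**30; ASCII, so 1 byte per character
num_chunks = 2          # pa.chunked_array([arr, arr]) repeats the array twice

total_key_bytes = key_bytes * num_chunks

print(key_bytes <= INT32_MAX)        # True: a single key fits under the limit
print(total_key_bytes > INT32_MAX)   # True: both keys together exceed it
```

Under these assumptions a single key is 1 GiB and the two keys together are exactly 2 GiB, one byte past what a signed 32-bit length or offset can represent.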