Apache Parquet

  1. f038ca1 Add Hugging Face public dataset footers (#3) by Alkis Evlogimenos · 7 months ago main
  2. 213ae16 Add instructions for footer donations by Alkis Evlogimenos · 8 months ago
  3. fec07c6 First by Fokko · 8 months ago

Parquet benchmark data

This repository contains Parquet benchmark data. Such data helps optimize Parquet implementations and advance the Parquet format itself.

At this point the community is requesting donations of Parquet footers, especially footers that are large and slow to parse or process. Typically these are footers of wide schemata, which result from a large number of individual columns, deeply nested structs, or both.

To make donating Parquet footers easier, we have built a binary, parquet-dump-footer, as part of the Parquet tools. This utility extracts footers from Parquet files, scrubs binary data for privacy reasons, and can pretty-print (--debug) the result for inspection before submission.
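For orientation, here is a minimal sketch of where the footer sits in an unencrypted Parquet file: the file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1, and the Thrift-encoded footer lies directly before those 8 bytes. The function name read_raw_footer is hypothetical and is not part of parquet-dump-footer. The sketch does not scrub anything, so its output may still contain sensitive values such as column statistics; it illustrates the layout only and is not a substitute for the utility.

```python
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def read_raw_footer(path):
    """Return the raw, UNSCRUBBED Thrift-encoded footer bytes.

    Hypothetical helper for illustration only; do not donate its output.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # move to the end of the file
        file_size = f.tell()
        # Smallest possible file: leading magic + footer length + trailing magic.
        if file_size < 12:
            raise ValueError("file too small to be a Parquet file")

        # The last 8 bytes hold the footer length and the trailing magic.
        f.seek(file_size - 8)
        tail = f.read(8)
        footer_len = struct.unpack("<I", tail[:4])[0]
        if tail[4:] != MAGIC:
            raise ValueError("missing PAR1 trailing magic")
        if footer_len > file_size - 12:
            raise ValueError("footer length exceeds file size")

        # The footer sits immediately before the 8-byte tail.
        f.seek(file_size - 8 - footer_len)
        return f.read(footer_len)
```

Calling len(read_raw_footer(path)) on a local file gives a quick indication of whether its footer is large enough to be an interesting donation candidate.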

When you are ready to donate a footer, please open a PR against this repository that adds your footer under footer/<name>.footer.

Use parquet-dump-footer --help for an explanation of all the options.

Alternate parquet-dump-footer binary

Prebuilt binaries for different architectures are available in this repo in bin/parquet-dump-footer.zip. They are built using the following CMake configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_ACERO=OFF -DARROW_BUILD_UTILITIES=OFF -DARROW_COMPUTE=OFF -DARROW_CSV=OFF -DARROW_DATASET=OFF -DARROW_FILESYSTEM=ON -DARROW_AZURE=ON -DARROW_HDFS=OFF -DARROW_GCS=ON -DARROW_IPC=OFF -DARROW_PARQUET=ON -DARROW_S3=ON -DARROW_JSON=OFF -DARROW_MIMALLOC=OFF -DARROW_JEMALLOC=OFF -DARROW_SUBSTRAIT=OFF -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_DEPENDENCY_USE_SHARED=OFF -DARROW_BUILD_STATIC=ON -DARROW_BUILD_SHARED=OFF -DPARQUET_BUILD_EXECUTABLES=ON