| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. default-domain:: cpp |
| .. highlight:: cpp |
| |
| .. cpp:namespace:: parquet |
| |
| ================================= |
| Reading and writing Parquet files |
| ================================= |
| |
| .. seealso:: |
| :ref:`Parquet reader and writer API reference <cpp-api-parquet>`. |
| |
| The `Parquet format <https://parquet.apache.org/documentation/latest/>`__ |
| is a space-efficient columnar storage format for complex data. The Parquet |
| C++ implementation is part of the Apache Arrow project and benefits |
| from tight integration with the Arrow C++ classes and facilities. |
| |
| Supported Parquet features |
| ========================== |
| |
| The Parquet format has many features, and Parquet C++ supports a subset of them. |
| |
| Page types |
| ---------- |
| |
| +-------------------+---------+ |
| | Page type | Notes | |
| +===================+=========+ |
| | DATA_PAGE | | |
| +-------------------+---------+ |
| | DATA_PAGE_V2 | | |
| +-------------------+---------+ |
| | DICTIONARY_PAGE | | |
| +-------------------+---------+ |
| |
| *Unsupported page type:* INDEX_PAGE. When reading a Parquet file, pages of |
| this type are ignored. |
| |
| Compression |
| ----------- |
| |
| +-------------------+---------+ |
| | Compression codec | Notes | |
| +===================+=========+ |
| | SNAPPY | | |
| +-------------------+---------+ |
| | GZIP | | |
| +-------------------+---------+ |
| | BROTLI | | |
| +-------------------+---------+ |
| | LZ4 | \(1) | |
| +-------------------+---------+ |
| | ZSTD | | |
| +-------------------+---------+ |
| |
| * \(1) On the read side, Parquet C++ is able to decompress both the regular |
| LZ4 block format and the ad-hoc Hadoop LZ4 format used by the |
| `reference Parquet implementation <https://github.com/apache/parquet-mr>`__. |
| On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format. |
| |
| *Unsupported compression codec:* LZO. |
| |
| Encodings |
| --------- |
| |
| +--------------------------+---------+ |
| | Encoding | Notes | |
| +==========================+=========+ |
| | PLAIN | | |
| +--------------------------+---------+ |
| | PLAIN_DICTIONARY | | |
| +--------------------------+---------+ |
| | BIT_PACKED | | |
| +--------------------------+---------+ |
| | RLE | \(1) | |
| +--------------------------+---------+ |
| | RLE_DICTIONARY | \(2) | |
| +--------------------------+---------+ |
| | BYTE_STREAM_SPLIT | | |
| +--------------------------+---------+ |
| |
| * \(1) Only supported for encoding definition and repetition levels, not values. |
| |
| * \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version |
| 2.0 (or potentially greater) is selected in :func:`WriterProperties::version`. |
| |
| *Unsupported encodings:* DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, |
| DELTA_BYTE_ARRAY. |
| |
| Types |
| ----- |
| |
| Physical types |
| ~~~~~~~~~~~~~~ |
| |
| +--------------------------+-------------------------+------------+ |
| | Physical type | Mapped Arrow type | Notes | |
| +==========================+=========================+============+ |
| | BOOLEAN | Boolean | | |
| +--------------------------+-------------------------+------------+ |
| | INT32 | Int32 / other | \(1) | |
| +--------------------------+-------------------------+------------+ |
| | INT64 | Int64 / other | \(1) | |
| +--------------------------+-------------------------+------------+ |
| | INT96 | Timestamp (nanoseconds) | \(2) | |
| +--------------------------+-------------------------+------------+ |
| | FLOAT | Float32 | | |
| +--------------------------+-------------------------+------------+ |
| | DOUBLE | Float64 | | |
| +--------------------------+-------------------------+------------+ |
| | BYTE_ARRAY | Binary / other | \(1) \(3) | |
| +--------------------------+-------------------------+------------+ |
| | FIXED_LENGTH_BYTE_ARRAY | FixedSizeBinary / other | \(1) | |
| +--------------------------+-------------------------+------------+ |
| |
| * \(1) Can be mapped to other Arrow types, depending on the logical type |
| (see below). |
| |
| * \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps` |
| must be enabled. |
| |
| * \(3) On the write side, an Arrow LargeBinary can also mapped to BYTE_ARRAY. |
| |
| Logical types |
| ~~~~~~~~~~~~~ |
| |
| Specific logical types can override the default Arrow type mapping for a given |
| physical type. |
| |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | Logical type | Physical type | Mapped Arrow type | Notes | |
| +===================+=============================+============================+=========+ |
| | NULL | Any | Null | \(1) | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | INT | INT32 | Int8 / UInt8 / Int16 / | | |
| | | | UInt16 / Int32 / UInt32 | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | INT | INT64 | Int64 / UInt64 | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | DECIMAL | INT32 / INT64 / BYTE_ARRAY | Decimal128 / Decimal256 | \(2) | |
| | | / FIXED_LENGTH_BYTE_ARRAY | | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | DATE | INT32 | Date32 | \(3) | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | TIME | INT32 | Time32 (milliseconds) | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | TIME | INT64 | Time64 (micro- or | | |
| | | | nanoseconds) | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | TIMESTAMP | INT64 | Timestamp (milli-, micro- | | |
| | | | or nanoseconds) | | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | STRING | BYTE_ARRAY | Utf8 | \(4) | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | LIST | Any | List | \(5) | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| | MAP | Any | Map | \(6) | |
| +-------------------+-----------------------------+----------------------------+---------+ |
| |
| * \(1) On the write side, the Parquet physical type INT32 is generated. |
| |
| * \(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted. |
| |
| * \(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32. |
| |
| * \(4) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING. |
| |
| * \(5) On the write side, an Arrow LargeList or FixedSizedList is also mapped to |
| a Parquet LIST. |
| |
| * \(6) On the read side, a key with multiple values does not get deduplicated, |
| in contradiction with the |
| `Parquet specification <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps>`__. |
| |
| *Unsupported logical types:* JSON, BSON, UUID. If such a type is encountered |
| when reading a Parquet file, the default physical type mapping is used (for |
| example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary). |
| |
| Converted types |
| ~~~~~~~~~~~~~~~ |
| |
| While converted types are deprecated in the Parquet format (they are superceded |
| by logical types), they are recognized and emitted by the Parquet C++ |
| implementation so as to maximize compatibility with other Parquet |
| implementations. |
| |
| Special cases |
| ~~~~~~~~~~~~~ |
| |
| An Arrow Extension type is written out as its storage type. It can still |
| be recreated at read time using Parquet metadata (see "Roundtripping Arrow |
| types" below). |
| |
| An Arrow Dictionary type is written out as its value type. It can still |
| be recreated at read time using Parquet metadata (see "Roundtripping Arrow |
| types" below). |
| |
| Roundtripping Arrow types |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| While there is no bijection between Arrow types and Parquet types, it is |
| possible to serialize the Arrow schema as part of the Parquet file metadata. |
| This is enabled using :func:`ArrowWriterProperties::store_schema`. |
| |
| On the read path, the serialized schema will be automatically recognized |
| and will recreate the original Arrow data, converting the Parquet data as |
| required (for example, a LargeList will be recreated from the Parquet LIST |
| type). |
| |
| As an exemple, when serializing an Arrow LargeList to Parquet: |
| |
| * The data is written out as a Parquet LIST |
| |
| * When read back, the Parquet LIST data is decoded as an Arrow LargeList if |
| :func:`ArrowWriterProperties::store_schema` was enabled when writing the file; |
| otherwise, it is decoded as an Arrow List. |
| |
| Serialization details |
| """"""""""""""""""""" |
| |
| The Arrow schema is serialized as a :ref:`Arrow IPC <format-ipc>` schema message, |
| then base64-encoded and stored under the ``ARROW:schema`` metadata key in |
| the Parquet file metadata. |
| |
| Limitations |
| ~~~~~~~~~~~ |
| |
| Writing or reading back FixedSizedList data with null entries is not supported. |
| |
| Encryption |
| ---------- |
| |
| Parquet C++ implements all features specified in the |
| `encryption specification <https://github.com/apache/parquet-format/blob/master/Encryption.md>`__, |
| except for encryption of column index and bloom filter modules. |
| |
| More specifically, Parquet C++ supports: |
| |
| * AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms. |
| * AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page, |
| Data PageHeader, Dictionary PageHeader module types. Other module types |
| (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not |
| supported. |
| * EncryptionWithFooterKey and EncryptionWithColumnKey modes. |
| * Encrypted Footer and Plaintext Footer modes. |
| |
| |
| Reading Parquet files |
| ===================== |
| |
| The :class:`arrow::FileReader` class reads data for an entire |
| file or row group into an :class:`::arrow::Table`. |
| |
| The :class:`StreamReader` and :class:`StreamWriter` classes allow for |
| data to be written using a C++ input/output streams approach to |
| read/write fields column by column and row by row. This approach is |
| offered for ease of use and type-safety. It is of course also useful |
| when data must be streamed as files are read and written |
| incrementally. |
| |
| Please note that the performance of the :class:`StreamReader` and |
| :class:`StreamWriter` classes will not be as good due to the type |
| checking and the fact that column values are processed one at a time. |
| |
| FileReader |
| ---------- |
| |
| The Parquet :class:`arrow::FileReader` requires a |
| :class:`::arrow::io::RandomAccessFile` instance representing the input |
| file. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/parquet/arrow/reader.h" |
| |
| { |
| // ... |
| arrow::Status st; |
| arrow::MemoryPool* pool = default_memory_pool(); |
| std::shared_ptr<arrow::io::RandomAccessFile> input = ...; |
| |
| // Open Parquet file reader |
| std::unique_ptr<parquet::arrow::FileReader> arrow_reader; |
| st = parquet::arrow::OpenFile(input, pool, &arrow_reader); |
| if (!st.ok()) { |
| // Handle error instantiating file reader... |
| } |
| |
| // Read entire file as a single Arrow table |
| std::shared_ptr<arrow::Table> table; |
| st = arrow_reader->ReadTable(&table); |
| if (!st.ok()) { |
| // Handle error reading Parquet data... |
| } |
| } |
| |
| Finer-grained options are available through the |
| :class:`arrow::FileReaderBuilder` helper class. |
| |
| .. TODO write section about performance and memory efficiency |
| |
| StreamReader |
| ------------ |
| |
| The :class:`StreamReader` allows for Parquet files to be read using |
| standard C++ input operators which ensures type-safety. |
| |
| Please note that types must match the schema exactly i.e. if the |
| schema field is an unsigned 16-bit integer then you must supply a |
| uint16_t type. |
| |
| Exceptions are used to signal errors. A :class:`ParquetException` is |
| thrown in the following circumstances: |
| |
| * Attempt to read field by supplying the incorrect type. |
| |
| * Attempt to read beyond end of row. |
| |
| * Attempt to read beyond end of file. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/io/file.h" |
| #include "parquet/stream_reader.h" |
| |
| { |
| std::shared_ptr<arrow::io::ReadableFile> infile; |
| |
| PARQUET_ASSIGN_OR_THROW( |
| infile, |
| arrow::io::ReadableFile::Open("test.parquet")); |
| |
| parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)}; |
| |
| std::string article; |
| float price; |
| uint32_t quantity; |
| |
| while ( !os.eof() ) |
| { |
| os >> article >> price >> quantity >> parquet::EndRow; |
| // ... |
| } |
| } |
| |
| Writing Parquet files |
| ===================== |
| |
| WriteTable |
| ---------- |
| |
| The :func:`arrow::WriteTable` function writes an entire |
| :class:`::arrow::Table` to an output file. |
| |
| .. code-block:: cpp |
| |
| #include "parquet/arrow/writer.h" |
| |
| { |
| std::shared_ptr<arrow::io::FileOutputStream> outfile; |
| PARQUET_ASSIGN_OR_THROW( |
| outfile, |
| arrow::io::FileOutputStream::Open("test.parquet")); |
| |
| PARQUET_THROW_NOT_OK( |
| parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3)); |
| } |
| |
| StreamWriter |
| ------------ |
| |
| The :class:`StreamWriter` allows for Parquet files to be written using |
| standard C++ output operators. This type-safe approach also ensures |
| that rows are written without omitting fields and allows for new row |
| groups to be created automatically (after certain volume of data) or |
| explicitly by using the :type:`EndRowGroup` stream modifier. |
| |
| Exceptions are used to signal errors. A :class:`ParquetException` is |
| thrown in the following circumstances: |
| |
| * Attempt to write a field using an incorrect type. |
| |
| * Attempt to write too many fields in a row. |
| |
| * Attempt to skip a required field. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/io/file.h" |
| #include "parquet/stream_writer.h" |
| |
| { |
| std::shared_ptr<arrow::io::FileOutputStream> outfile; |
| |
| PARQUET_ASSIGN_OR_THROW( |
| outfile, |
| arrow::io::FileOutputStream::Open("test.parquet")); |
| |
| parquet::WriterProperties::Builder builder; |
| std::shared_ptr<parquet::schema::GroupNode> schema; |
| |
| // Set up builder with required compression type etc. |
| // Define schema. |
| // ... |
| |
| parquet::StreamWriter os{ |
| parquet::ParquetFileWriter::Open(outfile, schema, builder.build())}; |
| |
| // Loop over some data structure which provides the required |
| // fields to be written and write each row. |
| for (const auto& a : getArticles()) |
| { |
| os << a.name() << a.price() << a.quantity() << parquet::EndRow; |
| } |
| } |