| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. default-domain:: cpp |
| .. highlight:: cpp |
| |
| .. cpp:namespace:: parquet |
| |
| ================================= |
| Reading and writing Parquet files |
| ================================= |
| |
| .. seealso:: |
| :ref:`Parquet reader and writer API reference <cpp-api-parquet>`. |
| |
| The `Parquet format <https://parquet.apache.org/docs/>`__ |
| is a space-efficient columnar storage format for complex data. The Parquet |
| C++ implementation is part of the Apache Arrow project and benefits |
| from tight integration with the Arrow C++ classes and facilities. |
| |
| Reading Parquet files |
| ===================== |
| |
| The :class:`arrow::FileReader` class reads data into Arrow Tables and Record |
| Batches. |
| |
The :class:`StreamReader` class allows data to be read using a C++ input-stream
approach, reading fields column by column and row by row. This approach is
offered for ease of use and type safety. It is of course also useful when data
must be streamed as files are read and written incrementally.
| |
Please note that the performance of the :class:`StreamReader` will not be as
good as that of :class:`arrow::FileReader`, due to the type checking and the
fact that column values are processed one at a time.
| |
| FileReader |
| ---------- |
| |
To read Parquet data into Arrow structures, use :class:`arrow::FileReader`.
Constructing one requires a :class:`::arrow::io::RandomAccessFile` instance
representing the input file. To read the whole file at once,
use :func:`arrow::FileReader::ReadTable`:
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc |
| :language: cpp |
| :start-after: arrow::Status ReadFullFile( |
| :end-before: return arrow::Status::OK(); |
| :emphasize-lines: 9-10,14 |
| :dedent: 2 |
| |
| Finer-grained options are available through the |
| :class:`arrow::FileReaderBuilder` helper class, which accepts the :class:`ReaderProperties` |
| and :class:`ArrowReaderProperties` classes. |
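
For example, a reader with customized properties might be constructed as in the
following sketch (assumed to run inside a function returning
:class:`arrow::Status`; the buffer size and batch size values are illustrative
only):

.. code-block:: cpp

   #include "parquet/arrow/reader.h"

   arrow::MemoryPool* pool = arrow::default_memory_pool();

   // General Parquet reader settings.
   auto reader_properties = parquet::ReaderProperties(pool);
   reader_properties.set_buffer_size(4096 * 4);
   reader_properties.enable_buffered_stream();

   // Arrow-specific reader settings.
   auto arrow_reader_props = parquet::ArrowReaderProperties();
   arrow_reader_props.set_batch_size(128 * 1024);  // default 64 * 1024

   parquet::arrow::FileReaderBuilder reader_builder;
   ARROW_RETURN_NOT_OK(reader_builder.OpenFile(
       "test.parquet", /*memory_map=*/false, reader_properties));
   reader_builder.memory_pool(pool);
   reader_builder.properties(arrow_reader_props);

   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());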
| |
| For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader` |
| method to retrieve a :class:`arrow::RecordBatchReader`. It will use the batch |
| size set in :class:`ArrowReaderProperties`. |
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc |
| :language: cpp |
| :start-after: arrow::Status ReadInBatches( |
| :end-before: return arrow::Status::OK(); |
| :emphasize-lines: 25 |
| :dedent: 2 |
| |
| .. seealso:: |
| |
| For reading multi-file datasets or pushing down filters to prune row groups, |
| see :ref:`Tabular Datasets<cpp-dataset>`. |
| |
| Performance and Memory Efficiency |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
For remote filesystems, use read coalescing (pre-buffering) to reduce the number of API calls:
| |
| .. code-block:: cpp |
| |
   auto arrow_reader_props = parquet::ArrowReaderProperties();
   arrow_reader_props.set_pre_buffer(true);
| |
| The defaults are generally tuned towards good performance, but parallel column |
| decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`: |
| |
| .. code-block:: cpp |
| |
| auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true); |
| |
If memory efficiency is more important than performance, then (see the sketch after this list):
| |
| #. Do *not* turn on read coalescing (pre-buffering) in :class:`parquet::ArrowReaderProperties`. |
| #. Read data in batches using :func:`arrow::FileReader::GetRecordBatchReader`. |
| #. Turn on ``enable_buffered_stream`` in :class:`parquet::ReaderProperties`. |
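
A minimal sketch combining these settings (reusing the property classes shown
above; the batch size value is illustrative):

.. code-block:: cpp

   auto reader_properties = parquet::ReaderProperties(arrow::default_memory_pool());
   reader_properties.enable_buffered_stream();   // (3) stream pages rather than
                                                 // loading whole column chunks
   auto arrow_reader_props = parquet::ArrowReaderProperties();
   arrow_reader_props.set_pre_buffer(false);     // (1) no read coalescing
   arrow_reader_props.set_batch_size(64 * 1024);

   // (2) ... then build the reader as shown earlier and consume it
   // incrementally via arrow::FileReader::GetRecordBatchReader().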
| |
| In addition, if you know certain columns contain many repeated values, you can |
| read them as :term:`dictionary encoded<dictionary-encoding>` columns. This is |
| enabled with the ``set_read_dictionary`` setting on :class:`ArrowReaderProperties`. |
If the files were written with Arrow C++ and ``store_schema`` was activated,
then the original Arrow schema will be read automatically and will override
this setting.
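
For example (a sketch; column 0 is assumed to contain many repeated values):

.. code-block:: cpp

   auto arrow_reader_props = parquet::ArrowReaderProperties();
   // Read the column at index 0 as an Arrow dictionary-encoded array.
   arrow_reader_props.set_read_dictionary(0, true);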
| |
| StreamReader |
| ------------ |
| |
The :class:`StreamReader` allows for Parquet files to be read using
standard C++ input operators, which ensures type safety.
| |
Please note that types must match the schema exactly, i.e. if the
schema field is an unsigned 16-bit integer then you must supply a
``uint16_t`` value.
| |
| Exceptions are used to signal errors. A :class:`ParquetException` is |
| thrown in the following circumstances: |
| |
* Attempt to read a field by supplying an incorrect type.
| |
| * Attempt to read beyond end of row. |
| |
| * Attempt to read beyond end of file. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/io/file.h" |
| #include "parquet/stream_reader.h" |
| |
| { |
| std::shared_ptr<arrow::io::ReadableFile> infile; |
| |
| PARQUET_ASSIGN_OR_THROW( |
| infile, |
| arrow::io::ReadableFile::Open("test.parquet")); |
| |
| parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)}; |
| |
| std::string article; |
| float price; |
| uint32_t quantity; |
| |
      // Read rows until the end of the file is reached.
      while (!stream.eof())
      {
         stream >> article >> price >> quantity >> parquet::EndRow;
         // ...
      }
| } |
| |
| Writing Parquet files |
| ===================== |
| |
| WriteTable |
| ---------- |
| |
| The :func:`arrow::WriteTable` function writes an entire |
| :class:`::arrow::Table` to an output file. |
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc |
| :language: cpp |
| :start-after: arrow::Status WriteFullFile( |
| :end-before: return arrow::Status::OK(); |
| :emphasize-lines: 19-21 |
| :dedent: 2 |
| |
| .. note:: |
| |
| Column compression is off by default in C++. See :ref:`below <parquet-writer-properties>` |
| for how to choose a compression codec in the writer properties. |
| |
| To write out data batch-by-batch, use :class:`arrow::FileWriter`. |
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc |
| :language: cpp |
| :start-after: arrow::Status WriteInBatches( |
| :end-before: return arrow::Status::OK(); |
| :emphasize-lines: 23-25,32,36 |
| :dedent: 2 |
| |
| StreamWriter |
| ------------ |
| |
| The :class:`StreamWriter` allows for Parquet files to be written using |
| standard C++ output operators, similar to reading with the :class:`StreamReader` |
| class. This type-safe approach also ensures that rows are written without |
| omitting fields and allows for new row groups to be created automatically |
(after a certain volume of data) or explicitly by using the :type:`EndRowGroup`
| stream modifier. |
| |
| Exceptions are used to signal errors. A :class:`ParquetException` is |
| thrown in the following circumstances: |
| |
| * Attempt to write a field using an incorrect type. |
| |
| * Attempt to write too many fields in a row. |
| |
| * Attempt to skip a required field. |
| |
| .. code-block:: cpp |
| |
| #include "arrow/io/file.h" |
| #include "parquet/stream_writer.h" |
| |
| { |
| std::shared_ptr<arrow::io::FileOutputStream> outfile; |
| |
| PARQUET_ASSIGN_OR_THROW( |
| outfile, |
| arrow::io::FileOutputStream::Open("test.parquet")); |
| |
| parquet::WriterProperties::Builder builder; |
| std::shared_ptr<parquet::schema::GroupNode> schema; |
| |
| // Set up builder with required compression type etc. |
| // Define schema. |
| // ... |
| |
| parquet::StreamWriter os{ |
| parquet::ParquetFileWriter::Open(outfile, schema, builder.build())}; |
| |
| // Loop over some data structure which provides the required |
| // fields to be written and write each row. |
| for (const auto& a : getArticles()) |
| { |
| os << a.name() << a.price() << a.quantity() << parquet::EndRow; |
| } |
| } |
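
In addition to ``EndRow``, a new row group can be started explicitly at any row
boundary by streaming the ``EndRowGroup`` modifier, e.g.:

.. code-block:: cpp

   os << parquet::EndRowGroup;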
| |
| .. _parquet-writer-properties: |
| |
| Writer properties |
| ----------------- |
| |
| To configure how Parquet files are written, use the :class:`WriterProperties::Builder`: |
| |
| .. code-block:: cpp |
| |
| #include "parquet/arrow/writer.h" |
| #include "arrow/util/type_fwd.h" |
| |
| using parquet::WriterProperties; |
| using parquet::ParquetVersion; |
| using parquet::ParquetDataPageVersion; |
| using arrow::Compression; |
| |
| std::shared_ptr<WriterProperties> props = WriterProperties::Builder() |
| .max_row_group_length(64 * 1024) |
| .created_by("My Application") |
| .version(ParquetVersion::PARQUET_2_6) |
| .data_page_version(ParquetDataPageVersion::V2) |
| .compression(Compression::SNAPPY) |
| .build(); |
| |
The ``max_row_group_length`` sets an upper bound on the number of rows per row
group, which takes precedence over the ``chunk_size`` passed in the write methods.
| |
You can set the version of Parquet to write with ``version``, which determines
which logical types are available. In addition, you can set the data page version
with ``data_page_version``. It is V1 by default; setting it to V2 allows more
optimal compression (pages are left uncompressed when compression brings no
space benefit), but not all readers support this data page version.
| |
| Compression is off by default, but to get the most out of Parquet, you should |
| also choose a compression codec. You can choose one for the whole file or |
| choose one for individual columns. If you choose a mix, the file-level option |
| will apply to columns that don't have a specific compression codec. See |
| :class:`::arrow::Compression` for options. |
| |
Column data encodings can likewise be applied at the file level or at the
column level. By default, the writer will attempt to dictionary-encode all
supported columns, unless the dictionary grows too large. This behavior can
be changed at the file level or at the column level with ``disable_dictionary()``.
When not using dictionary encoding, the writer will fall back to the encoding
set for the column or the overall file; by default ``Encoding::PLAIN``, but
this can be changed with ``encoding()``.
| |
| .. code-block:: cpp |
| |
| #include "parquet/arrow/writer.h" |
| #include "arrow/util/type_fwd.h" |
| |
| using parquet::WriterProperties; |
| using arrow::Compression; |
| using parquet::Encoding; |
| |
| std::shared_ptr<WriterProperties> props = WriterProperties::Builder() |
| .compression(Compression::SNAPPY) // Fallback |
| ->compression("colA", Compression::ZSTD) // Only applies to column "colA" |
| ->encoding(Encoding::BIT_PACKED) // Fallback |
| ->encoding("colB", Encoding::RLE) // Only applies to column "colB" |
| ->disable_dictionary("colB") // Never dictionary-encode column "colB" |
| ->build(); |
| |
Statistics are enabled by default for all columns. You can disable statistics for
all columns or for specific columns using ``disable_statistics`` on the builder.
There is a ``max_statistics_size`` which limits the maximum number of bytes that
may be used for min and max values, useful for types like strings or binary blobs.
If the page index is enabled for a column using ``enable_write_page_index``, then
statistics are not written to the page header, since they would duplicate the
information stored in the ColumnIndex.
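
For example, a sketch of these statistics-related settings (the 64-byte cap is
illustrative only):

.. code-block:: cpp

   using parquet::WriterProperties;

   std::shared_ptr<WriterProperties> props = WriterProperties::Builder()
       .enable_write_page_index()    // write ColumnIndex / OffsetIndex structures
       ->disable_statistics("colA")  // no statistics for column "colA"
       ->max_statistics_size(64)     // cap bytes used for min / max values
       ->build();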
| |
| There are also Arrow-specific settings that can be configured with |
| :class:`parquet::ArrowWriterProperties`: |
| |
| .. code-block:: cpp |
| |
| #include "parquet/arrow/writer.h" |
| |
| using parquet::ArrowWriterProperties; |
| |
| std::shared_ptr<ArrowWriterProperties> arrow_props = ArrowWriterProperties::Builder() |
| .enable_deprecated_int96_timestamps() // default False |
| ->store_schema() // default False |
| ->build(); |
| |
| These options mostly dictate how Arrow types are converted to Parquet types. |
| Turning on ``store_schema`` will cause the writer to store the serialized Arrow |
| schema within the file metadata. Since there is no bijection between Parquet |
| schemas and Arrow schemas, storing the Arrow schema allows the Arrow reader |
| to more faithfully recreate the original data. This mapping from Parquet types |
| back to original Arrow types includes: |
| |
| * Reading timestamps with original timezone information (Parquet does not |
| support time zones); |
| * Reading Arrow types from their storage types (such as Duration from int64 |
| columns); |
| * Reading string and binary columns back into large variants with 64-bit offsets; |
| * Reading back columns as dictionary encoded (whether an Arrow column and |
| the serialized Parquet version are dictionary encoded are independent). |
| |
| Supported Parquet features |
| ========================== |
| |
| The Parquet format has many features, and Parquet C++ supports a subset of them. |
| |
| Page types |
| ---------- |
| |
| +-------------------+---------+ |
| | Page type | Notes | |
| +===================+=========+ |
| | DATA_PAGE | | |
| +-------------------+---------+ |
| | DATA_PAGE_V2 | | |
| +-------------------+---------+ |
| | DICTIONARY_PAGE | | |
| +-------------------+---------+ |
| |
| *Unsupported page type:* INDEX_PAGE. When reading a Parquet file, pages of |
| this type are ignored. |
| |
| Compression |
| ----------- |
| |
| +-------------------+---------+ |
| | Compression codec | Notes | |
| +===================+=========+ |
| | SNAPPY | | |
| +-------------------+---------+ |
| | GZIP | | |
| +-------------------+---------+ |
| | BROTLI | | |
| +-------------------+---------+ |
| | LZ4 | \(1) | |
| +-------------------+---------+ |
| | ZSTD | | |
| +-------------------+---------+ |
| |
| * \(1) On the read side, Parquet C++ is able to decompress both the regular |
| LZ4 block format and the ad-hoc Hadoop LZ4 format used by the |
| `reference Parquet implementation <https://github.com/apache/parquet-mr>`__. |
| On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format. |
| |
| *Unsupported compression codec:* LZO. |
| |
| Encodings |
| --------- |
| |
| +--------------------------+----------+----------+---------+ |
| | Encoding | Reading | Writing | Notes | |
| +==========================+==========+==========+=========+ |
| | PLAIN | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| | PLAIN_DICTIONARY | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| | BIT_PACKED | ✓ | ✓ | \(1) | |
| +--------------------------+----------+----------+---------+ |
| | RLE | ✓ | ✓ | \(1) | |
| +--------------------------+----------+----------+---------+ |
| | RLE_DICTIONARY | ✓ | ✓ | \(2) | |
| +--------------------------+----------+----------+---------+ |
| | BYTE_STREAM_SPLIT | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| | DELTA_BINARY_PACKED | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| | DELTA_BYTE_ARRAY | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| | DELTA_LENGTH_BYTE_ARRAY | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| |
| * \(1) Only supported for encoding definition and repetition levels, |
| and boolean values. |
| |
| * \(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version |
| 2.4 or greater is selected in :func:`WriterProperties::version`. |
| |
| Types |
| ----- |
| |
| Physical types |
| ~~~~~~~~~~~~~~ |
| |
| +--------------------------+------------------------------------+------------+ |
| | Physical type | Mapped Arrow type | Notes | |
| +==========================+====================================+============+ |
| | BOOLEAN | Boolean | | |
| +--------------------------+------------------------------------+------------+ |
| | INT32 | Int32 / other | \(1) | |
| +--------------------------+------------------------------------+------------+ |
| | INT64 | Int64 / other | \(1) | |
| +--------------------------+------------------------------------+------------+ |
| | INT96 | Timestamp (nanoseconds) | \(2) | |
| +--------------------------+------------------------------------+------------+ |
| | FLOAT | Float32 | | |
| +--------------------------+------------------------------------+------------+ |
| | DOUBLE | Float64 | | |
| +--------------------------+------------------------------------+------------+ |
| | BYTE_ARRAY | Binary / LargeBinary / BinaryView | \(1) | |
| +--------------------------+------------------------------------+------------+ |
| | FIXED_LENGTH_BYTE_ARRAY | FixedSizeBinary / other | \(1) | |
| +--------------------------+------------------------------------+------------+ |
| |
| * \(1) Can be mapped to other Arrow types, depending on the logical type |
| (see table below). |
| |
| * \(2) On the write side, :func:`ArrowWriterProperties::support_deprecated_int96_timestamps` |
| must be enabled. |
| |
| Logical types |
| ~~~~~~~~~~~~~ |
| |
| Specific logical types can override the default Arrow type mapping for a given |
| physical type. |
| |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | Logical type | Physical type | Mapped Arrow type | Notes | |
| +===================+=============================+==============================+===========+ |
| | NULL | Any | Null | \(1) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | INT | INT32 | Int8 / UInt8 / Int16 / | | |
| | | | UInt16 / Int32 / UInt32 | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | INT | INT64 | Int64 / UInt64 | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| DECIMAL           | INT32 / INT64 / BYTE_ARRAY  | Decimal32 / Decimal64 /      | \(2)      |
|                   | / FIXED_LENGTH_BYTE_ARRAY   | Decimal128 / Decimal256      |           |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | DATE | INT32 | Date32 | \(3) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | TIME | INT32 | Time32 (milliseconds) | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | TIME | INT64 | Time64 (micro- or | | |
| | | | nanoseconds) | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | TIMESTAMP | INT64 | Timestamp (milli-, micro- | | |
| | | | or nanoseconds) | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | STRING | BYTE_ARRAY | String / LargeString / | | |
| | | | StringView | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | LIST | Any | List / LargeList | \(4) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | MAP | Any | Map | \(5) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | FLOAT16 | FIXED_LENGTH_BYTE_ARRAY | HalfFloat | | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | UUID | FIXED_LENGTH_BYTE_ARRAY | Extension (``arrow.uuid``) | \(6) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | JSON | BYTE_ARRAY | Extension (``arrow.json``) | \(6) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | GEOMETRY | BYTE_ARRAY | Extension (``geoarrow.wkb``) | \(6) \(7) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| | GEOGRAPHY | BYTE_ARRAY | Extension (``geoarrow.wkb``) | \(6) \(7) | |
| +-------------------+-----------------------------+------------------------------+-----------+ |
| |
| * \(1) On the write side, the Parquet physical type INT32 is generated. |
| |
* \(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted,
  unless ``store_decimal_as_integer`` is set to true.
| |
| * \(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32. |
| |
* \(4) On the write side, an Arrow FixedSizeList is also mapped to a Parquet LIST.
| |
| * \(5) On the read side, a key with multiple values does not get deduplicated, |
| in contradiction with the |
| `Parquet specification <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps>`__. |
| |
| * \(6) Requires that ``arrow_extensions_enabled`` in ``ArrowReaderProperties`` is ``true``. |
| When ``false``, the underlying storage type is read. |
| |
| * \(7) Requires that the ``geoarrow.wkb`` extension type is registered. |
| |
| *Unsupported logical types:* BSON. If such a type is encountered |
| when reading a Parquet file, the default physical type mapping is used (for |
| example, a Parquet BSON column may be read as Arrow Binary or FixedSizeBinary). |
| |
| Converted types |
| ~~~~~~~~~~~~~~~ |
| |
While converted types are deprecated in the Parquet format (they are superseded
| by logical types), they are recognized and emitted by the Parquet C++ |
| implementation so as to maximize compatibility with other Parquet |
| implementations. |
| |
| Special cases |
| ~~~~~~~~~~~~~ |
| |
| An Arrow Extension type is written out as its storage type. It can still |
| be recreated at read time using Parquet metadata (see "Roundtripping Arrow |
| types" below). Some extension types have Parquet LogicalType equivalents |
| (e.g., UUID, JSON, GEOMETRY, GEOGRAPHY). These are created automatically |
if the appropriate option is set in the ``ArrowReaderProperties``, even if
no Arrow schema was stored in the Parquet metadata.
| |
| An Arrow Dictionary type is written out as its value type. It can still |
| be recreated at read time using Parquet metadata (see "Roundtripping Arrow |
| types" below). |
| |
| Roundtripping Arrow types and schema |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| While there is no bijection between Arrow types and Parquet types, it is |
| possible to serialize the Arrow schema as part of the Parquet file metadata. |
| This is enabled using :func:`ArrowWriterProperties::store_schema`. |
| |
| On the read path, the serialized schema will be automatically recognized |
| and will recreate the original Arrow data, converting the Parquet data as |
| required. |
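
For example, enabling ``store_schema`` on the write side might look like the
following sketch (assuming ``table`` and ``outfile`` are set up as in the
earlier writing examples):

.. code-block:: cpp

   std::shared_ptr<parquet::ArrowWriterProperties> arrow_props =
       parquet::ArrowWriterProperties::Builder().store_schema()->build();

   ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
       *table, arrow::default_memory_pool(), outfile,
       /*chunk_size=*/64 * 1024, parquet::default_writer_properties(), arrow_props));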
| |
| As an example, when serializing an Arrow LargeList to Parquet: |
| |
| * The data is written out as a Parquet LIST |
| |
| * When read back, the Parquet LIST data is decoded as an Arrow LargeList if |
| :func:`ArrowWriterProperties::store_schema` was enabled when writing the file; |
| otherwise, it is decoded as an Arrow List. |
| |
| Parquet field id |
| """""""""""""""" |
| |
| The Parquet format supports an optional integer *field id* which can be assigned |
| to a given field. This is used for example in the |
| `Apache Iceberg specification <https://github.com/apache/iceberg/blob/main/format/spec.md#column-projection>`__. |
| |
| On the writer side, if ``PARQUET:field_id`` is present as a metadata key on an |
| Arrow field, then its value is parsed as a non-negative integer and is used as |
| the field id for the corresponding Parquet field. |
| |
| On the reader side, Arrow will convert such a field id to a metadata key named |
| ``PARQUET:field_id`` on the corresponding Arrow field. |
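
For example, a field id can be attached on the write side through Arrow field
metadata (a sketch; the field id ``1`` and the column name are illustrative):

.. code-block:: cpp

   #include "arrow/api.h"

   // Attach Parquet field id 1 to column "col" via field metadata.
   auto metadata = arrow::key_value_metadata({"PARQUET:field_id"}, {"1"});
   auto schema = arrow::schema(
       {arrow::field("col", arrow::int32(), /*nullable=*/true, metadata)});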
| |
| Serialization details |
| """"""""""""""""""""" |
| |
| The Arrow schema is serialized as a :ref:`Arrow IPC <format-ipc>` schema message, |
| then base64-encoded and stored under the ``ARROW:schema`` metadata key in |
| the Parquet file metadata. |
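
As a sketch, the stored key can be inspected through the file metadata
(assuming ``arrow_reader`` is a ``parquet::arrow::FileReader`` as constructed
earlier):

.. code-block:: cpp

   std::shared_ptr<const arrow::KeyValueMetadata> key_value_metadata =
       arrow_reader->parquet_reader()->metadata()->key_value_metadata();
   // Contains the "ARROW:schema" key if store_schema was enabled on write.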
| |
| |
| Limitations |
| ~~~~~~~~~~~ |
| |
Writing or reading back FixedSizeList data with null entries is not supported.
| |
| Encryption |
| ---------- |
| |
| Parquet C++ implements all features specified in the |
| `encryption specification <https://github.com/apache/parquet-format/blob/master/Encryption.md>`__, |
| except for encryption of column index and bloom filter modules. |
| |
| More specifically, Parquet C++ supports: |
| |
| * AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms. |
| * AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page, |
| Data PageHeader, Dictionary PageHeader module types. Other module types |
| (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not |
| supported. |
| * EncryptionWithFooterKey and EncryptionWithColumnKey modes. |
| * Encrypted Footer and Plaintext Footer modes. |
| |
| Configuration |
| ~~~~~~~~~~~~~ |
| |
Parquet encryption uses a ``parquet::encryption::CryptoFactory`` that has access to a
Key Management System (KMS), which stores the actual encryption keys, referenced by
key ids. The Parquet encryption configuration uses only key ids, never actual keys.
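
A minimal sketch of the wiring (``MyKmsClientFactory`` stands in for your own
``parquet::encryption::KmsClientFactory`` implementation):

.. code-block:: cpp

   auto crypto_factory = std::make_shared<parquet::encryption::CryptoFactory>();
   crypto_factory->RegisterKmsClientFactory(std::make_shared<MyKmsClientFactory>());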
| |
| Parquet metadata encryption is configured via ``parquet::encryption::EncryptionConfiguration``: |
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_column_encryption.cc |
| :language: cpp |
| :start-at: // Set write options with encryption configuration |
| :end-before: encryption_config->column_keys |
| :dedent: 2 |
| |
| If ``encryption_config->uniform_encryption`` is set to ``true``, then all columns are |
| encrypted with the same key as the Parquet metadata. Otherwise, individual |
| columns are encrypted with individual keys as configured via |
| ``encryption_config->column_keys``. This field expects a string of the format |
| ``"columnKeyID1:colName1,colName2;columnKeyID3:colName3..."``. |
| |
| .. literalinclude:: ../../../cpp/examples/arrow/parquet_column_encryption.cc |
| :language: cpp |
| :start-at: // Set write options with encryption configuration |
| :end-at: encryption_config->column_keys |
| :emphasize-lines: 4 |
| :dedent: 2 |
| |
| See the full `Parquet column encryption example <examples/parquet_column_encryption.html>`_. |
| |
| .. note:: |
| |
   Columns with nested fields (struct or map data types) can be encrypted either
   as a whole or at the level of individual leaf fields. Configure an encryption
   key for the root column name to encrypt all nested fields with this key, or
   configure keys for individual leaf nested fields.
| |
| Conventionally, the key and value fields of a map column ``m`` have the names |
| ``m.key_value.key`` and ``m.key_value.value``, respectively. |
| An inner field ``f`` of a struct column ``s`` has the name ``s.f``. |
| |
   With the above example, *all* inner fields are encrypted with the same key by
   configuring that key for columns ``m`` and ``s``, respectively.
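
   For instance, a hypothetical configuration along these lines (the key ids
   ``kc1`` and ``kc2`` are placeholders) encrypts all fields of map column ``m``
   with one key and struct field ``s.f`` with another:

   .. code-block:: cpp

      encryption_config->column_keys =
          "kc1:m.key_value.key,m.key_value.value;kc2:s.f";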
| |
| Miscellaneous |
| ------------- |
| |
| +--------------------------+----------+----------+---------+ |
| | Feature | Reading | Writing | Notes | |
| +==========================+==========+==========+=========+ |
| | Column Index | ✓ | ✓ | \(1) | |
| +--------------------------+----------+----------+---------+ |
| | Offset Index | ✓ | ✓ | \(1) | |
| +--------------------------+----------+----------+---------+ |
| | Bloom Filter | ✓ | ✓ | \(1) | |
| +--------------------------+----------+----------+---------+ |
| | CRC checksums | ✓ | ✓ | | |
| +--------------------------+----------+----------+---------+ |
| |
| * \(1) Access to the Column Index, Offset Index and Bloom Filter structures |
| is provided, but data read APIs do not currently make any use of them. |