| <!-- |
| - Licensed to the Apache Software Foundation (ASF) under one |
| - or more contributor license agreements. See the NOTICE file |
| - distributed with this work for additional information |
| - regarding copyright ownership. The ASF licenses this file |
| - to you under the Apache License, Version 2.0 (the |
| - "License"); you may not use this file except in compliance |
| - with the License. You may obtain a copy of the License at |
| - |
| - http://www.apache.org/licenses/LICENSE-2.0 |
| - |
| - Unless required by applicable law or agreed to in writing, |
| - software distributed under the License is distributed on an |
| - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| - KIND, either express or implied. See the License for the |
| - specific language governing permissions and limitations |
| - under the License. |
| --> |
| |
| Parquet encoding definitions |
| ==== |
| |
| This file contains the specification of all supported encodings. |
| |
| Unless otherwise stated in page or encoding documentation, any encoding can be |
| used with any page type. |
| |
| <a name="PLAIN"></a> |
| ### Plain: (PLAIN = 0) |
| |
| Supported Types: all |
| |
| This is the plain encoding that must be supported for types. It is |
| intended to be the simplest encoding. Values are encoded back to back. |
| |
| The plain encoding is used whenever a more efficient encoding can not be used. It |
| stores the data in the following format: |
| - BOOLEAN: [Bit Packed](#BITPACKED), LSB first |
| - INT32: 4 bytes little endian |
| - INT64: 8 bytes little endian |
| - INT96: 12 bytes little endian (deprecated) |
| - FLOAT: 4 bytes IEEE little endian |
| - DOUBLE: 8 bytes IEEE little endian |
| - BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array |
| - FIXED_LEN_BYTE_ARRAY: the bytes contained in the array |
| |
| For native types, this outputs the data as little endian. Floating |
| point types are encoded in IEEE. |
| |
| For the byte array type, it encodes the length as a 4 byte little |
| endian, followed by the bytes. |
| |
| ### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) |
| The dictionary encoding builds a dictionary of values encountered in a given column. The |
| dictionary will be stored in a dictionary page per column chunk. The values are stored as integers |
| using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size |
| or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is |
| written first, before the data pages of the column chunk. |
| |
| Dictionary page format: the entries in the dictionary using the [plain](#PLAIN) encoding. |
| |
| Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), |
| followed by the values encoded using RLE/Bit packed described above (with the given bit width). |
| |
| Using the `PLAIN_DICTIONARY` enum value is deprecated, use `RLE_DICTIONARY` |
| in a data page and `PLAIN` in a dictionary page for new Parquet files. |
| |
| <a name="RLE"></a> |
| ### Run Length Encoding / Bit-Packing Hybrid (RLE = 3) |
| |
| This encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values. |
| |
| The grammar for this encoding looks like this, given a fixed bit-width known in advance: |
| ``` |
| rle-bit-packed-hybrid: <length> <encoded-data> |
| // length is not always prepended, please check the table below for more detail |
| length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32) |
| encoded-data := <run>* |
| run := <bit-packed-run> | <rle-run> |
| bit-packed-run := <bit-packed-header> <bit-packed-values> |
| bit-packed-header := varint-encode(<bit-pack-scaled-run-len> << 1 | 1) |
| // we always bit-pack a multiple of 8 values at a time, so we only store the number of values / 8 |
| bit-pack-scaled-run-len := (bit-packed-run-len) / 8 |
| bit-packed-run-len := *see 3 below* |
| bit-packed-values := *see 1 below* |
| rle-run := <rle-header> <repeated-value> |
| rle-header := varint-encode( (rle-run-len) << 1) |
| rle-run-len := *see 3 below* |
| repeated-value := value that is repeated, using a fixed-width of round-up-to-next-byte(bit-width) |
| ``` |
| |
| 1. The bit-packing here is done in a different order than the one in the [deprecated bit-packing](#BITPACKED) encoding. |
| The values are packed from the least significant bit of each byte to the most significant bit, |
| though the order of the bits in each value remains in the usual order of most significant to least |
| significant. For example, to pack the same values as the example in the deprecated encoding above: |
| |
| The numbers 1 through 7 using bit width 3: |
| ``` |
| dec value: 0 1 2 3 4 5 6 7 |
| bit value: 000 001 010 011 100 101 110 111 |
| bit label: ABC DEF GHI JKL MNO PQR STU VWX |
| ``` |
| |
| would be encoded like this where spaces mark byte boundaries (3 bytes): |
| ``` |
| bit value: 10001000 11000110 11111010 |
| bit label: HIDEFABC RMNOJKLG VWXSTUPQ |
| ``` |
| |
| The reason for this packing order is to have fewer word-boundaries on little-endian hardware |
| when deserializing more than one byte at at time. This is because 4 bytes can be read into a |
| 32 bit register (or 8 bytes into a 64 bit register) and values can be unpacked just by |
| shifting and ORing with a mask. (to make this optimization work on a big-endian machine, |
| you would have to use the ordering used in the [deprecated bit-packing](#BITPACKED) encoding) |
| |
| 2. varint-encode() is ULEB-128 encoding, see https://en.wikipedia.org/wiki/LEB128 |
| |
| 3. bit-packed-run-len and rle-run-len must be in the range \[1, 2<sup>31</sup> - 1\]. |
| This means that a Parquet implementation can always store the run length in a signed |
| 32-bit integer. This length restriction was not part of the Parquet 2.5.0 and earlier |
| specifications, but longer runs were not readable by the most common Parquet |
| implementations so, in practice, were not safe for Parquet writers to emit. |
| |
| |
| Note that the RLE encoding method is only supported for the following types of |
| data: |
| |
| * Repetition and definition levels |
| * Dictionary indices |
| * Boolean values in data pages, as an alternative to PLAIN encoding |
| |
| Whether prepending the four-byte `length` to the `encoded-data` is summarized as the table below: |
| ``` |
| +--------------+------------------------+-----------------+ |
| | Page kind | RLE-encoded data kind | Prepend length? | |
| +--------------+------------------------+-----------------+ |
| | Data page v1 | Definition levels | Y | |
| | | Repetition levels | Y | |
| | | Dictionary indices | N | |
| | | Boolean values | Y | |
| +--------------+------------------------+-----------------+ |
| | Data page v2 | Definition levels | N | |
| | | Repetition levels | N | |
| | | Dictionary indices | N | |
| | | Boolean values | Y | |
| +--------------+------------------------+-----------------+ |
| ``` |
| |
| <a name="BITPACKED"></a> |
| ### Bit-packed (Deprecated) (BIT_PACKED = 4) |
| |
| This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding. |
| Each value is encoded back to back using a fixed width. |
| There is no padding between values (except for the last byte, which is padded with 0s). |
| For example, if the max repetition level was 3 (2 bits) and the max definition level as 3 |
| (2 bits), to encode 30 values, we would have 30 * 2 = 60 bits = 8 bytes. |
| |
| This implementation is deprecated because the [RLE/bit-packing](#RLE) hybrid is a superset of this implementation. |
| For compatibility reasons, this implementation packs values from the most significant bit to the least significant bit, |
| which is not the same as the [RLE/bit-packing](#RLE) hybrid. |
| |
| For example, the numbers 1 through 7 using bit width 3: |
| ``` |
| dec value: 0 1 2 3 4 5 6 7 |
| bit value: 000 001 010 011 100 101 110 111 |
| bit label: ABC DEF GHI JKL MNO PQR STU VWX |
| ``` |
| would be encoded like this where spaces mark byte boundaries (3 bytes): |
| ``` |
| bit value: 00000101 00111001 01110111 |
| bit label: ABCDEFGH IJKLMNOP QRSTUVWX |
| ``` |
| |
| Note that the BIT_PACKED encoding method is only supported for encoding |
| repetition and definition levels. |
| |
| <a name="DELTAENC"></a> |
| ### Delta Encoding (DELTA_BINARY_PACKED = 5) |
| Supported Types: INT32, INT64 |
| |
| This encoding is adapted from the Binary packing described in |
| ["Decoding billions of integers per second through vectorization"](https://arxiv.org/pdf/1209.2137v5.pdf) |
| by D. Lemire and L. Boytsov. |
| |
| In delta encoding we make use of variable length integers for storing various |
| numbers (not the deltas themselves). For unsigned values, we use ULEB128, |
| which is the unsigned version of LEB128 (https://en.wikipedia.org/wiki/LEB128#Unsigned_LEB128). |
| For signed values, we use zigzag encoding (https://developers.google.com/protocol-buffers/docs/encoding#signed-integers) |
| to map negative values to positive ones and apply ULEB128 on the result. |
| |
| Delta encoding consists of a header followed by blocks of delta encoded values |
| binary packed. Each block is made of miniblocks, each of them binary packed with its own bit width. |
| |
| The header is defined as follows: |
| ``` |
| <block size in values> <number of miniblocks in a block> <total value count> <first value> |
| ``` |
| * the block size is a multiple of 128; it is stored as a ULEB128 int |
| * the miniblock count per block is a divisor of the block size such that their |
| quotient, the number of values in a miniblock, is a multiple of 32; it is |
| stored as a ULEB128 int |
| * the total value count is stored as a ULEB128 int |
| * the first value is stored as a zigzag ULEB128 int |
| |
| Each block contains |
| ``` |
| <min delta> <list of bitwidths of miniblocks> <miniblocks> |
| ``` |
| * the min delta is a zigzag ULEB128 int (we compute a minimum as we need |
| positive integers for bit packing) |
| * the bitwidth of each block is stored as a byte |
| * each miniblock is a list of bit packed ints according to the bit width |
| stored at the beginning of the block |
| |
| To encode a block, we will: |
| |
| 1. Compute the differences between consecutive elements. For the first |
| element in the block, use the last element in the previous block or, in |
| the case of the first block, use the first value of the whole sequence, |
| stored in the header. |
| |
| 2. Compute the frame of reference (the minimum of the deltas in the block). |
| Subtract this min delta from all deltas in the block. This guarantees that |
| all values are non-negative. |
| |
| 3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed |
| by the bit widths of the miniblocks and the delta values (minus the min |
| delta) bit-packed per miniblock. |
| |
| Having multiple blocks allows us to adapt to changes in the data by changing |
| the frame of reference (the min delta) which can result in smaller values |
| after the subtraction which, again, means we can store them with a lower bit width. |
| |
| If there are not enough values to fill the last miniblock, we pad the miniblock |
| so that its length is always the number of values in a full miniblock multiplied |
| by the bit width. The values of the padding bits should be zero, but readers |
| must accept paddings consisting of arbitrary bits as well. |
| |
| If, in the last block, less than ```<number of miniblocks in a block>``` |
| miniblocks are needed to store the values, the bytes storing the bit widths |
| of the unneeded miniblocks are still present, their value should be zero, |
| but readers must accept arbitrary values as well. There are no additional |
| padding bytes for the miniblock bodies though, as if their bit widths were 0 |
| (regardless of the actual byte values). The reader knows when to stop reading |
| by keeping track of the number of values read. |
| |
| Subtractions in steps 1) and 2) may incur signed arithmetic overflow, and so |
| will the corresponding additions when decoding. Overflow should be allowed |
| and handled as wrapping around in 2's complement notation so that the original |
| values are correctly restituted. This may require explicit care in some programming |
| languages (for example by doing all arithmetic in the unsigned domain). Writers |
| must not use more bits when bit packing the miniblock data than would be required |
| to PLAIN encode the physical type (e.g. INT32 data must not use more than 32 bits). |
| |
| The following examples use 8 as the block size to keep the examples short, |
| but in real cases it would be invalid. |
| |
| #### Example 1 |
| 1, 2, 3, 4, 5 |
| |
| After step 1), we compute the deltas as: |
| |
| 1, 1, 1, 1 |
| |
| The minimum delta is 1 and after step 2, the relative deltas become: |
| |
| 0, 0, 0, 0 |
| |
| The final encoded data is: |
| |
| header: |
| 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value) |
| |
| block: |
| 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0) |
| |
| #### Example 2 |
| 7, 5, 3, 1, 2, 3, 4, 5, the deltas would be |
| |
| -2, -2, -2, 1, 1, 1, 1 |
| |
| The minimum is -2, so the relative deltas are: |
| |
| 0, 0, 0, 3, 3, 3, 3 |
| |
| The encoded data is |
| |
| header: |
| 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value) |
| |
| block: |
| -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits) |
| |
| #### Characteristics |
| This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page. |
| The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header. |
| |
| ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6) |
| |
| Supported Types: BYTE_ARRAY |
| |
| This encoding is always preferred over PLAIN for byte array columns. |
| |
| For this encoding, we will take all the byte array lengths and encode them using delta |
| encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data just |
| concatenated back to back. The expected savings is from the cost of encoding the lengths |
| and possibly better compression in the data (it is no longer interleaved with the lengths). |
| |
| The data stream looks like: |
| ``` |
| <Delta Encoded Lengths> <Byte Array Data> |
| ``` |
| |
| For example, if the data was "Hello", "World", "Foobar", "ABCDEF" |
| |
| then the encoded data would be comprised of the following segments: |
| - DeltaEncoding(5, 5, 6, 6) (the string lengths) |
| - "HelloWorldFoobarABCDEF" |
| |
| ### Delta Strings: (DELTA_BYTE_ARRAY = 7) |
| |
| Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY |
| |
| This is also known as incremental encoding or front compression: for each element in a |
| sequence of strings, store the prefix length of the previous entry plus the suffix. |
| |
| For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding. |
| |
| This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by |
| the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). |
| |
| For example, if the data was "axis", "axle", "babble", "babyhood" |
| |
| then the encoded data would be comprised of the following segments: |
| - DeltaEncoding(0, 2, 0, 3) (the prefix lengths) |
| - DeltaEncoding(4, 2, 6, 5) (the suffix lengths) |
| - "axislebabbleyhood" |
| |
| Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the redundancy. |
| |
| ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9) |
| |
| Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY |
| |
| This encoding does not reduce the size of the data but can lead to a significantly better |
| compression ratio and speed when a compression algorithm is used afterwards. |
| |
| This encoding creates K byte-streams of length N where K is the size in bytes of the data |
| type and N is the number of elements in the data sequence. For example, K is 4 for FLOAT |
| type and 8 for DOUBLE type. |
| |
| The bytes of each value are scattered to the corresponding streams. The 0-th byte goes to the |
| 0-th stream, the 1-st byte goes to the 1-st stream and so on. |
| The streams are concatenated in the following order: 0-th stream, 1-st stream, etc. |
| The total length of encoded streams is K * N bytes. Because it does not have any metadata |
| to indicate the total length, the end of the streams is also the end of data page. No padding |
| is allowed inside the data page. |
| |
| Example: |
| Original data is three 32-bit floats and for simplicity we look at their raw representation. |
| ``` |
| Element 0 Element 1 Element 2 |
| Bytes AA BB CC DD 00 11 22 33 A3 B4 C5 D6 |
| ``` |
| After applying the transformation, the data has the following representation: |
| ``` |
| Bytes AA 00 A3 BB 11 B4 CC 22 C5 DD 33 D6 |
| ``` |