Parquet encoding definitions

This file contains the specification of all supported encodings.

Plain:

Supported Types: all

This is the plain encoding that must be supported for all types. It is intended to be the simplest encoding: values are encoded back to back.

For native types, this outputs the data as little endian. Floating point types are encoded in IEEE 754 format.

For the byte array type, it encodes the length as a 4-byte little-endian integer, followed by the bytes.
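As an illustration, the plain layout described above can be sketched in Python with the standard `struct` module. The function names are hypothetical, and treating the byte array length prefix as unsigned is an assumption; the text only says it is 4 bytes, little endian.

```python
import struct

def plain_encode_int32(values):
    # Each value is a 4-byte little-endian signed int, written back to back.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_double(values):
    # Each value is an 8-byte IEEE 754 double in little-endian byte order.
    return b"".join(struct.pack("<d", v) for v in values)

def plain_encode_byte_arrays(values):
    # Each byte array is a 4-byte little-endian length (assumed unsigned)
    # followed by the raw bytes.
    return b"".join(struct.pack("<I", len(v)) + v for v in values)
```

For instance, `plain_encode_byte_arrays([b"ab"])` yields the four length bytes `02 00 00 00` followed by the two bytes of `ab`.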

GroupVarInt:

Supported Types: INT32, INT64

32-bit ints are encoded in groups of 4, with 1 leading byte to encode the byte lengths of the following 4 ints.

64-bit ints are encoded in groups of 5, with 2 leading bytes to encode the byte lengths of the following 5 ints.

For 32-bit ints, the leading byte contains 2 bits per int. Each 2-bit field stores the number of bytes used by that int, minus 1. For example, a leading byte of 0b00101101 indicates that:

  • the first int has 1 byte (0b00 + 1),
  • the second int has 3 bytes (0b10 + 1),
  • the third int has 4 bytes (0b11 + 1), and
  • the 4th int has 2 bytes (0b01 + 1).

In this case, the entire group would be: 1 + (1 + 3 + 4 + 2) = 11 bytes.
The bytes that follow the leading byte are just the int data encoded in little endian. With this example:

  • the first int starts at byte offset 1 with a max value of 0xFF,
  • the second int starts at byte offset 2 with a max value of 0xFFFFFF,
  • the third int starts at byte offset 5 with a max value of 0xFFFFFFFF, and
  • the 4th int starts at byte offset 9 with a max value of 0xFFFF.
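The lengths and offsets in this example can be reproduced with a short sketch. The function names are hypothetical; the bit order (first int in the two most significant bits) is taken from the 0b00101101 example above.

```python
def group_lengths_32(lead):
    # Two bits per int; the first int occupies the most significant pair.
    # Each field stores (byte length - 1).
    return [((lead >> shift) & 0b11) + 1 for shift in (6, 4, 2, 0)]

def group_offsets_32(lead):
    # Returns the byte offset of each int within the group and the
    # total group size, counting the leading byte at offset 0.
    offsets = []
    pos = 1  # data starts right after the leading byte
    for n in group_lengths_32(lead):
        offsets.append(pos)
        pos += n
    return offsets, pos
```

With `lead = 0b00101101`, the lengths come out as 1, 3, 4 and 2 bytes, the offsets as 1, 2, 5 and 9, and the total group size as 11 bytes.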

For 64-bit ints, the length of each of the 5 ints is encoded in 3 bits. Combined, this uses 15 bits and fits in 2 bytes. The most significant bit of the two bytes is unused. As in the 32-bit case, the data bytes follow the length bytes.

If the data does not fill a complete group (e.g. 3 32-bit ints), a complete group should still be output, with 0's filling in the remainder. For example, if the input is (1,2,3,4,5), the resulting encoding should behave as if the input were (1,2,3,4,5,0,0,0), and the two groups should be encoded back to back.
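A minimal sketch of the 32-bit encoder, including the zero padding of an incomplete group, might look as follows. The function names are hypothetical, and the sketch assumes the inputs are non-negative values below 2^32; how negative ints are handled is not covered by the text above.

```python
def byte_length(v):
    # Smallest number of bytes (1 to 4) that holds v; 0 still takes 1 byte.
    return max(1, (v.bit_length() + 7) // 8)

def group_varint_encode_32(values):
    # Pad with zeros so the input splits into complete groups of 4.
    padded = list(values) + [0] * (-len(values) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        lead = 0
        data = bytearray()
        for v in padded[i:i + 4]:
            n = byte_length(v)
            # First int ends up in the most significant 2 bits.
            lead = (lead << 2) | (n - 1)
            data += v.to_bytes(n, "little")
        out.append(lead)
        out += data
    return bytes(out)
```

Under these assumptions, the input (1,2,3,4,5) produces two groups of 5 bytes each: a leading byte of 0 followed by one data byte per int, with the second group holding 5, 0, 0, 0.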

Delta Encoding:

Supported Types: INT32, INT64

This encoding is adapted from the binary packing described in “Decoding billions of integers per second through vectorization” by D. Lemire and L. Boytsov.

Delta encoding consists of a header followed by blocks of delta-encoded values that are binary packed. Each block is made of miniblocks, each binary packed with its own bit width. When there are not enough values to fill a full block, we pad with zeros (added to the frame of reference). The header contains:

<block size in values> <number of miniblocks in a block> <total value count> <first value>

  • the block size is a multiple of 128, stored as a VLQ int;
  • the miniblock count per block is a divisor of the block size, stored as a VLQ int; the number of values in a miniblock is a multiple of 32;
  • the total value count is stored as a VLQ int;
  • the first value is stored as a zigzag VLQ int.
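The header fields can be sketched as below. This assumes that "VLQ int" means an unsigned base-128 varint with the least significant 7-bit group first and the high bit of each byte used as a continuation flag (the ULEB128 layout), and that "zigzag" maps signed values to unsigned ones as 0, -1, 1, -2, ... to 0, 1, 2, 3, .... Both are assumptions, since the byte layout is not spelled out here, and the function names are hypothetical.

```python
def vlq_encode(n):
    # Unsigned varint: 7 payload bits per byte, low group first,
    # high bit set on every byte except the last (assumed layout).
    if n < 0:
        raise ValueError("vlq_encode expects a non-negative int")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def zigzag(n):
    # Interleave negatives and positives so small magnitudes stay small.
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def encode_header(block_size, miniblock_count, total_values, first_value):
    # <block size> <miniblocks per block> <total value count> <first value>
    return (vlq_encode(block_size) + vlq_encode(miniblock_count)
            + vlq_encode(total_values) + vlq_encode(zigzag(first_value)))
```

With these assumptions, a block size of 128 takes two bytes (`80 01`), while small counts and first values fit in one byte each.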

Each block contains:

<min delta> <list of bitwidths of miniblocks> <miniblocks>

  • the min delta is a VLQ int (we compute a minimum as we need non-negative integers for bit packing);
  • the bit width of each miniblock is stored as a byte;
  • each miniblock is a list of bit-packed ints, packed according to the bit width stored at the beginning of the block.

Having multiple blocks allows us to escape values and restart from a new base value.

To encode each delta block, we will:

  1. Compute the deltas

  2. Encode the first value as a zigzag VLQ int

  3. For each block, compute the frame of reference (the minimum of the deltas) for the deltas. Subtracting it guarantees that all stored deltas are non-negative.

  4. Encode the frame of reference delta as a VLQ int, followed by the delta values (minus the minimum) bit packed per miniblock.

Steps 2 and 3 are skipped if the number of values in the block is 1.
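The steps above can be sketched for the simplified case of a single block holding a single miniblock, with no padding. The function name is hypothetical, and the sketch stops at the logical values (first value, minimum delta, relative deltas, bit width); it does not produce the final byte layout.

```python
def delta_block(values):
    # Step 1: deltas between consecutive values.
    first = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        # A single value needs no frame of reference or packed data.
        return first, None, [], 0
    # Step 3: the frame of reference is the minimum delta.
    min_delta = min(deltas)
    # Step 4: store deltas relative to the minimum (all non-negative).
    relative = [d - min_delta for d in deltas]
    # Bit width needed to hold the largest relative delta.
    bit_width = max(relative).bit_length()
    return first, min_delta, relative, bit_width
```

For the input 7, 5, 3, 1, 2, 3, 4, 5 this returns a first value of 7, a minimum delta of -2, relative deltas 0, 0, 0, 3, 3, 3, 3 and a bit width of 2.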

Example 1

1, 2, 3, 4, 5

After step 1, the deltas are:

1, 1, 1, 1

The minimum delta is 1, and after subtracting it (steps 3 and 4) the deltas become:

0, 0, 0, 0

The final encoded data is:

header: 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)

block: 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)

Example 2

For the input 7, 5, 3, 1, 2, 3, 4, 5, the deltas would be:

-2, -2, -2, 1, 1, 1, 1

The minimum is -2, so the relative deltas are:

0, 0, 0, 3, 3, 3, 3

The encoded data is:

header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)

block: -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
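Decoding reverses these steps: each value is the previous value plus the stored relative delta plus the minimum delta. A minimal sketch, working on the logical values rather than the packed bytes and using a hypothetical function name:

```python
def delta_decode(first, min_delta, relative):
    # Rebuild the original sequence from the first value, the frame of
    # reference (minimum delta) and the relative deltas.
    values = [first]
    for r in relative:
        values.append(values[-1] + r + min_delta)
    return values
```

Using the first value 7, the minimum delta -2 and the relative deltas 0, 0, 0, 3, 3, 3, 3 from Example 2, this reconstructs 7, 5, 3, 1, 2, 3, 4, 5.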