This file contains the specification of all supported encodings.
Supported Types: all
This is the plain encoding that must be supported for types. It is intended to be the simplest encoding. Values are encoded back to back.
For native types, the data is written as little endian. Floating point types are encoded in IEEE 754 format.
For the byte array type, the length is encoded as a 4-byte little-endian integer, followed by the bytes.
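As a rough illustration of the rules above, the following Python sketch encodes a few value types with the standard `struct` module. The function names are hypothetical and only stand in for whatever a real writer would use; treating the length prefix as a signed 32-bit integer is also an assumption.

```python
import struct

def plain_encode_int32(values):
    # Each INT32 is written as 4 little-endian bytes, values back to back.
    return b"".join(struct.pack("<i", v) for v in values)

def plain_encode_double(values):
    # Each DOUBLE is written as an 8-byte little-endian IEEE 754 value.
    return b"".join(struct.pack("<d", v) for v in values)

def plain_encode_byte_arrays(values):
    # Each byte array is a 4-byte little-endian length followed by its bytes.
    # Assumption: the length is packed as a signed 32-bit integer.
    return b"".join(struct.pack("<i", len(v)) + v for v in values)
```

For instance, `plain_encode_byte_arrays([b"Hi"])` yields the four length bytes `02 00 00 00` followed by the two bytes of `Hi`.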
Supported Types: INT32, INT64
This encoding is adapted from the Binary packing described in “Decoding billions of integers per second through vectorization” by D. Lemire and L. Boytsov.
Delta encoding consists of a header followed by blocks of delta-encoded values, binary packed. Each block is made of miniblocks, each binary packed with its own bit width. When there are not enough values to fill a full block, we pad with zeros (added to the frame of reference). The header contains:
<block size in values> <number of miniblocks in a block> <total value count> <first value>
Each block contains:
<min delta> <list of bitwidths of miniblocks> <miniblocks>
Having multiple blocks allows us to escape values and restart from a new base value.
To encode each delta block, we will:
1. Compute the deltas.
2. Encode the first value as a zigzag VLQ int.
3. For each block, compute the frame of reference (the minimum of the deltas). Subtracting it from every delta guarantees that all stored deltas are non-negative.
4. Encode the frame of reference (the minimum delta) as a zigzag VLQ int, followed by the delta values (minus the minimum) bit packed per miniblock.
Steps 2 and 3 are skipped if the number of values in the block is 1.
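The zigzag VLQ integers used above can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names are hypothetical, and the byte order (least significant 7-bit group first, as in ULEB128) is an assumption, since the text only says "VLQ".

```python
def zigzag(n):
    # Map signed integers to unsigned ones so small magnitudes stay small:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def vlq(n):
    # Unsigned variable-length integer: 7 payload bits per byte.
    # Assumption: least significant group first, with the high bit set
    # on every byte except the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def zigzag_vlq(n):
    # Signed value -> zigzag -> variable-length bytes.
    return vlq(zigzag(n))
```

Under these assumptions a small negative value such as -2 still fits in a single byte, which is why zigzag is applied before the variable-length step.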
Example 1: the values to encode are 1, 2, 3, 4, 5
After step 1), we compute the deltas as:
1, 1, 1, 1
The minimum delta is 1 and, after subtracting it, the deltas become
0, 0, 0, 0
The final encoded data is:
header: 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)
block 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)
Example 2: for the values 7, 5, 3, 1, 2, 3, 4, 5, the deltas would be
-2, -2, -2, 1, 1, 1, 1
The minimum is -2, so the relative deltas are:
0, 0, 0, 3, 3, 3, 3
The encoded data is:
header: 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)
block -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
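The arithmetic in the two examples can be checked with a short Python sketch. It handles a single block only, ignores padding and the split into miniblocks, and prints bits in the left-to-right order used in the examples rather than in any on-disk bit layout; the function names are hypothetical.

```python
def delta_block(values):
    """Return (first_value, min_delta, bit_width, relative_deltas) for one block."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        # A single value has no deltas, so there is no block content.
        return values[0], None, None, []
    min_delta = min(deltas)
    # Subtracting the minimum makes every stored delta non-negative.
    relative = [d - min_delta for d in deltas]
    # Smallest bit width that can hold the largest relative delta.
    bit_width = max(relative).bit_length()
    return values[0], min_delta, bit_width, relative

def bit_string(relative, bit_width):
    # Render each relative delta on bit_width bits, for display only.
    if bit_width == 0:
        return ""
    return "".join(format(r, "0{}b".format(bit_width)) for r in relative)
```

Calling `delta_block([7, 5, 3, 1, 2, 3, 4, 5])` gives a first value of 7, a minimum delta of -2, a bit width of 2 and the relative deltas 0, 0, 0, 3, 3, 3, 3.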
Supported Types: BYTE_ARRAY
This encoding is always preferred over PLAIN for byte array columns.
For this encoding, we take all the byte array lengths and encode them using delta encoding (DELTA_BINARY_PACKED). The byte array data follows all of the length data, concatenated back to back. The expected savings come from the lower cost of encoding the lengths and possibly better compression of the data (it is no longer interleaved with the lengths).
The data stream looks like:
<delta encoded lengths> <byte array data>
For example, if the data was “Hello”, “World”, “Foobar”, “ABCDEF”:
The encoded data would be DeltaEncoding(5, 5, 6, 6) “HelloWorldFoobarABCDEF”
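The separation of lengths from data in this example can be sketched in Python as below. The sketch stops at producing the list of lengths and the concatenated bytes; the delta encoding of the lengths is left out, and the function names are hypothetical.

```python
def split_lengths_and_data(values):
    # Collect every length first, then concatenate the raw bytes back to back.
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data

def join_lengths_and_data(lengths, data):
    # Inverse operation: walk the data, slicing off one value per length.
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out
```

With the four strings from the example, the lengths come out as 5, 5, 6, 6 and the data as the single run `HelloWorldFoobarABCDEF`.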
Supported Types: BYTE_ARRAY
This is also known as incremental encoding or front compression: for each element in a sequence of strings, store the prefix length of the previous entry plus the suffix.
For a longer description, see http://en.wikipedia.org/wiki/Incremental_encoding.
This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
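A minimal Python sketch of the prefix and suffix split described above follows. It only produces the two sequences (prefix lengths and suffixes) and leaves their further encoding aside; the function names and the sample words are hypothetical.

```python
def front_encode(values):
    # For each value, record how many leading bytes it shares with the
    # previous value, and keep only the remaining suffix.
    prefix_lengths, suffixes = [], []
    previous = b""
    for v in values:
        n = 0
        limit = min(len(previous), len(v))
        while n < limit and previous[n] == v[n]:
            n += 1
        prefix_lengths.append(n)
        suffixes.append(v[n:])
        previous = v
    return prefix_lengths, suffixes

def front_decode(prefix_lengths, suffixes):
    # Rebuild each value from the prefix of the previous one plus its suffix.
    out, previous = [], b""
    for n, s in zip(prefix_lengths, suffixes):
        previous = previous[:n] + s
        out.append(previous)
    return out
```

For the sorted words `axis`, `axle`, `babble`, `babyhood`, the prefix lengths are 0, 2, 0, 3 and the suffixes are `axis`, `le`, `babble`, `yhood`. Sorted input tends to share longer prefixes, which is where this scheme saves the most.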