CFile

CFile is an on-disk columnar storage format which holds data and associated B-Tree indexes. Within a DiskRowSet each column and DeltaFile will map to a single CFile. In addition, the DiskRowSet's bloomfilter will be stored in a CFile, and if the table has a compound primary key, then the associated ad-hoc index will be stored in a separate CFile.

Despite the name, a CFile does not necessarily correspond one-to-one with a file in the filesystem. The mapping of CFiles to the filesystem is determined by the BlockManager abstraction. A CFile is written to a single BlockManager Block (not to be confused with CFile blocks, which are internal to CFiles, and discussed for the remainder of this document).

A CFile is composed of a header, a variable number of blocks, and a footer. There are a few different types of blocks, each with a different format: data blocks, nullable data blocks, index blocks, and dictionary blocks.

Data blocks consist of a sequence of consecutive values. Each value is assigned an ordinal index, or offset, which is unique within the CFile. Thus, a block can be identified within a CFile by the range of ordinal indexes it contains; for instance, the first three blocks of a CFile might have the ordinal index ranges [0, 55], [56, 132], and [133, 511].

In order to support efficient seeks to an arbitrary ordinal index within the CFile, a positional index may be included which contains a mapping of starting ordinal index to data block. See CFile Index for more details.

For CFiles which contain data in sorted order (for example, a CFile containing the Primary Key column), a value index may be created. The value index supports efficiently seeking to an arbitrary value in the CFile. See CFile Index for more details.

File Format

The CFile header, blocks, and footer are written consecutively to the file without padding.

Header Layout

field	width (bytes)	notes
magic	8	see below
length	4	32-bit unsigned length of the following Protobuf message
message	variable	encoded `CFileHeaderPB` Protobuf message
checksum	4	optional CRC-32 checksum of magic, length, and message

Footer Layout

field	width (bytes)	notes
checksum	4	optional CRC-32 checksum of message, magic, and length
message	variable	encoded `CFileFooterPB` Protobuf message
magic	8	see below
length	4	unsigned 32-bit length of the preceding Protobuf message

Versioning

The CFile header and footer each contain a ‘magic’ string which indicates the CFile's version.

version	string
CFile v1	`kuducfil`
CFile v2	`kuducfl2`

CFile v2 was introduced in Kudu 1.3 in order to make a breaking change in the CFile format. At the same time, two sets of feature flags were introduced to the footer in order to make more granular forwards compatibility possible in the future. See commit f82cf6918c for details.

Data Block Layout

field	width (bytes)	notes
data	variable	encoded data values
checksum	4	optional CRC-32 checksum of the data

If the CFile is configured to use compression then the encoded data is compressed before the checksum is appended.

Nullable Data Block Layout

Columns marked as nullable in the schema are stored in a nullable block, which includes a bitmap which tracks which rows (ordinal offsets) are null and not null.

field	width (bytes)	notes
value count	variable	LEB128 encoded count of values
bitmap length	variable	LEB128 encoded length of the following null bitmap
null bitmap	variable	RLE encoded bitmap
data	variable	encoded non-null data values
checksum	4	optional CRC-32 checksum of the value count, bitmap length, null bitmap, and data

If the CFile is configured to use compression then the value count, bitmap length, null bitmap, and encoded data are compressed before the checksum is appended.

Checksums

Checksums can be optionally written and verified.

When checksums for the header, data, and footer are written in the CFile, the incompatible_features bitset in the CFileFooterPB message is used. A “checksum” bit is set to ensure the reader knows if checksums exist.

When reading a CFile the footer should be read first to find if the file contains checksums. If the incompatible_features bitset indicates checksums exist, the reader can optionally validate them against the read data.

Data Encodings

Block data is encoded before being stored. If compression is enabled, then the block is encoded first, and then compressed.

Data blocks are best-effort size limited to --cfile_default_block_size bytes, at which point a new block will be added to the CFile.

Plain Encoding

The simplest encoding type is plain encoding, which stores values in their natural representation.

Plain encoding for fixed-width (integer) types consists of the little-endian values, followed by a trailer containing two little-endian uint32s: the count of values, and the ordinal position of the block.

Plain encoding for BINARY or STRING (variable-width) columns is laid out as follows:

ordinal position	little-endian `uint32`
num values	little-endian `uint32`
offsets position	little-endian `uint32`
values	sequential values with no padding
offsets	group-varint frame-of-reference encoded value offsets

Dictionary Encoding

Dictionary encoding may be used for BINARY or STRING columns. All dictionary encoded blocks in a CFile share the same dictionary. If the dictionary becomes full, subsequent blocks in the CFile switch to plain encoding. The dictionary is stored as a plain encoded binary block, and the data codewords are stored as bitshuffle encoded uint32s.

Prefix Encoding

Currently used for BINARY and STRING data blocks. This is based on the encoding used by LevelDB for its data blocks, more or less.

Starts with a header of four Group Varint encoded uint32s:

field
number of elements
ordinal position
restart interval
unused

Followed by prefix-compressed values. Each value is stored relative to the value preceding it using the following format:

field	type
`shared_bytes`	`varint32`
`unshared_bytes`	`varint32`
`delta`	`char[unshared_bytes]`

Periodically there will be a “restart point” which is necessary for faster binary searching. At a “restart point”, shared_bytes is 0 but otherwise the encoding is the same.

At the end of the block is a trailer with the offsets of the restart points:

field	type
`restart_points`	`uint32[num_restarts]`
`num_restarts`	`uint32`

The restart points are offsets relative to the start of the block, including the header.

Run-length Encoding

Integer and bool types may be run-length encoded, which has a simply layout: the number of values and ordinal position of the block as little-endian uint32s, followed by the run-length encoded data. The encoder will automatically fall back to a bit-packed (literal) encoding if the data is not conducive to run-length encoding.

Bitshuffle Encoding

Integer types may be bitshuffle encoded. Bitshuffle-encoded blocks are automatically LZ4 compressed, so additional compression is not recommended.

Group Varint Frame-of-Reference Encoding

Used internally for UINT32 blocks. Kudu does not support unsigned integer columns, so this encoding is not used for column data.

Starts with a header of four group-varint encoded uint32s:

field
number of elements
minimum element
ordinal position
unused

The ordinal position is the ordinal position of the first item in the file. For example, the first data block in the file has ordinal position 0. If that block had 400 values, then the second data block would have ordinal position 400.

Followed by the actual data, each set of 4 integers using group-varint. The last group is padded with 0s. Each integer is relative to the min element in the header.

CFile Index

CFiles may optionally include a positional (ordinal) index and value index. Positional indexes are used to satisfy queries such as: “seek to the data block containing the Nth entry in this CFile”. Value indexes are used to satisfy queries such as: “seek to the data block containing 123 in this CFile”. Value indexes are only present in CFiles which contain data stored in sorted order (e.g. the primary key column).

Ordinal and value indices are organized as a B-Tree of index blocks. As data blocks are written, entries are appended to the end of a leaf index block. When a leaf index block reaches the configured index block size, it is added to another index block higher up the tree, and a new leaf is started. If the intermediate index block fills, it will start a new intermediate block and spill into an even higher-layer internal block.

For example:

                      [Int 0]
           -----------------------------
           |                           |
        [Int 1]                     [Int 2]
   -----------------             --------------
   |       |       |             |            |
[Leaf 0]  ...   [Leaf N]     [Leaf N+1]   [Leaf N+2]

In this case, we wrote N leaf blocks, which filled up the internal node labeled Int 1. At this point, the writer would create Int 0 with one entry pointing to Int 1. Further leaf blocks (N+1 and N+2) would be written to a new internal node (Int 2). When the file is completed, Int 2 will spill, adding its entry into Int 0 as well.

Note that this strategy doesn't result in a fully balanced B-tree, but instead results in a 100% “fill factor” on all nodes in each level except for the last one written.

An index block is encoded similarly for both types of indexes:

<key> <block offset> <block size>
<key> <block offset> <block size>
...
   key: vint64 for positional, otherwise varint-length-prefixed string
   offset: vint64
   block size: vint32

<offset to first key>   (fixed32)
<offset to second key>  (fixed32)
...
   These offsets are relative to the start of the block.

<trailer>
   A IndexBlockTrailerPB protobuf
<trailer length>

Index blocks are written using a post-order traversal, and the index blocks may intersperse with the data blocks.

The trailer protobuf includes a field which designates whether the block is a leaf node or internal node of the B-Tree, allowing a reader to know whether the pointer is to another index block or to a data block.

DeltaFile

Every DeltaFile in a DiskRowSet is written to a CFile; in fact, a DeltaFile is just a wrapper around CFile which specializes it for REDO and UNDO delta data. The deltas associated with a row update are grouped into a RowChangeList and written to as a binary values to the underlying CFile. Each value is prefixed with a DeltaKey, which consists of the ordinal row index (within the DiskRowSet), and the timestamp. The CFile includes a value index so that the delta belonging to a specific row can be located efficiently.

BloomFile

BloomFile is a wrapper around CFile which stores a bloomfilter datastructure.