CFile is an on-disk columnar storage format which holds data and associated B-Tree indexes. Within a DiskRowSet each column and DeltaFile will map to a single CFile. In addition, the DiskRowSet's bloomfilter will be stored in a CFile, and if the table has a compound primary key, then the associated ad-hoc index will be stored in a separate CFile.
Despite the name, a CFile does not necessarily correspond one-to-one with a file in the filesystem. The mapping of CFiles to the filesystem is determined by the BlockManager abstraction. A CFile is written to a single BlockManager Block (not to be confused with CFile blocks, which are internal to CFiles, and discussed for the remainder of this document).
A CFile is composed of a header, a variable number of blocks, and a footer. There are a few different types of blocks, each with a different format: data blocks, nullable data blocks, index blocks, and dictionary blocks.
Data blocks consist of a sequence of consecutive values. Each value is assigned an ordinal index, or offset, which is unique within the CFile. Thus, a block can be identified within a CFile by the range of ordinal indexes it contains; for instance, the first three blocks of a CFile might have the ordinal index ranges
[56, 132], and
In order to support efficient seeks to an arbitrary ordinal index within the CFile, a positional index may be included which contains a mapping of starting ordinal index to data block. See CFile Index for more details.
For CFiles which contain data in sorted order (for example, a CFile containing the Primary Key column), a value index may be created. The value index supports efficiently seeking to an arbitrary value in the CFile. See CFile Index for more details.
The CFile header, blocks, and footer are written consecutively to the file without padding.
|length||4||32-bit unsigned length of the following Protobuf message|
|checksum||4||optional CRC-32 checksum of magic, length, and message|
|checksum||4||optional CRC-32 checksum of message, magic, and length|
|length||4||unsigned 32-bit length of the preceding Protobuf message|
The CFile header and footer each contain a ‘magic’ string which indicates the CFile's version.
CFile v2 was introduced in Kudu 1.3 in order to make a breaking change in the CFile format. At the same time, two sets of feature flags were introduced to the footer in order to make more granular forwards compatibility possible in the future. See commit
f82cf6918c for details.
|data||variable||encoded data values|
|checksum||4||optional CRC-32 checksum of the data|
If the CFile is configured to use compression then the encoded data is compressed before the checksum is appended.
Columns marked as nullable in the schema are stored in a nullable block, which includes a bitmap which tracks which rows (ordinal offsets) are null and not null.
|value count||variable||LEB128 encoded count of values|
|bitmap length||variable||LEB128 encoded length of the following null bitmap|
|null bitmap||variable||RLE encoded bitmap|
|data||variable||encoded non-null data values|
|checksum||4||optional CRC-32 checksum of the value count, bitmap length, null bitmap, and data|
If the CFile is configured to use compression then the value count, bitmap length, null bitmap, and encoded data are compressed before the checksum is appended.
Checksums can be optionally written and verified.
When checksums for the header, data, and footer are written in the CFile, the
incompatible_features bitset in the CFileFooterPB message is used. A “checksum” bit is set to ensure the reader knows if checksums exist.
When reading a CFile the footer should be read first to find if the file contains checksums. If the
incompatible_features bitset indicates checksums exist, the reader can optionally validate them against the read data.
Block data is encoded before being stored. If compression is enabled, then the block is encoded first, and then compressed.
Data blocks are best-effort size limited to
--cfile_default_block_size bytes, at which point a new block will be added to the CFile.
The simplest encoding type is plain encoding, which stores values in their natural representation.
Plain encoding for fixed-width (integer) types consists of the little-endian values, followed by a trailer containing two little-endian
uint32s: the count of values, and the ordinal position of the block.
Plain encoding for BINARY or STRING (variable-width) columns is laid out as follows:
|ordinal position||little-endian |
|num values||little-endian |
|offsets position||little-endian |
|values||sequential values with no padding|
|offsets||group-varint frame-of-reference encoded value offsets|
Dictionary encoding may be used for BINARY or STRING columns. All dictionary encoded blocks in a CFile share the same dictionary. If the dictionary becomes full, subsequent blocks in the CFile switch to plain encoding. The dictionary is stored as a plain encoded binary block, and the data codewords are stored as bitshuffle encoded
Currently used for
STRING data blocks. This is based on the encoding used by LevelDB for its data blocks, more or less.
Starts with a header of four Group Varint encoded
|number of elements|
Followed by prefix-compressed values. Each value is stored relative to the value preceding it using the following format:
Periodically there will be a “restart point” which is necessary for faster binary searching. At a “restart point”,
shared_bytes is 0 but otherwise the encoding is the same.
At the end of the block is a trailer with the offsets of the restart points:
The restart points are offsets relative to the start of the block, including the header.
Integer and bool types may be run-length encoded, which has a simply layout: the number of values and ordinal position of the block as little-endian
uint32s, followed by the run-length encoded data. The encoder will automatically fall back to a bit-packed (literal) encoding if the data is not conducive to run-length encoding.
Integer types may be bitshuffle encoded. Bitshuffle-encoded blocks are automatically LZ4 compressed, so additional compression is not recommended.
Used internally for
UINT32 blocks. Kudu does not support unsigned integer columns, so this encoding is not used for column data.
Starts with a header of four group-varint encoded
|number of elements|
The ordinal position is the ordinal position of the first item in the file. For example, the first data block in the file has ordinal position 0. If that block had 400 values, then the second data block would have ordinal position 400.
Followed by the actual data, each set of 4 integers using group-varint. The last group is padded with 0s. Each integer is relative to the min element in the header.
CFiles may optionally include a positional (ordinal) index and value index. Positional indexes are used to satisfy queries such as: “seek to the data block containing the Nth entry in this CFile”. Value indexes are used to satisfy queries such as: “seek to the data block containing
123 in this CFile”. Value indexes are only present in CFiles which contain data stored in sorted order (e.g. the primary key column).
Ordinal and value indices are organized as a B-Tree of index blocks. As data blocks are written, entries are appended to the end of a leaf index block. When a leaf index block reaches the configured index block size, it is added to another index block higher up the tree, and a new leaf is started. If the intermediate index block fills, it will start a new intermediate block and spill into an even higher-layer internal block.
[Int 0] ----------------------------- | | [Int 1] [Int 2] ----------------- -------------- | | | | | [Leaf 0] ... [Leaf N] [Leaf N+1] [Leaf N+2]
In this case, we wrote N leaf blocks, which filled up the internal node labeled Int 1. At this point, the writer would create Int 0 with one entry pointing to Int 1. Further leaf blocks (N+1 and N+2) would be written to a new internal node (Int 2). When the file is completed, Int 2 will spill, adding its entry into Int 0 as well.
Note that this strategy doesn't result in a fully balanced B-tree, but instead results in a 100% “fill factor” on all nodes in each level except for the last one written.
An index block is encoded similarly for both types of indexes:
<key> <block offset> <block size> <key> <block offset> <block size> ... key: vint64 for positional, otherwise varint-length-prefixed string offset: vint64 block size: vint32 <offset to first key> (fixed32) <offset to second key> (fixed32) ... These offsets are relative to the start of the block. <trailer> A IndexBlockTrailerPB protobuf <trailer length>
Index blocks are written using a post-order traversal, and the index blocks may intersperse with the data blocks.
The trailer protobuf includes a field which designates whether the block is a leaf node or internal node of the B-Tree, allowing a reader to know whether the pointer is to another index block or to a data block.
Every DeltaFile in a DiskRowSet is written to a CFile; in fact, a DeltaFile is just a wrapper around CFile which specializes it for REDO and UNDO delta data. The deltas associated with a row update are grouped into a RowChangeList and written to as a binary values to the underlying CFile. Each value is prefixed with a DeltaKey, which consists of the ordinal row index (within the DiskRowSet), and the timestamp. The CFile includes a value index so that the delta belonging to a specific row can be located efficiently.
BloomFile is a wrapper around CFile which stores a bloomfilter datastructure.