| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| # Big Trie-Indexed (BTI) SSTable format |
| |
| This document describes the BTI SSTable format, which is introduced to |
| Cassandra with [CEP-25](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A+Trie-indexed+SSTable+format). |
| |
| The format is called BTI, which stands for "Big Trie-Indexed", because it shares |
| the data format of the existing BIG format and only changes the primary indexes |
| inside SSTables. The format relies on byte order and tries, and uses a combination |
| of features to make the indexing structures compact and efficient. The paragraphs |
| below describe the format's features and mechanisms together with the motivation |
| behind them, and conclude with detailed description of the on-disk format. |
| |
| # Prerequisites |
| |
| ## Byte-comparable types |
| |
| The property of being byte-comparable (also called byte-ordered) for a |
| key denotes that there is a serialisation of that key to a sequence of |
| bytes where the lexicographic comparison of the unsigned bytes produces |
| the same result as performing a typed comparison of the key. |
| |
| For Cassandra, such a representation is given by |
| [CASSANDRA-6936](https://issues.apache.org/jira/browse/CASSANDRA-6936). |
| Detailed description of the mechanics of the translation are provided in |
| the [included documentation](../../../../utils/bytecomparable/ByteComparable.md). |
| |
| ## Tries |
| |
| A trie is a data structure that describes a mapping between sequences |
| and associated values. It is very similar to a deterministic finite |
| state automaton, with the main difference that an automaton is allowed |
| to have cycles, while a trie is not. |
| |
| Because the theory and main usage of the structure is for encoding words |
| of a language, the trie terminology talks about "characters", "words" |
| and "alphabet", which in our case map to bytes of the byte-ordered |
| representation, the sequence that encodes it, and the possible values of |
| a byte[^1]. |
| |
| [^1]: For simplicity this description assumes we directly map a byte to |
| a character. Other options are also possible (e.g. using hex digits |
| as the alphabet and two transitions per byte). |
| |
| A trie can be defined as a tree graph in which vertices are states, some |
| of which can be final and contain associated information, and where |
| edges are labelled with characters. A valid word in the trie is encoded |
| by a path starting from the root of the trie where each edge is labelled |
| with the next character of the word, and ending in a final state which |
| contains the 'payload' associated with the word. |
| |
| ```mermaid |
| graph TD |
| Node_(( )) |
| style Node_ fill:darkgrey |
| Node_ --"a"--> Node_a(((a))) |
| Node_a --"l"--> Node_al(( )) |
| Node_al --"l"--> Node_all(( )) |
| Node_all --"o"--> Node_allo(( )) |
| Node_allo --"w"--> Node_allow(((allow))) |
| Node_a --"n"--> Node_an(((an))) |
| Node_an --"d"--> Node_and(((and))) |
| Node_an --"y"--> Node_any(((any))) |
| Node_a --"r"--> Node_ar(( )) |
| Node_ar --"e"--> Node_are(((are))) |
| Node_a --"s"--> Node_as(((as))) |
| Node_ --"n"--> Node_n(( )) |
| Node_n --"o"--> Node_no(( )) |
| Node_no --"d"--> Node_nod(( )) |
| Node_nod --"e"--> Node_node(((node))) |
| Node_ --"o"--> Node_o(( )) |
| Node_o --"f"--> Node_of(((of))) |
| Node_o --"n"--> Node_on(((on))) |
| Node_ --"t"--> Node_t(( )) |
| Node_t --"h"--> Node_th(( )) |
| Node_th --"e"--> Node_the(((the))) |
| Node_th --"i"--> Node_thi(( )) |
| Node_thi --"s"--> Node_this(((this))) |
| Node_t --"o"--> Node_to(((to))) |
| Node_t --"r"--> Node_tr(( )) |
| Node_tr --"i"--> Node_tri(( )) |
| Node_tri --"e"--> Node_trie(((trie))) |
| Node_t --"y"--> Node_ty(( )) |
| Node_ty --"p"--> Node_typ(( )) |
| Node_typ --"e"--> Node_type(( )) |
| Node_type --"s"--> Node_types(((types))) |
| Node_ --"w"--> Node_w(( )) |
| Node_w --"i"--> Node_wi(( )) |
| Node_wi --"t"--> Node_wit(( )) |
| Node_wit --"h"--> Node_with(((with))) |
| Node_with --"o"--> Node_witho(( )) |
| Node_witho --"u"--> Node_withou(( )) |
| Node_withou --"t"--> Node_without(((without))) |
| ``` |
| |
| This means that in a constructed trie finding the payload associated |
| with a word is a matter of following the edges (also called |
| "transitions") from the initial state labelled with the consecutive |
| characters of the word, and retrieving the payload associated with the |
| state at which we end up. If that's not a final state, or if at any |
| point in this we did not find a transition in the trie matching the |
| character, the trie does not have an association for the word. The |
| complexity of lookup is thus _O_(len(word)) transitions, where the cost of |
| taking a transition is usually constant, thus this complexity is |
| theoretically optimal. |
| |
| From a storage space perspective, one of the main benefits of a trie as |
| a data structure for storing a map is the fact that it completely avoids |
| storing redundant prefixes. All words that start with the same sequence |
| store a representation of that sequence only once. If prefixes are |
| commonly shared, this can save a great deal of space. |
| |
| When the items stored in a trie are lexicographically (=byte) ordered, a |
| trie is also an ordered structure. A trie can be walked in order and it |
| is also possible to efficiently list the items between two given keys. |
| |
| In fact, one can efficiently (and lazily) apply set algebra over tries, |
| and slicing can be seen as a simple application of intersection, where |
| the intersecting trie is generated on the fly. The set operations |
| benefit from the same prefix-sharing effect — we apply union / |
| intersection / difference to a state, which has the effect of applying |
| the operation to all words that share the prefix denoted by that state. |
| |
| ```mermaid |
| graph TD |
| Node_(( )) |
| style Node_ fill:darkgrey |
| Node_ --"a"--> Node_a(((a))) |
| style Node_a stroke:lightgrey,color:lightgrey |
| linkStyle 0 stroke:lightgrey,color:lightgrey |
| Node_a --"l"--> Node_al(( )) |
| style Node_al stroke:lightgrey,color:lightgrey |
| linkStyle 1 stroke:lightgrey,color:lightgrey |
| Node_al --"l"--> Node_all(( )) |
| style Node_all stroke:lightgrey,color:lightgrey |
| linkStyle 2 stroke:lightgrey,color:lightgrey |
| Node_all --"o"--> Node_allo(( )) |
| style Node_allo stroke:lightgrey,color:lightgrey |
| linkStyle 3 stroke:lightgrey,color:lightgrey |
| Node_allo --"w"--> Node_allow(((allow))) |
| style Node_allow stroke:lightgrey,color:lightgrey |
| linkStyle 4 stroke:lightgrey,color:lightgrey |
| Node_a --"n"--> Node_an(((an))) |
| style Node_an stroke:lightgrey,color:lightgrey |
| linkStyle 5 stroke:lightgrey,color:lightgrey |
| Node_an --"d"--> Node_and(((and))) |
| style Node_and stroke:lightgrey,color:lightgrey |
| linkStyle 6 stroke:lightgrey,color:lightgrey |
| Node_an --"y"--> Node_any(((any))) |
| style Node_any stroke:lightgrey,color:lightgrey |
| linkStyle 7 stroke:lightgrey,color:lightgrey |
| Node_a --"r"--> Node_ar(( )) |
| style Node_ar stroke:lightgrey,color:lightgrey |
| linkStyle 8 stroke:lightgrey,color:lightgrey |
| Node_ar --"e"--> Node_are(((are))) |
| style Node_are stroke:lightgrey,color:lightgrey |
| linkStyle 9 stroke:lightgrey,color:lightgrey |
| Node_a --"s"--> Node_as(((as))) |
| style Node_as stroke:lightgrey,color:lightgrey |
| linkStyle 10 stroke:lightgrey,color:lightgrey |
| Node_ --"n"--> Node_n(( )) |
| Node_n --"o"--> Node_no(( )) |
| Node_no --"d"--> Node_nod(( )) |
| Node_nod --"e"--> Node_node(((node))) |
| Node_ --"o"--> Node_o(( )) |
| Node_o --"f"--> Node_of(((of))) |
| Node_o --"n"--> Node_on(((on))) |
| Node_ --"t"--> Node_t(( )) |
| Node_t --"h"--> Node_th(( )) |
| Node_th --"e"--> Node_the(((the))) |
| Node_th --"i"--> Node_thi(( )) |
| Node_thi --"s"--> Node_this(((this))) |
| style Node_this stroke:lightgrey,color:lightgrey |
| linkStyle 22 stroke:lightgrey,color:lightgrey |
| Node_t --"o"--> Node_to(((to))) |
| style Node_to stroke:lightgrey,color:lightgrey |
| linkStyle 23 stroke:lightgrey,color:lightgrey |
| Node_t --"r"--> Node_tr(( )) |
| style Node_tr stroke:lightgrey,color:lightgrey |
| linkStyle 24 stroke:lightgrey,color:lightgrey |
| Node_tr --"i"--> Node_tri(( )) |
| style Node_tri stroke:lightgrey,color:lightgrey |
| linkStyle 25 stroke:lightgrey,color:lightgrey |
| Node_tri --"e"--> Node_trie(((trie))) |
| style Node_trie stroke:lightgrey,color:lightgrey |
| linkStyle 26 stroke:lightgrey,color:lightgrey |
| Node_t --"y"--> Node_ty(( )) |
| style Node_ty stroke:lightgrey,color:lightgrey |
| linkStyle 27 stroke:lightgrey,color:lightgrey |
| Node_ty --"p"--> Node_typ(( )) |
| style Node_typ stroke:lightgrey,color:lightgrey |
| linkStyle 28 stroke:lightgrey,color:lightgrey |
| Node_typ --"e"--> Node_type(( )) |
| style Node_type stroke:lightgrey,color:lightgrey |
| linkStyle 29 stroke:lightgrey,color:lightgrey |
| Node_type --"s"--> Node_types(((types))) |
| style Node_types stroke:lightgrey,color:lightgrey |
| linkStyle 30 stroke:lightgrey,color:lightgrey |
| Node_ --"w"--> Node_w(( )) |
| style Node_w stroke:lightgrey,color:lightgrey |
| linkStyle 31 stroke:lightgrey,color:lightgrey |
| Node_w --"i"--> Node_wi(( )) |
| style Node_wi stroke:lightgrey,color:lightgrey |
| linkStyle 32 stroke:lightgrey,color:lightgrey |
| Node_wi --"t"--> Node_wit(( )) |
| style Node_wit stroke:lightgrey,color:lightgrey |
| linkStyle 33 stroke:lightgrey,color:lightgrey |
| Node_wit --"h"--> Node_with(((with))) |
| style Node_with stroke:lightgrey,color:lightgrey |
| linkStyle 34 stroke:lightgrey,color:lightgrey |
| Node_with --"o"--> Node_witho(( )) |
| style Node_witho stroke:lightgrey,color:lightgrey |
| linkStyle 35 stroke:lightgrey,color:lightgrey |
| Node_witho --"u"--> Node_withou(( )) |
| style Node_withou stroke:lightgrey,color:lightgrey |
| linkStyle 36 stroke:lightgrey,color:lightgrey |
| Node_withou --"t"--> Node_without(((without))) |
| style Node_without stroke:lightgrey,color:lightgrey |
| linkStyle 37 stroke:lightgrey,color:lightgrey |
| |
| ``` |
| |
| (An example of slicing the trie above with the range "bit"-"thing". |
| Processing only applies on boundary nodes (root, "t", "th", "thi"), |
| where we throw away the transitions outside the range. Subtries like the |
| ones for "n" and "o" fall completely between "b" and "t" thus are fully |
| inside the range and can be processed without any further restrictions.) |
| |
| A trie can be used as a modifiable in-memory data structure where one |
| can add and remove individual elements. It can also be constructed from |
| sorted input, incrementally storing the data directly to disk and |
| building an efficient read-only on-disk data structure. |
| |
| For more formal information about the concept and applications of tries |
| and finite state automata, try [Introduction to Automata Theory, |
| Languages, and Computation](https://books.google.com/books?id=dVipBwAAQBAJ). |
| There are many variations of the concept, and of the implementation of |
| states and transitions that can be put to use to achieve even further |
| efficiency gains; some of these will be detailed below. |
| |
| # Indexing with tries |
| |
| Since a trie is generally an ordered byte source to payload map, we can |
| apply the concept directly to the components of Cassandra that are most |
| affected by the inefficiency of using comparison-based structures: the |
| indices. |
| |
| This can be done in the following way: |
| |
| - When we write the index, we map each key into its byte-ordered |
| representation and create an on-disk trie of byte-ordered |
| representations of keys mapping into positions in the data file. |
| |
| - When we need an exact match for a key, we create a (lazily |
| generated) byte-ordered representation of the key and look for it |
| in the trie. |
| |
| - If we find a match, we know the data file position. |
| |
| - If there is no match, there is no data associated with the key. |
| |
| - When we need a greater-than/greater-or-equal match, we use the |
| byte-ordered representation to create a path that leads to the |
| first matching data position in the sstable. |
| |
| - We can then use this path to iterate the greater keys in the |
| sstable. |
| |
| This works, but isn't very efficient. Lookup in it is _O_(len(key)), |
| which can even mean that many seeks on disk, and we have to store |
| a transition (which defines the size of the structure) for every |
| non-prefix character in the dataset. |
| |
| We can do much better. |
| |
| ## Trimming the fat |
| |
| The primary purpose of the index is to find a position in the data file |
| for the given key. It needs to be able to find the correct position for |
| any existing key, but there is no need for it to be exact on keys that |
| are not present in the file — since our data files contain a copy of |
| the key at the start of each partition, we can simply check if the key |
| we are searching for matches the key at the position returned by the |
| index. |
| |
| This allows us to use a simple optimization: instead of storing the full |
| key in the index trie, we can store only a prefix of the key that is |
| unique among all partitions in the table. This means that we have |
| intermediate nodes in the trie only if a prefix is shared by multiple |
| keys, which normally reduces the number of nodes and transitions in the |
| trie to about 2*n*. |
| |
| ```mermaid |
| graph TD |
| Node_(( )) |
| style Node_ fill:darkgrey |
| Node_ --"a"--> Node_a((( ))) |
| Node_a --"l"--> Node_al((( ))) |
| Node_a --"n"--> Node_an((( ))) |
| Node_an --"d"--> Node_and((( ))) |
| Node_an --"y"--> Node_any((( ))) |
| Node_a --"r"--> Node_ar((( ))) |
| Node_a --"s"--> Node_as((( ))) |
| Node_ --"n"--> Node_n((( ))) |
| Node_ --"o"--> Node_o(( )) |
| Node_o --"f"--> Node_of((( ))) |
| Node_o --"n"--> Node_on((( ))) |
| Node_ --"t"--> Node_t(( )) |
| Node_t --"h"--> Node_th(( )) |
| Node_th --"e"--> Node_the((( ))) |
| Node_th --"i"--> Node_thi((( ))) |
| Node_t --"o"--> Node_to((( ))) |
| Node_t --"r"--> Node_tr((( ))) |
| Node_t --"y"--> Node_ty((( ))) |
| Node_ --"w"--> Node_w(( )) |
| Node_w --"ith"--> Node_with((( ))) |
| Node_with --"o"--> Node_without((( ))) |
| ``` |
| |
| This also reduces the number of steps we need to take in the trie. In a |
| well-balanced key set (such as the one where the byte-ordered key starts |
| with a hash as in Murmur or Random-partitioned primary keys) the lookup |
| complexity becomes _O_(log _n_) transitions[^2]. |
| |
| [^2]: For comparison, the complexity of binary search in a sorted |
| primary index is also _O_(log _n_), but in key comparisons whose |
| complexity on average in a well-balanced key set is another _O_(log _n_) |
| for a total _O_(log<sup>2</sup> _n_). |
| |
| ## Taking hardware into account |
| |
| The point above improves the number of transitions significantly, but |
| the out-of-cache efficiency is still pretty bad if we have to read a new |
| disk page every time we examine a node. Fortunately we can take some |
| extra care during construction to make sure we make the most of every |
| disk page brought up during lookup. |
| |
| The idea of this is to pack wide sections of the trie in pages, so that |
| every time we open a page we can be certain to be able to follow several |
| transitions before leaving that page. |
| |
| ```mermaid |
| graph TD |
| subgraph p1 [ ] |
| Node_(( )) |
| style Node_ fill:darkgrey |
| Node_ --"a"--> Node_a((( ))) |
| Node_ --"t"--> Node_t(( )) |
| end |
| |
| subgraph p2 [ ] |
| Node_a --"l"--> Node_al((( ))) |
| Node_a --"n"--> Node_an((( ))) |
| Node_an --"d"--> Node_and((( ))) |
| Node_an --"y"--> Node_any((( ))) |
| end |
| |
| subgraph p3 [ ] |
| Node_a --"r"--> Node_ar((( ))) |
| Node_a --"s"--> Node_as((( ))) |
| Node_ --"n"--> Node_n((( ))) |
| end |
| |
| subgraph p4 [ ] |
| Node_ --"o"--> Node_o(( )) |
| Node_o --"f"--> Node_of((( ))) |
| Node_o --"n"--> Node_on((( ))) |
| Node_t --"o"--> Node_to((( ))) |
| end |
| |
| subgraph p5 [ ] |
| Node_t --"h"--> Node_th(( )) |
| Node_th --"e"--> Node_the((( ))) |
| Node_th --"i"--> Node_thi((( ))) |
| Node_t --"r"--> Node_tr((( ))) |
| end |
| |
| subgraph p6 [ ] |
| Node_t --"y"--> Node_ty((( ))) |
| Node_ --"w"--> Node_w(( )) |
| Node_w --"ith"--> Node_with((( ))) |
| Node_with --"o"--> Node_without((( ))) |
| end |
| |
| p2 ~~~ p3 ~~~ p4 ~~~ p5 ~~~ p6 |
| ``` |
| |
| One way to generate something like this is to start from the root and do |
| a breadth-first walk, placing the encountered nodes on disk until a page |
| is filled and their target transitions in a queue for which the process |
| is repeated to fill other pages. |
| |
| Another approach, more suitable to our application because it can be |
| done as part of the incremental construction process, is to do the |
| packing from the bottom up — when the incremental construction |
| algorithm completes a node we do not immediately write it, but wait |
| until we have formed a branch that is bigger than a page. When this |
| happens we lay out the node's children (each smaller than a page but |
| root of a biggest branch that would fit) and let the parent node be |
| treated like a leaf from there on. In turn it will become part of a |
| branch that is bigger than a page and will be laid packaged together |
| with its related nodes, resulting in a picture similar to the above. |
| |
| In fact the bottom-up process has a little performance benefit over the |
| top-down: with the top-down construction the root page is full and leaf |
| pages take combinations of unrelated smaller branches; with the |
| bottom-up the leaf pages take as much information as possible about a |
| branch, while the root often remains unfilled. For the best possible |
| out-of-cache efficiency we would prefer the set of non-leaf pages to be |
| as small as possible. Having larger leaf page branches means more of the |
| trie data is in the leaf branches and thus the size of that intermediate |
| node set is smaller. |
| |
| See [`IncrementalTrieWriterPageAware`](../../../tries/IncrementalDeepTrieWriterPageAware.java) |
| for details on how the page-aware |
| trie construction is implemented. |
| |
| ## Storing the trie |
| |
| Another interesting question about the format of the trie is how one |
| stores the information about the transitions in a node. If we want to |
| maintain that the size of the structure is proportional to the number of |
| overall transitions, we need to be able to store node transitions |
| sparsely. Typically this is done using a list of transition characters |
| and binary searching among them to make a transition. |
| |
| This binary search can theoretically be taken to use constant time |
| (since the alphabet size is small and predefined), but isn't the most |
| efficient operation in practice due to the unpredictable branch |
| instructions necessary for its implementation. It is preferable to avoid |
| it as much as possible. |
| |
| To do this, and to shave a few additional bytes in common cases, our |
| implementation of on-disk tries uses typed nodes. A node can be: |
| |
| - Final with no transitions (`PAYLOAD_ONLY`). |
| |
| - Having one transition (`SINGLE`), which has to store only the |
| character and target for that transition. |
| |
| - Having a binary-searched list of transitions (`SPARSE`), where the |
| number of characters, each character and the targets are stored. |
| |
| - Having a consecutive range of transitions (`DENSE`), where the first |
| and last character and targets are stored, possibly including some |
| null transitions. |
| |
| We use one byte per node to store four bits of node type as well as four |
| bits of payload information. |
| |
| In a well-balanced and populated trie the nodes where lookup spends most |
| time (the nodes closest to the root) are `DENSE` nodes, where finding the |
| target for the transition is a direct calculation from the code of the |
| character. On the other hand, most of the nodes (the ones closest to the |
| leaves) are `PAYLOAD_ONLY`, `SINGLE` or `SPARSE` to avoid taking any more |
| space than necessary. |
| |
| The main objective for the trie storage format is to achieve the |
| smallest possible packing (and thus smallest cache usage and fewest disk |
| reads), thus we choose the type that results in the smallest |
| representation of the node. `DENSE` type gets chosen naturally when its |
| encoding (which avoids storing the character list but may include null |
| targets) is smaller than `SPARSE`. |
| |
| ## Pointer Sizes |
| |
| The next optimization we make in the storage format is based on the fact |
| that most nodes in the trie are in the lower levels of the tree and thus |
| close to leaves. As such, the distance between the node and its target |
| transitions when laid out during the construction process is small and |
| thus it is a huge win to store pointers as distances with variable size. |
| |
| This is even more true for the page-aware layout we use — all internal |
| transitions within the page (i.e. >99% of all transitions in the trie!) |
| can be stored using just an offset within the page, using just 12 bits. |
| |
| This is heavily used via further specialization of the node types: e.g. |
| we have `DENSE_12`, `DENSE_16` to `DENSE_40` as well as `DENSE_LONG` |
| subtypes which differ in the size of pointer they use. |
| |
| # Primary indexing in the BTI format |
| |
| The purpose of the primary index of an sstable is to be able to map a |
| key containing partition and clustering components to a position in the |
| sstable data file which holds the relevant row or the closest row with a |
| greater key and enables iteration of rows from that point on. |
| |
| Partition keys are normally fully specified, while clustering keys are |
| often given partially or via a comparison relation. They are also |
| treated differently by all the infrastructure and have historically had |
| different index structures; we chose to retain this distinction for the |
| time being and implement similar replacement structures using tries. |
| |
| ## Partition index implementation details |
| |
| The primary purpose of the partition index is to map a specified |
| partition key to a row index for the partition. It also needs to support |
| iteration from a (possibly partially specified) partition position. The |
| description below details mapping only; iteration is a trivial |
| application of the trie machinery to the described structure. |
| |
| In addition to wide partitions where a row index is mandatory, Cassandra |
| is often used for tables where the partitions have only a |
| couple of rows, including also ones where the partition key is the only |
| component of the primary key, i.e. where row and partition are the same |
| thing. For these situations it makes no sense to actually have a row |
| index and the partition index should point directly to the data. |
| |
| The application of tries to Cassandra's partition index uses the trie |
| infrastructure described above to create a trie mapping unique |
| byte-ordered partition key prefixes to either: |
| |
| - A position in the row index file which contains the index of the |
| rows within that partition, or |
| |
| - A position in the data file containing the relevant partition (if a |
| row index for it is not necessary). |
| |
| A single table can have both indexed and non-indexed rows. For |
| efficiency the partition index stores the position as a single long, |
| using its sign bit to differentiate between the two options[^3]. This |
| value is stored with variable length — more precisely, we use the four |
| bits provided in the node type byte to store the length of the pointer. |
| |
| [^3]: It needs to differentiate between 0 with index and 0 without |
| index, however, so we use ~pos instead of -pos to encode |
| direct-to-data mappings. This still allows sign expansion |
| instructions to be used to convert e.g. `int` to `long`. |
| |
| Lookup in this index is accomplished by converting the decorated |
| partition key to its byte-ordered representation and following the |
| transitions for its bytes while the trie has any. If at any point the |
| trie does not offer a transition for the next byte but is not a leaf |
| node, the sstable does not contain a mapping for the given key. |
| |
| If a leaf of the trie is reached, then the prefix of the partition key |
| matches some content in the file, but we are not yet sure if it is a |
| full match for the partition key. The leaf node points to a place in the |
| row index or data file. In either case the first bytes at the specified |
| position contain a serialization of the partition key, which we can |
| compare to the key being mapped. If it matches, we have found the |
| partition. If not, since the stored prefixes are unique, no data for |
| this partition exists in this sstable. |
| |
| ### Efficiency |
| |
| If everything is in cache this lookup is extremely efficient: it follows |
| a few transitions in `DENSE` nodes plus one or two binary searches in |
| `SPARSE` or `SINGLE`, and finishes with a direct comparison of a byte buffer |
| with contents of a file. No object allocation or deserialization is |
| necessary. |
| |
| If not all data is in cache, the performance of this lookup most heavily |
| depends on the number of pages that must be fetched from persistent |
| storage. The expectation on which this implementation is based, is that |
| if an sstable is in use all non-leaf pages of the index will tend to |
| remain cached. If that expectation is met, lookup will only require |
| fetching one leaf index page and one data/row index page for the full |
| key comparison. On a match the latter fetch will be required anyway, |
| since we would want to read the data at that position. |
| |
| An important consideration in the design of this feature was to make |
| sure there is no situation in which the trie indices perform worse than |
| the earlier code, thus we should aim to do at most as many reads. The |
| number of random accesses for the earlier index implementation where an |
| index summary is forced in memory is one _seek_ required to start |
| reading from the partition index (though usually multiple consecutive |
| pages need to be read), and one seek needed to start reading the actual |
| data. Since the index summary ends up being of similar size to the |
| non-leaf pages of the trie index, the memory usage and number of seeks |
| for the trie index on match ends up being the same but we read less data |
| and do much less processing. |
| |
| On mismatch, though, we may be making one additional seek. However, we |
| can drastically reduce the chance of mismatch, which we currently do in |
| two ways: |
| |
| - By using a bloom filter before lookup. The chance of getting a bloom |
| filter hit as well as a prefix match for the wrong key is pretty |
| low and gets lower with increasing sstable size. |
| |
| - By storing some of the key hash bits that are not part of the token |
| at the payload node and comparing them with the mapped key's hash |
| bits. |
| |
| Currently we use a combination of both by default as the best performing |
| option. The user can disable or choose to have a smaller bloom filter, |
| and the code also supports indices that do not contain hash bits (though |
| to reduce configuration complexity we do not have plans to expose that |
| option). |
| |
| For fully cold sstables we have to perform more random fetches from disk |
| than the earlier implementation, but we read less. Testing showed that |
| having a bloom filter is enough to make the trie index faster; if a |
| bloom filter is not present, we try going through the byte contents of |
| the index file on boot to prefetch it which ends up taking not too long |
| (since it is read sequentially rather than randomly) and boosting cold |
| performance dramatically. |
| |
| ### Building and early open |
| |
| The partition index is built using the page-aware incremental |
| construction described earlier, where we also delay writing each key |
| until we have seen the next so that we can find the shortest prefix that |
| is enough to differentiate it from the previous and next keys (this also |
| differentiates it from all others in the sstable because the contents |
| are sorted). Only that prefix is written to the trie. |
| |
| One last complication is the support for early opening of sstables which |
| allows newly-compacted tables to gradually occupy the page cache. Though |
| the index building is incremental, the partially-written trie is not |
| usable directly because the root of the trie as well as the path from it |
| to the last written nodes is not yet present in the file. |
| |
| This problem can be easily overcome, though, by dumping these |
| intermediate nodes to an in-memory buffer (without the need for |
| page-aware packing) and forming an index by attaching this buffer at the |
| end of the partially written file using |
| [`TailOverridingRebufferer`](../../../util/TailOverridingRebufferer.java). |
| |
| ## Row index implementation details |
| |
| Unlike the partition index, the main use of the row index is to iterate |
| from a given clustering key in forward or reverse direction (where exact |
| key lookup is just a special case). |
| |
| Rows are often very small (they could contain a single int or no columns |
| at all) and thus there is a real possibility for the row indices to |
| become bigger than the data they represent. This is not a desirable |
| outcome, which is part of the reason why Cassandra's row index has |
| historically operated on blocks of rows rather than indexing every row |
| in the partition. This is a concern we also have with the trie-based |
| index, thus we also index blocks of rows (by default, a block of rows |
| that is at least 16kb in size — this will be called the index |
| _granularity_ below, specified by the `column_index_size` |
| `cassandra.yaml` parameter). |
| |
| Our row index implementation thus creates a map from clustering keys or |
| prefixes to the data position at the start of the index block which is |
| the earliest that could contain a key equal or greater than the given |
| one. Additionally, if there is an active deletion at the beginning of |
| the block, the index must specify it so that it can be taken into |
| account when merging data from multiple sstables. |
| |
| Each index block will contain at least one key, but generally it will |
| have different first and last keys. We don't store these keys, but |
| instead we index the positions between blocks by storing a "separator", |
| some key that is greater than the last key of the previous block and |
| smaller than or equal to the first key of the next[^4]. Then, when we |
| look up a given clustering, we follow its bytes as long as we can in the |
| trie and we can be certain that all blocks before the closest |
| less-than-or-equal entry in the trie cannot contain any data that is |
| greater than or equal to the given key. |
| |
| [^4]: Another way to interpret this is that we index the start of each |
| block only, but for efficiency we don't use the first key of the |
| block as its beginning, but instead something closer to the last key |
| of the previous block (but still greater than it). |
| |
| It may happen that the identified block actually doesn't contain any |
| matching data (e.g. because the looked-up key ends up between the last |
| key in the block and the separator), but this only affects efficiency as |
| the iteration mechanism does not expect the data position returned by |
| the index to be guaranteed to start with elements that fit the criteria; |
| it would only have to walk a whole block forward to find the matching |
| key. |
| |
| It is important to keep the number of these false positives low, and at |
| the same time we aim for the smallest possible size of the index for a |
| given granularity. The choice of separator affects this balance[^5]; the |
| option we use, as a good tradeoff in the vein of the unique prefix |
| approach used in the partition index, is to use the shortest prefix of |
| the next block's beginning key that separates it from the previous |
| block's end key, adjusted so that the last byte of it is 1 greater than |
| that end key. |
| |
| [^5]: For example, the best separator for false positives is the next |
| possible byte sequence after the previous block's final key, which |
| is obtained by adding a 00 byte to its end. This, however, means all |
| the bytes of the byte-ordered representation of this key must be |
| present in the index, which inflates the index's size and lookup |
| complexity. |
| |
| For example, if block 2 covers "something" to "somewhere" and block 3 |
| — "sorry" to "tease", then the sequence "son" is used as the separator |
| between blocks 2 and 3. This leaves things like "sommelier" in the area |
| that triggers false positives, but stores and has to walk just three |
| bytes to find the starting point for iteration. |
| |
| ### Efficiency |
| |
| Finding the candidate block in the trie involves walking the byte |
| ordered representation of the clustering key in the trie and finding the |
| closest less-than-or-equal value. The number of steps is proportional to |
| the length of the separators — the lower their number the shorter that |
| sequence is, though we can't expect _O_(log _n_) complexity since there may |
| be many items sharing the same long prefixes (e.g. if there are long |
| strings in the components of the clustering keys before the last). Even |
| so, such repeating prefixes are addressed very well by the page-packing |
| and `SINGLE_NOPAYLOAD_4` node type, resulting in very efficient walks. |
| |
| After this step we also perform a linear walk within the data file to |
| find the actual start of the matching data. This is usually costlier and |
| may involve object allocation and deserialization. |
| |
| The tradeoff between the size of the index and the time it takes to find |
| the relevant rows is controlled by the index granularity. The lower it |
| is, the more efficient lookup (especially exact match lookup) becomes at |
| the expense of bigger index size. The 16kb default is chosen pretty |
| conservatively[^6]; if users don't mind bigger indices something like 4, |
| 2 or 1kb granularity should be quite a bit more efficient. It is also |
| possible to index every row by choosing a granularity of 0kb; at these |
| settings in-cache trie-indexed sstables tend to outperform |
| `ConcurrentSkipListMap` memtables for reads. |
| |
| [^6]: This was chosen with the aim to match the size of the trie index |
| compared to the earlier version of the row index at its default |
| granularity of 64kb. |
| |
| ### Reverse lookup |
| |
| To perform a reverse lookup, we can use the same mechanism as above |
| (with greater-than-or-equal) to find the initial block for the |
| iteration. However, in the forward direction we could simply walk the |
| data file to find the next rows, but this isn't possible going |
| backwards. |
| |
| To solve this problem the index helps the iteration machinery by |
| providing an iterator of index blocks in reverse order. For each index |
| block the iteration walks it forward and creates a stack of all its row |
| positions, then starts issuing rows by popping and examining rows from |
| that stack. When the stack is exhausted it requests the previous block |
| from the index and applies the same procedure there. |
| |
| # Code structure |
| |
| The implementation is mostly in two packages, `o.a.c.io.tries` contains |
| the generic code to construct and read on-disk tries, and |
| `o.a.c.io.sstable.format.bti`, which implements the specifics of the |
| format and the two indexes. |
| |
| ## Building tries |
| |
| Tries are built from sorted keys using an [`IncrementalTrieWriter`](../../../tries/IncrementalTrieWriter.java). |
| The code contains three implementations with increasing complexity: |
| - [`IncrementalTrieWriterSimple`](../../../tries/IncrementalTrieWriterSimple.java) |
| implements simple incremental construction of tries from sorted input, |
| - [`IncrementalTrieWriterPageAware`](../../../tries/IncrementalTrieWriterPageAware.java) |
| adds packing of nodes to disk pages, |
| - [`IncrementalDeepTrieWriterPageAware`](../../../tries/IncrementalDeepTrieWriterPageAware.java) |
| adds the ability to transition to on-heap recursion for all stages of the construction |
| process to be able to handle very large keys. |
| |
| Only the latter is used, but we provide (and test) the other two as a form of |
| documentation. |
| |
| The builders take a `TrieSerializer` as parameter, which determines how the nodes |
| are written. The indexes implement this using `TrieNode`, writing any payload they |
| need immediately after the node serialization. |
| |
| ## Reading tries |
| |
| The BTI format tries are used directly in their on-disk format. To achieve this, |
| all node types are implemented as static objects in `TrieNode`. Reading nodes in |
| a file is encapsulated in [`Walker`](../../../tries/Walker.java), |
| which provides a method to `go` to a specific node and use it, i.e. |
| get any associated data, search in the children list and |
| follow transitions to children. It also provides functionality to find the |
| mapping for a given key, floors and ceilings as well as some combinations. |
| Iterating the payloads between two key bounds is implemented by |
| [`ValueIterator`](../../../tries/ValueIterator.java), |
| and [`ReverseValueIterator`](../../../tries/ReverseValueIterator.java). |
| |
| Special care is given to prefixes to make sure the semantics of searches matches |
| what the format needs. |
| |
| ## SSTable format implementation |
| |
| The two indexes are implemented, respectively, by [`PartitionIndex`](PartitionIndex.java) |
| /[`PartitionIndexBuilder`](PartitionIndexBuilder.java) |
| and [`RowIndexReader`](RowIndexReader.java)/[`RowIndexWriter`](RowIndexWriter.java). |
| The format implementation extends the filtered |
| base class and follows the structure of the BIG implementation, where |
| all references to the primary index are replaced with calls to these two |
| classes. |
| |
| # Index file format in BTI |
| |
| ## Trie nodes |
| Implemented in [`TrieNode.java`](../../../tries/TrieNode.java) |
| |
| Nodes start with four bits of node type, followed by 4 payload bits |
| (_pb_), which are 0 if the node has no associated payload; otherwise the |
| node type gives an option to compute the starting position for the |
| payload (_ppos_) from the starting position of the node (_npos_). |
| The layout of the node depends on its type. |
| |
| `PAYLOAD_ONLY` nodes: |
| |
| - 4 type bits, 0 |
| |
| - 4 payload bits |
| |
| - payload if _pb_ ≠ 0, _ppos_ is _npos_ + 1 |
| |
| `SINGLE_NOPAYLOAD_4` and `SINGLE_NOPAYLOAD_12` nodes: |
| |
| - 4 type bits |
| |
| - 4 pointer bits |
| |
| - 8 pointer bits (for `SINGLE_NOPAYLOAD_12`) |
| |
| - 8 bits transition byte |
| |
| - _pb_ is assumed 0 |
| |
| `SINGLE_8/16`: |
| |
| - 4 type bits |
| |
| - 4 payload bits |
| |
| - 8 bits transition byte |
| |
| - 8/16 pointer bits |
| |
| - payload if _pb_ ≠ 0, _ppos_ is _npos_ + 3/4 |
| |
| `SPARSE_8/12/16/24/40`: |
| |
| - 4 type bits |
| |
| - 4 payload bits |
| |
| - 8 bit child count |
| |
| - 8 bits per child, the transition bytes |
| |
| - 8/12/16/24/40 bits per child, the pointers |
| |
| - payload if _pb_ ≠ 0, _ppos_ is _npos_ + 2 + (2/2.5/3/4/6)*(_child |
| count_) (rounded up) |
| |
| `DENSE_12/16/24/32/40/LONG`: |
| |
| - 4 type bits |
| |
| - 4 payload bits |
| |
| - 8 bit start byte value |
| |
| - 8 bit _length_-1 |
| |
| - _length_ * 12/16/24/32/40/64 bits per child, the pointers |
| |
| - payload if _pb_ ≠ 0, _ppos_ is _npos_ + 3 + (1.5/2/3/4/5/8)*(_length_) |
| (rounded up) |
| |
| This is the space taken by each node type (_CS_ stands for child span, |
| i.e. largest - smallest + 1, _CC_ is child count): |
| |
| |Type | Size in bytes excl. payload |Size for 1 child|Size for 9 dense children (01-08, 10)|Size for 10 sparse children (01 + i*10)|Why the type is needed | |
| |:--------------------|:----------------------------|---------------:|-------------------:|-------------------:|:-------------------------------| |
| |`PAYLOAD_ONLY` | 1 | - | - | - | Leaves dominate the trie | |
| |`SINGLE_NOPAYLOAD_4` | 2 |2 | - | - | Single-transition chains | |
| |`SINGLE_8` | 3 |3 | - | - | Payload within chain | |
| |`SPARSE_8` | 2 + _CC_ * 2 |4 | 20 | 22 | Most common type after leaves | |
| |`SINGLE_NOPAYLOAD_12`| 3 |3 | - | - | 12 bits cover all in-page transitions | |
| |`SPARSE_12` | 2 + _CC_ * 2.5 |5 | 25 | 27 | Size of sparse is proportional to number of children | |
| |`DENSE_12` | 3 + _CS_ * 1.5 |5 | 18 | 140 | Lookup in dense is faster, size smaller if few holes | |
| |`SINGLE_16` | 4 |4 | - | - | | |
| |`SPARSE_16` | 2 + _CC_ * 3 |5 | 29 | 32 | | |
| |`DENSE_16` | 3 + _CS_ * 2 |5 | 23 | 185 | | |
| |`SPARSE_24` | 2 + _CC_ * 4 |6 | 38 | 42 | | |
| |`DENSE_24` | 3 + _CS_ * 3 |6 | 33 | 276 | | |
| |`DENSE_32` | 3 + _CS_ * 4 |7 | 43 | 367 | Nodes with big subtrees are usually dense | |
| |`SPARSE_40` | 2 + _CC_ * 6 |8 | 56 | 62 | | |
| |`DENSE_40` | 3 + _CS_ * 5 |8 | 53 | 458 | | |
| |`DENSE_LONG` | 3 + _CS_ * 8 |11 | 83 | 731 | Catch-all | |
| |
| All pointers are stored as distances, and since all tries are written |
| from the bottom up (and hence a child is always before the parent in the |
| file), the distance is subtracted from the position of the current node |
| to obtain the position of the child node. |
| |
| Note: All nodes are placed in such a way that they do not cross a page |
| boundary. I.e. if a reader (e.g. [`Walker`](../../../tries/Walker.java)) |
| is positioned at a node, it is |
| guaranteed that all reads of the node's data can complete without |
| requiring a different page to be fetched from disk. |
| |
| ## Partition index |
| Implemented in [`PartitionIndex.java`](PartitionIndex.java) |
| |
| Layout: |
| ``` |
| [nodes page, 4096 bytes] |
| ... |
| [nodes page, 4096 bytes] |
| [nodes page including root node, ≤4096 bytes] |
| [smallest key, with short length] |
| [largest key, with short length] |
| [smallest key pos, long] |
| [key count, long] |
| [root pos, long] |
| ``` |
| |
| The SSTable's partition index is stored in the -Partitions.db file. The |
| file itself is written from the bottom up, and its "header" is at the |
| end of the file. |
| |
| More precisely, the last 3 longs in the file contain: |
| |
| - A file position where the smallest and greatest key are written. |
| |
| - The exact number of keys in the file. |
| |
| - A file position for the root node of the index. |
| |
| These three longs are preceded by the serialization of the first and |
| last key, and before that are the trie contents. |
| |
| To find a match for the key, start at the root position, decode the node |
| (see the "Trie nodes" section above) and follow the transitions |
| according to the bytes of the byte-ordered representation of the key |
| while the node has children and there are bytes left in the key. |
| |
| If a leaf node is reached, that node contains the following payload: |
| |
| - If _pb_ < 8, let |
| |
| - _idxpos_ be the sign-extended integer value of length _pb_ at |
| _ppos_ |
| |
| - If _pb_ ≥ 8 (always the case in Cassandra 5 files), let |
| |
| - _hash_ be the byte at _ppos_ |
| |
| - _idxpos_ be the sign-extended integer value of length _pb-7_ at |
| _ppos+1_ |
| |
| If at any step there is no transition corresponding to the byte of the |
| key, or if _hash_ is present and the lowest-order byte of the key's hash |
| value does not match it, the index and sstable have no mapping for the |
| given key. |
| |
| Otherwise _idxpos_ specifies: |
| |
| - if _idxpos_ ≥ 0, the row index file contains an index for the |
| given key (see row index section below) at position _idxpos_ |
| |
| - otherwise, the data associated with the key starts at position |
| ~_idxpos_ in the data file |
| |
| In either case the content in the respective file starts with the |
| serialization of the partition key, which must be compared with the key |
| requested to ensure they match. |
| |
| ## Row index |
| Implemented in [`RowIndexReader.java`](RowIndexReader.java) |
| |
| Layout: |
| ``` |
| [row index, padded to page boundary] |
| ... |
| [row index, padded to page boundary] |
| ``` |
| Where each row index contains: |
| ``` |
| [nodes page, 4096 bytes] |
| ... |
| [nodes page, 4096 bytes] |
| [nodes page including root node, ≤4096 bytes] |
| [partition key, with short length] |
| [position of the partition in the data file, unsigned vint] |
| [position of the root node, vint encoding the difference between the |
| start of the data file position and the position of the root node] |
| [number of rows in the partition, unsigned vint] |
| [partition deletion time, 12 bytes] |
| ``` |
| |
| The entries in the partition index point to the position at the start of |
| the partition key. |
| |
| The payload reachable at the end of a traversal contains: |
| |
| - Integer of _pb_&7 bytes specifying the offset within the partition |
| where the relevant row is contained |
| |
| - If _pb_ ≥ 8, 12 bytes of deletion time active at the start of the |
| row index block |
| |