| Overview of CASSANDRA-8099 changes |
| ================================== |
| |
| The goal of this document is to provide an overview of the main changes done in |
| CASSANDRA-8099 so it's easier to dive into the patch. This assumes knowledge of |
| the pre-existing code. |
| |
| CASSANDRA-8099 refactors the abstractions used by the storage engine and as
| such impacts most of the code of said engine. The changes can be thought of as
| following two main guidelines:
| 1. the new abstractions are much more iterator-based (i.e. the engine tries
|    harder to avoid materializing everything in memory),
| 2. and they are closer to the CQL representation of things (i.e. the storage |
| engine is aware of more structure, making it able to optimize accordingly). |
| Note that while those changes have a heavy impact on the actual code, the basic
| mechanisms of the read and write paths are largely unchanged. |
| |
| In the following, I'll start by describing the new abstractions introduced by
| the patch. I'll then provide a quick reference mapping each existing class to
| what it becomes in the patch, after which I'll discuss how the refactor handles
| a number of more specific points. Lastly, the patch introduces some changes to
| the on-wire and on-disk formats, so I'll discuss those quickly.
| |
| |
| Main new abstractions |
| --------------------- |
| |
| ### Atom: Row and RangeTombstoneMarker |
| |
| Where the existing storage engine mainly handles cells, the new engine groups
| cells into rows, and rows become the central building block. A `Row` is
| identified by the value of its clustering columns, which are stored in a
| `Clustering` object (see below), and it associates a number of cells with each
| of its non-PK non-static columns (we'll discuss static columns more
| specifically later).
| |
| The patch distinguishes 2 kinds of columns: simple and complex ones. The
| _simple_ columns can have only 1 cell associated with them (or none), while the
| _complex_ ones can have an arbitrary number of associated cells. Currently, the
| complex columns are only the non-frozen collections (but we'll have non-frozen
| UDTs at some point, and who knows what in the future).
| |
| Like before, we also have to deal with range tombstones. However, instead of
| dealing with full range tombstones, we generally deal with
| `RangeTombstoneMarker`, which is just one of the bounds of a range tombstone
| (so that a range tombstone is composed of 2 "markers" in practice, its start
| and its end). I'll discuss the reasoning for this a bit more later. A
| `RangeTombstoneMarker` is identified by a `Slice.Bound` (which is to RT markers
| what the `Clustering` is to `Row`) and simply stores its deletion information.
| |
| The engine thus mainly works with rows and range tombstone markers, and they
| are both grouped under the common `Atom` interface. An "atom" is thus just
| that: either a row or a range tombstone marker.
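|
| To make the relationship concrete, here is a minimal, illustrative sketch of
| how the `Atom` interface ties rows and markers together. The signatures (and
| the `Kind` enum) are simplifications made up for this document, not the exact
| ones from the patch; `ColumnDefinition` and `DeletionTime` stand in for the
| column metadata and deletion info classes:
|
| ```java
| import java.util.Iterator;
|
| // Illustrative only: simplified, hypothetical signatures.
| public interface Atom extends Clusterable
| {
|     enum Kind { ROW, RANGE_TOMBSTONE_MARKER }
|     Kind kind();
| }
|
| // A Row groups the cells of a single CQL row, identified by its Clustering.
| interface Row extends Atom
| {
|     Clustering clustering();
|     Cell getCell(ColumnDefinition simpleColumn);              // at most one cell per simple column
|     Iterator<Cell> getCells(ColumnDefinition complexColumn);  // any number of cells for complex columns
| }
|
| // A RangeTombstoneMarker is one bound of a range tombstone plus its deletion time.
| interface RangeTombstoneMarker extends Atom
| {
|     Slice.Bound clustering();
|     DeletionTime deletionTime();
| }
| ```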
| |
| > Side Note: the "Atom" naming is pretty bad. I've reused it mainly because it |
| > plays a similar role to the existing OnDiskAtom, but it's arguably crappy now |
| > because a row is definitely not "indivisible". Anyway, renaming suggestions
| > are more than welcome. The only alternatives I've come up with so far are
| > "Block" or "Element", but I'm not entirely convinced by either.
| |
| ### ClusteringPrefix, Clustering, Slice.Bound and ClusteringComparator |
| |
| Atoms are sorted (within a partition). They are ordered by their
| `ClusteringPrefix`, which is mainly a common interface for the `Clustering` of
| a `Row` and the `Slice.Bound` of a `RangeTombstoneMarker`. More generally, a
| `ClusteringPrefix` is a prefix of the clustering values for the clustering
| columns of the table involved, with a `Clustering` being the special case where
| all values are provided. A `Slice.Bound` can be a true prefix however, having
| only some of the clustering values. Further, a `Slice.Bound` can be either a
| start or end bound, and it can be inclusive or exclusive. Sorting makes sure
| that a start bound is before anything it "selects" (and conversely for the
| end). A `Slice` is then just 2 `Slice.Bound`s, a start and an end, and selects
| anything that sorts between those.
| |
| `ClusteringPrefix`es are compared through the table's `ClusteringComparator`,
| which is like our existing table comparator except that it only includes
| comparators for the clustering column values. In particular, it includes
| neither the comparator for the column names themselves, nor the
| post-column-name comparator for collections (the latter being handled through
| the `CellPath`, see below). There is also a `Clusterable` interface that `Atom`
| implements and that simply marks objects that can be compared by a
| `ClusteringComparator`, i.e. objects that have a `ClusteringPrefix`.
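|
| To illustrate the ordering with a toy model (not the patch's code): clustering
| values are compared component by component, and when a bound is a strict
| prefix, its kind decides on which side of the rows it selects it sorts.
|
| ```java
| import java.util.List;
|
| // Toy model (int components only) of how prefixes are ordered: compare the common
| // components, then let the bound "kind" break ties so that a start bound sorts before
| // everything it selects and an end bound after. Illustration only, not the patch's code.
| final class ToyClusteringComparator
| {
|     enum Kind
|     {
|         INCL_START(-1), CLUSTERING(0), INCL_END(1);
|         final int weight;
|         Kind(int weight) { this.weight = weight; }
|     }
|
|     static int compare(List<Integer> a, Kind ka, List<Integer> b, Kind kb)
|     {
|         for (int i = 0; i < Math.min(a.size(), b.size()); i++)
|         {
|             int cmp = Integer.compare(a.get(i), b.get(i));
|             if (cmp != 0)
|                 return cmp;
|         }
|         return Integer.compare(ka.weight, kb.weight);  // tie-break on the bound kind
|     }
|
|     public static void main(String[] args)
|     {
|         // A start bound (1) sorts before the row (1, 5), which sorts before an end bound (1).
|         System.out.println(compare(List.of(1), Kind.INCL_START, List.of(1, 5), Kind.CLUSTERING) < 0); // true
|         System.out.println(compare(List.of(1, 5), Kind.CLUSTERING, List.of(1), Kind.INCL_END) < 0);   // true
|     }
| }
| ```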
| |
| ### Cell |
| |
| A cell holds the information for a single value. It corresponds to a (CQL)
| column, has a value, a timestamp, and an optional ttl and local deletion time.
| Further, as said above, complex columns (collections) will have multiple
| associated cells. Those cells are distinguished by their `CellPath`, which is
| compared through a comparator that depends on the column type. The cells of
| simple columns just have a `null` cell path.
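|
| For illustration, here is a hypothetical, simplified view of what a cell
| carries (the ttl/deletion-time representation is made up for readability, and
| `ColumnDefinition` stands in for the column metadata class):
|
| ```java
| import java.nio.ByteBuffer;
|
| // Hypothetical, simplified shape of a cell; not the patch's actual class.
| interface Cell
| {
|     ColumnDefinition column();  // the CQL column this cell belongs to
|     ByteBuffer value();
|     long timestamp();
|     int ttl();                  // optional: some "no ttl" marker value for non-expiring cells
|     int localDeletionTime();    // optional: only meaningful for tombstones and expiring cells
|     CellPath path();            // null for simple columns; e.g. the map key for a map column
| }
| ```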
| |
| ### AtomIterator and PartitionIterator |
| |
| As often as possible, atoms are manipulated through iterators, namely through
| `AtomIterator`, which is an iterator over the atoms of a single partition, and
| `PartitionIterator`, which is an iterator of `AtomIterator`s, i.e. an iterator
| over multiple partitions. In other words, a single partition query
| fundamentally returns an `AtomIterator`, while a range query returns a
| `PartitionIterator`. Those iterators are closeable and the code has to make
| sure to always close them, as they will often have resources to clean up (like
| an OpOrder.Group to close or files to release).
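|
| For illustration, consuming code typically ends up shaped like the following
| (schematic; `executeLocally(command)` is just a placeholder for however the
| iterator was obtained):
|
| ```java
| // Schematic consumption of a range query result. Both iterator levels must be closed,
| // hence the try-with-resources blocks; the method names here are placeholders.
| try (PartitionIterator partitions = executeLocally(command))
| {
|     while (partitions.hasNext())
|     {
|         try (AtomIterator partition = partitions.next())
|         {
|             while (partition.hasNext())
|             {
|                 Atom atom = partition.next();
|                 // ... handle the row or range tombstone marker ...
|             }
|         }
|     }
| }
| ```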
| |
| The read path mainly consists of getting atom and partition iterators from
| sstables and memtables and merging, filtering and transforming them. There are
| a number of functions to do just that (merging, filtering and transforming) in
| `AtomIterators` and `PartitionIterators`, but there are also a number of
| classes (`RowFilteringAtomIterator`, `CountingAtomIterator`, ...) that wrap one
| of those iterator types to apply some filtering/transformation.
| |
| `AtomIterator` and `PartitionIterator` also have their doppelgängers
| `RowIterator` and `DataIterator`, which exist for the sake of making it easier
| for the upper layers (StorageProxy and above) to deal with deletions. We'll
| discuss those later.
| |
| ### Partition: PartitionUpdate, AtomicBTreePartition and CachedPartition |
| |
| While we avoid materializing partitions in memory as much as possible (favoring
| the iterative approach), there are cases where we need/want to hold some part
| of a partition in memory, and we have a generic `Partition` interface for
| those. A `Partition` basically corresponds to a materialized `AtomIterator` and
| is thus somewhat equivalent to the existing `ColumnFamily` (but again, many
| existing usages of `ColumnFamily` now simply use an `AtomIterator`). A
| simplified sketch of the interface follows the list below. `Partition` is
| mainly used through the following implementations:
| * `PartitionUpdate`: this is what a `Mutation` holds and is used to gather
|   updates and apply them to memtables.
| * `AtomicBTreePartition`: this is the direct counterpart of
|   `AtomicBTreeColumns`, the difference being that the BTree holds rows instead
|   of cells. On updates, we merge the existing and new rows together to create
|   new ones.
| * `CachedPartition`: this is used by the row cache. |
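|
| As announced above, here is a simplified, hypothetical sketch of the
| `Partition` interface (the actual methods differ):
|
| ```java
| // Hypothetical, simplified view of Partition: an in-memory (sub)partition from which
| // an AtomIterator can be obtained back.
| interface Partition
| {
|     DecoratedKey partitionKey();
|     DeletionTime partitionLevelDeletion();
|     Row staticRow();                                     // the partition's static row (possibly empty)
|     AtomIterator atomIterator(PartitionFilter filter);   // iterate over what the filter selects
| }
| ```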
| |
| ### Read commands |
| |
| The `ReadCommand` class still exists, but instead of being just for single
| partition reads, it's now a common abstract class for both single partition and
| range reads. It now has 2 subclasses: `SinglePartitionReadCommand` and
| `PartitionRangeReadCommand`, the former of which has 2 subclasses itself:
| `SinglePartitionSliceCommand` and `SinglePartitionNamesCommand`. All
| `ReadCommand`s have a `ColumnFilter` and a `DataLimits` (see below).
| `PartitionRangeReadCommand` additionally has a `DataRange`, which is mostly
| just a range of partition keys with a `PartitionFilter`, while
| `SinglePartitionReadCommand` has a partition key and a `PartitionFilter` (see
| below too).
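|
| Schematically, the hierarchy described above looks as follows (the fields and
| methods are illustrative, not the actual ones):
|
| ```java
| // Illustrative skeleton of the ReadCommand hierarchy; simplified and hypothetical.
| abstract class ReadCommand
| {
|     ColumnFilter columnFilter;  // restrictions needing a 2ndary index or filtering
|     DataLimits limits;          // how results are counted and limited
|
|     abstract PartitionIterator executeLocally(ColumnFamilyStore cfs);
| }
|
| class PartitionRangeReadCommand extends ReadCommand
| {
|     DataRange dataRange;        // a range of partition keys plus a PartitionFilter
|
|     PartitionIterator executeLocally(ColumnFamilyStore cfs) { /* scan the range */ return null; }
| }
|
| abstract class SinglePartitionReadCommand extends ReadCommand
| {
|     DecoratedKey partitionKey;
|     PartitionFilter partitionFilter;
|     // concrete subclasses: SinglePartitionSliceCommand and SinglePartitionNamesCommand
| }
| ```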
| |
| The code to execute those queries locally, which used to be in
| `ColumnFamilyStore` and `CollationController`, is now in those `ReadCommand`
| classes. For instance, the `CollationController` code for names queries is in
| `SinglePartitionNamesCommand`, and the code to decide if we use a 2ndary index
| or not is directly in `ReadCommand.executeLocally()`.
| |
| Note that because they share a common class, all `ReadCommand`s actually return
| a `PartitionIterator` (an iterator over partitions), even single partition ones
| (that iterator will just return one or zero results). This actually allows us
| to generalize (and simplify) the "response resolvers": instead of having
| separate resolvers for range and single partition queries, we only have
| `DataResolver` and `DigestResolver`, which work for any read command. This does
| mean that the patch fixes CASSANDRA-2986, and that we could use digest queries
| for range queries if we wanted to (not necessarily saying it's a good idea).
| |
| ### ColumnFilter |
| |
| `ColumnFilter` is the new `List<IndexExpression>`. It holds those column
| restrictions that can't be directly fulfilled by the `PartitionFilter`, i.e.
| those that require either a 2ndary index or filtering.
| |
| ### PartitionFilter |
| |
| `PartitionFilter` is the new `IDiskAtomFilter`/`QueryFilter`. There are still 2
| variants: `SlicePartitionFilter` and `NamesPartitionFilter`. Both variants
| include the actual columns that are queried (as we don't return full CQL rows
| anymore), and both can be reversed. A names filter queries a bunch of rows by
| name, i.e. has a set of `Clustering`s. A slice filter queries one or more
| slices of rows. A slice filter does not however include a limit, since that is
| dealt with by `DataLimits`, which is in `ReadCommand` directly.
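|
| A rough sketch of the two variants (simplified and hypothetical, with method
| bodies elided):
|
| ```java
| import java.util.SortedSet;
|
| // Simplified, hypothetical shape of the two PartitionFilter variants described above.
| abstract class PartitionFilter
| {
|     Columns queriedColumns;  // the columns actually selected (we don't return full CQL rows anymore)
|     boolean reversed;        // both variants can be reversed
|
|     // Uses the sliceable iterator to pull exactly the requested rows from an sstable or memtable.
|     abstract AtomIterator filter(SliceableAtomIterator iter);
| }
|
| class NamesPartitionFilter extends PartitionFilter
| {
|     SortedSet<Clustering> names;  // the specific rows wanted, identified by their clustering
|
|     AtomIterator filter(SliceableAtomIterator iter) { /* seek to each name in turn */ return null; }
| }
|
| class SlicePartitionFilter extends PartitionFilter
| {
|     Slices slices;                // one or more slices, always kept in clustering order
|
|     AtomIterator filter(SliceableAtomIterator iter) { /* seek to each slice start in turn */ return null; }
| }
| ```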
| |
| |
| ### DataLimits |
| |
| `DataLimits` implements the limits on a query. This is meant to abstract the
| differences between how we count for thrift and for CQL. Further, for CQL, this
| allows for a limit per partition, which cleans up how DISTINCT queries are
| handled and allows for CASSANDRA-7017 (the patch doesn't add support for the
| PER PARTITION LIMIT syntax of that ticket, but handles it internally
| otherwise).
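|
| A toy illustration of the counting idea (not the patch's code): results are
| counted against both a global limit and a per-partition limit, the latter
| being what DISTINCT (limit of 1 per partition) and CASSANDRA-7017 rely on.
|
| ```java
| // Toy version of the counting logic; the real DataLimits also abstracts thrift vs CQL
| // counting. This only illustrates the global + per-partition limit idea.
| final class ToyLimits
| {
|     final int rowLimit;
|     final int perPartitionLimit;
|     int counted;
|     int countedInCurrentPartition;
|
|     ToyLimits(int rowLimit, int perPartitionLimit)
|     {
|         this.rowLimit = rowLimit;
|         this.perPartitionLimit = perPartitionLimit;
|     }
|
|     void newPartition() { countedInCurrentPartition = 0; }
|
|     // Returns true if the row should be returned, false if a limit has been reached.
|     boolean countRow()
|     {
|         if (counted >= rowLimit || countedInCurrentPartition >= perPartitionLimit)
|             return false;
|         counted++;
|         countedInCurrentPartition++;
|         return true;
|     }
|
|     boolean isDone() { return counted >= rowLimit; }
| }
| ```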
| |
| ### SliceableAtomIterator |
| |
| The code also uses the `SliceableAtomIterator` abstraction. A
| `SliceableAtomIterator` is an `AtomIterator` into which we basically know how
| to seek efficiently. In particular, we have an `SSTableIterator` which is a
| `SliceableAtomIterator`. That `SSTableIterator` replaces both the existing
| `SSTableNamesIterator` and `SSTableSliceIterator`, and the respective
| `PartitionFilter`s use that `SliceableAtomIterator` interface to query exactly
| what they want.
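|
| Conceptually, the extra capability looks like this (hypothetical signature):
|
| ```java
| import java.util.Iterator;
|
| // Hypothetical sketch: what SliceableAtomIterator adds on top of AtomIterator is the
| // ability to jump to a given slice (e.g. using the sstable index) instead of scanning
| // through everything that precedes it.
| interface SliceableAtomIterator extends AtomIterator
| {
|     Iterator<Atom> slice(Slice slice);  // the atoms selected by that slice
| }
| ```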
| |
| |
| What did that become again? |
| --------------------------- |
| |
| For quick reference, here's the rough correspondence of old classes to new classes: |
| * `ColumnFamily`: for writes, this is handled by `PartitionUpdate` and |
| `AtomicBTreePartition`. For reads, this is replaced by `AtomIterator` (or |
| `RowIterator`, see the parts on tombstones below). For the row cache, there is |
| a specific `CachedPartition` (which is actually an interface, the implementing |
|   class being `ArrayBackedPartition`).
| * `Cell`: there is still a `Cell` class, which is roughly the same thing as the
|   old one, except that instead of having a cell name, cells are now in a row
|   and correspond to a column.
| * `QueryFilter`: doesn't really exist anymore. What it was holding is now in
|   `SinglePartitionReadCommand`.
| * `AbstractRangeCommand` is now `PartitionRangeReadCommand`. |
| * `IDiskAtomFilter` is now `PartitionFilter`. |
| * `List<IndexExpression>` is now `ColumnFilter`. |
| * `RowDataResolver`, `RowDigestResolver` and `RangeSliceResponseResolver` are |
| now `DataResolver` and `DigestResolver`. |
| * `Row` is now `AtomIterator`. |
| * `List<Row>` is now `PartitionIterator` (or `DataIterator`, see the part about |
| tombstones below). |
| * `AbstractCompactedRow` and `LazilyCompactedRow` are not really needed
|   anymore. Their corresponding code is in `CompactionIterable`.
| |
| |
| Noteworthy points |
| ----------------- |
| |
| ### Dealing with tombstones and shadowed cells |
| |
| There are a few aspects worth noting regarding the handling of deleted and
| shadowed data:
| 1. it's part of the contract of an `AtomIterator` that it must not shadow its
|    own data. In other words, it should not return a cell that is deleted by one
|    of its own range tombstones, or by its partition-level deletion. In practice
|    this means that we get rid of shadowed data quickly (a good thing) and that
|    there is a limited number of places that have to deal with shadowing
|    (merging being one).
| 2. Upper layers of the code (anything above StorageProxy) don't care about
|    deleted data (and deletion information in general). Basically, as soon as
|    we've merged results from the replicas, we don't need tombstones anymore.
|    So, instead of requiring all those upper layers to filter tombstones
|    themselves (which is error prone), we get rid of them as soon as we can,
|    i.e. as soon as we've resolved replica responses (so in the
|    `ResponseResolver`). To do that, and to make it clear when tombstones have
|    been filtered (and make the code cleaner), we transform an `AtomIterator`
|    into a `RowIterator`. Both are essentially the same thing, except that a
|    `RowIterator` only returns live data. This means in particular that it's an
|    iterator of `Row` (since an atom is either a row or a range tombstone marker
|    and we've filtered tombstones). Similarly, a `PartitionIterator` becomes a
|    `DataIterator`, which is just an iterator of `RowIterator`s (see the sketch
|    after this list).
| 3. In the existing code a CQL row deletion involves a range tombstone. But as
|    row deletions are pretty frequent and range tombstones have inherent
|    inefficiencies, the patch adds the notion of row deletion, which is just
|    some optional deletion information on the `Row`. This can be thought of as
|    just an optimization of range tombstones that span only a single row.
| 4. As mentioned at the beginning of this document, the code splits range
|    tombstones into 2 range tombstone markers, one for each bound. The problem
|    with storing full range tombstones as we currently do is that it makes it
|    harder to merge them efficiently, and in particular to "normalize"
|    overlapping ones. In practice, a given `AtomIterator` guarantees that there
|    is only one range tombstone to worry about at any given time. In other
|    words, at any point of the iteration, either there is a single open range
|    tombstone or there is none, which makes things easier and more efficient.
|    It's worth noting that we still have the `RangeTombstone` class because for
|    in-memory structures (`PartitionUpdate`, `AtomicBTreePartition`, ...) we
|    currently use the existing `RangeTombstoneList` out of convenience. This is
|    kind of an implementation detail that could change in the future.
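|
| As a conceptual sketch of point 2 above (hypothetical helper, with a made-up
| `purge` method), turning atoms into client-facing rows amounts to dropping
| markers, purging dead cells and dropping rows with nothing live left:
|
| ```java
| import java.util.function.Consumer;
|
| // Conceptual sketch only, not the patch's code. `purge` is a hypothetical method that
| // removes dead cells and deletion info, returning null if nothing live remains.
| final class TombstoneFiltering
| {
|     static void filterForClient(AtomIterator atoms, int nowInSec, Consumer<Row> out)
|     {
|         while (atoms.hasNext())
|         {
|             Atom atom = atoms.next();
|             if (!(atom instanceof Row))
|                 continue;                  // markers were only needed while merging replica responses
|             Row purged = ((Row) atom).purge(nowInSec);
|             if (purged != null)
|                 out.accept(purged);        // only live rows reach the upper layers
|         }
|     }
| }
| ```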
| |
| ### Statics
| |
| Static columns are handled separately from the rest of the rows. In practice,
| each `Partition`/`AtomIterator` has a separate "static row" that holds the
| values for all the static columns (that row can of course be empty). That row
| doesn't really correspond to any CQL row; its content is simply "merged" into
| the result set in `SelectStatement` (very much like we already do).
| |
| ### Row liveness |
| |
| In CQL, a row can exist even if only its PK columns have values. In other
| words, a `Row` can be live even if it doesn't have any cells. Currently, we
| handle this through the "row marker" (i.e. through a specific cell). The patch
| makes this slightly less hacky by adding a timestamp (and potentially a ttl) to
| each row, which can be thought of as the timestamp (and ttl) of the PK columns.
| |
| Further, when we query only some specific columns in a row, we need to add a
| row (with nulls) to the result set if the row is live (it has live cells or the
| timestamp described above) even if it has no actual values for the queried
| columns. We currently deal with that by querying the entire row all the time,
| but the patch changes that. This does mean that even when we query only some
| columns, we still need the information on whether the row itself is live or
| not. And because we'll merge results from different sources (which can include
| deletions), a boolean is not enough, so we include in a `Row` object the
| maximum live timestamp known for this row. In practice, this currently means
| that we do scan the full row on disk but filter out cells we're not interested
| in right away (and in fact, we don't even deserialize the value of such cells).
| This might be optimizable later on, but expiring data makes that harder (we
| typically can't just pre-compute that max live timestamp when writing the
| sstable since it can depend on the time of the query).
| |
| ### Flyweight pattern |
| |
| The patch makes relatively heavy use of a "flyweight"-like pattern. Typically,
| `Row` and `Cell` data are stored in arrays, and a given `AtomIterator` will
| only use a single `Row` object (and `Cell` object) that points into those
| arrays (and thus changes at each call to `hasNext()/next()`). This does mean
| that the objects returned that way shouldn't be aliased (have a reference kept
| to them) without care. The patch uses an `Aliasable` interface to mark objects
| that may use this pattern and should thus potentially be copied in the rare
| cases where a reference should be kept to them.
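|
| To make the aliasing concern concrete, here is a toy demonstration of the
| pattern (not the patch's classes): the iterator returns the same mutable
| cursor object on every call, so anything that wants to keep a value past the
| next `hasNext()/next()` call must copy it first.
|
| ```java
| import java.util.Iterator;
|
| // Toy "flyweight"-style iterator: a single Cursor instance is reused for every element,
| // so callers must copy() it if they want to hold on to a returned value.
| final class CursorIterator implements Iterator<CursorIterator.Cursor>
| {
|     static final class Cursor
|     {
|         long timestamp;  // overwritten in place on every next()
|         Cursor copy() { Cursor c = new Cursor(); c.timestamp = timestamp; return c; }
|     }
|
|     private final long[] timestamps;
|     private final Cursor cursor = new Cursor();  // the single, reused instance
|     private int i;
|
|     CursorIterator(long[] timestamps) { this.timestamps = timestamps; }
|
|     public boolean hasNext() { return i < timestamps.length; }
|
|     public Cursor next()
|     {
|         cursor.timestamp = timestamps[i++];  // mutate in place instead of allocating
|         return cursor;                       // aliasing this reference without copy() is unsafe
|     }
| }
| ```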
| |
| ### Reversed queries |
| |
| The patch slightly changes the way reversed queries are handled. First, while
| a reversed query currently has to reverse its slices in `SliceQueryFilter`,
| this is not the case anymore: `SlicePartitionFilter` always keeps its slices in
| clustering order and simply handles them from the end to the beginning on
| reversed queries. Further, if a query is reversed, the results are returned in
| this reversed order all the way through. This differs from the current code,
| where `ColumnFamily` actually always holds cells in forward order (making it a
| mess for paging and forcing us to re-reverse everything in the result set).
| |
| ### Compaction |
| |
| As the storage engine works iteratively, compaction simply has to get iterators
| over the sstables, merge them and write the result back, along with a simple
| filter that skips purgeable tombstones in the process. So there isn't really a
| need for `AbstractCompactedRow`/`LazilyCompactedRow` anymore, and the
| compaction/purging code is now in `CompactionIterable` directly.
| |
| ### Short reads |
| |
| The current way of handling short reads would require us to consume the whole
| result before deciding on a retry (and we currently retry the whole command),
| which doesn't work too well in an iterative world. So the patch moves short
| read protection into `DataResolver` (where it kind of belongs anyway) and we
| don't retry the full command anymore. Instead, if we realize that a given node
| has a short read while its result is consumed, we simply query that node (and
| that node only) for a continuation of the result. On top of avoiding the retry
| of the whole read (and limiting the number of nodes queried on the retry), this
| also makes it trivial to solve CASSANDRA-8933.
| |
| |
| Storage format (on-disk and on-wire) |
| ------------------------------------ |
| |
| Given that the main abstractions have changed, the existing on-wire
| `ColumnFamily` format is not appropriate anymore and the patch switches to a
| new format. The same can be said of the on-disk format, and while it is not an
| objective of CASSANDRA-8099 to get fancy with the on-disk format, using the
| on-wire format as the on-disk format was actually relatively simple (until more
| substantial changes land with CASSANDRA-7447) and the patch does that too.
| |
| For a given partition, the format simply serializes rows one after another
| (atoms in practice). For the on-disk format, this means that it is now rows
| that are indexed, not cells. The format uses a header that is written at the
| beginning of each partition for the on-wire format (in
| `AtomIteratorSerializer`) and is kept as a new sstable `Component` for
| sstables. The details of the format are described in the javadoc of
| `AtomIteratorSerializer` (only used by the on-wire format) and of
| `AtomSerializer`, so let me just point out the following differences compared
| to the current format:
| * Clustering values are only serialized once per row (we even skip serializing the |
| number of elements since that is fixed for a given table). |
| * Column names are not written for every row. Instead they are written once in
|   the header. For a given row, we support two small variants: dense and sparse.
|   When dense, cells come in the order the columns have in the header, meaning
|   that if a row doesn't have a particular column, this column will still use a
|   byte. When sparse, we don't write anything if the column doesn't have a cell,
|   but each cell has an additional 2 bytes which point into the header (so with
|   vint encoding it should rarely take more than 1 byte in practice). The
|   variant used is automatically decided based on stats on how many columns a
|   row has set on average for the source we serialize.
| * Fixed-width cell values are serialized without a size.
| * If a cell has the same timestamp as its row, that timestamp is not repeated
|   for the cell. Same for the ttl (if applicable).
| * Timestamps, ttls and local deletion times are delta encoded so that they are
|   ripe for vint encoding (a toy illustration of the idea follows below). The
|   current version of the patch does not yet activate vint encoding however
|   (neither on-wire nor on-disk).
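|
| As a toy illustration of the delta-encoding idea (not the actual serializer):
| once a reference timestamp is written, each cell only needs the (usually
| small, often zero) difference from it, which a vint would then encode in very
| few bytes.
|
| ```java
| // Toy sketch of delta encoding for timestamps; the real format picks its own reference
| // points and layout, this only shows why deltas are "ripe for vint encoding".
| final class DeltaTimestamps
| {
|     static long encode(long reference, long cellTimestamp)
|     {
|         return cellTimestamp - reference;  // small (often 0) delta, cheap once vint-encoded
|     }
|
|     static long decode(long reference, long delta)
|     {
|         return reference + delta;
|     }
|
|     public static void main(String[] args)
|     {
|         long rowTs = 1430000000000000L;    // e.g. the row's timestamp, written once
|         long cellTs = 1430000000000123L;
|         long delta = encode(rowTs, cellTs);
|         System.out.println(delta);                          // 123
|         System.out.println(decode(rowTs, delta) == cellTs); // true
|     }
| }
| ```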
| |