| = Dynamo |
| |
| Apache Cassandra relies on a number of techniques from Amazon's |
| http://courses.cse.tamu.edu/caverlee/csce438/readings/dynamo-paper.pdf[Dynamo] |
distributed key-value storage system. Each node in the Dynamo system has
| three main components: |
| |
| * Request coordination over a partitioned dataset |
| * Ring membership and failure detection |
| * A local persistence (storage) engine |
| |
| Cassandra primarily draws from the first two clustering components, |
| while using a storage engine based on a Log Structured Merge Tree |
| (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.44.2782&rep=rep1&type=pdf[LSM]). |
| In particular, Cassandra relies on Dynamo style: |
| |
| * Dataset partitioning using consistent hashing |
| * Multi-master replication using versioned data and tunable consistency |
| * Distributed cluster membership and failure detection via a gossip |
| protocol |
| * Incremental scale-out on commodity hardware |
| |
| Cassandra was designed this way to meet large-scale (PiB+) |
| business-critical storage requirements. In particular, as applications |
| demanded full global replication of petabyte scale datasets along with |
| always available low-latency reads and writes, it became imperative to |
design a new kind of database model, as the relational database systems
| of the time struggled to meet the new requirements of global scale |
| applications. |
| |
| == Dataset Partitioning: Consistent Hashing |
| |
| Cassandra achieves horizontal scalability by |
| https://en.wikipedia.org/wiki/Partition_(database)[partitioning] all |
| data stored in the system using a hash function. Each partition is |
| replicated to multiple physical nodes, often across failure domains such |
| as racks and even datacenters. As every replica can independently accept |
| mutations to every key that it owns, every key must be versioned. Unlike |
| in the original Dynamo paper where deterministic versions and vector |
| clocks were used to reconcile concurrent updates to a key, Cassandra |
| uses a simpler last write wins model where every mutation is timestamped |
| (including deletes) and then the latest version of data is the "winning" |
| value. Formally speaking, Cassandra uses a Last-Write-Wins Element-Set |
| conflict-free replicated data type for each CQL row, or |
https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#LWW-Element-Set_(Last-Write-Wins-Element-Set)[LWW-Element-Set
| CRDT], to resolve conflicting mutations on replica sets. |
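
To make the rule concrete, here is a minimal sketch of last-write-wins
resolution (not Cassandra's implementation; the `lww_merge` helper is
hypothetical). Deletes are modeled as timestamped tombstones, which is the
essence of the LWW-Element-Set approach:

[source,python]
----
# Minimal last-write-wins sketch: every mutation, including a delete,
# carries a timestamp, and the version with the highest timestamp wins.
def lww_merge(a: tuple, b: tuple) -> tuple:
    """a and b are (timestamp, value) pairs; a None value is a tombstone."""
    return a if a[0] >= b[0] else b

update = (1_690_000_000_001, "blue")  # write
delete = (1_690_000_000_002, None)    # later delete (tombstone)

print(lww_merge(update, delete))      # the later delete wins
----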
| |
| === Consistent Hashing using a Token Ring |
| |
| Cassandra partitions data over storage nodes using a special form of |
| hashing called |
| https://en.wikipedia.org/wiki/Consistent_hashing[consistent hashing]. In |
| naive data hashing, you typically allocate keys to buckets by taking a |
| hash of the key modulo the number of buckets. For example, if you want |
to distribute data to 100 nodes using naive hashing, you might assign
every node to a bucket between 0 and 99, hash the input key modulo 100,
| and store the data on the associated bucket. In this naive scheme, |
| however, adding a single node might invalidate almost all of the |
| mappings. |
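
A quick sketch (helper names are purely illustrative) shows just how much
data naive modulo hashing reshuffles when a single node is added:

[source,python]
----
import hashlib

def bucket(key: str, num_buckets: int) -> int:
    # Naive placement: hash the key, then take the hash modulo the bucket count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets

keys = [f"user-{i}" for i in range(10_000)]
before = {k: bucket(k, 100) for k in keys}
after = {k: bucket(k, 101) for k in keys}  # a single node was added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed buckets")  # roughly 99%
----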
| |
| Cassandra instead maps every node to one or more tokens on a continuous |
| hash ring, and defines ownership by hashing a key onto the ring and then |
| "walking" the ring in one direction, similar to the |
| https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf[Chord] |
algorithm. The main difference between consistent hashing and naive data
| hashing is that when the number of nodes (buckets) to hash into changes, |
| consistent hashing only has to move a small fraction of the keys. |
| |
| For example, if we have an eight node cluster with evenly spaced tokens, |
| and a replication factor (RF) of 3, then to find the owning nodes for a |
| key we first hash that key to generate a token (which is just the hash |
| of the key), and then we "walk" the ring in a clockwise fashion until we |
| encounter three distinct nodes, at which point we have found all the |
| replicas of that key. This example of an eight node cluster with |
RF=3 can be visualized as follows:
| |
| image::ring.svg[image] |
| |
| You can see that in a Dynamo like system, ranges of keys, also known as |
| *token ranges*, map to the same physical set of nodes. In this example, |
| all keys that fall in the token range excluding token 1 and including |
token 2 (range `(t1, t2]`) are stored on nodes 2, 3 and 4.
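
The walk can be sketched in a few lines. This toy ring (hypothetical node
names, MD5 standing in for the partitioner) mirrors the eight-node, RF=3
example above:

[source,python]
----
import bisect
import hashlib

# Eight evenly spaced tokens on a 64-bit ring, one per node.
RING = {i * 2**64 // 8: f"node-{i + 1}" for i in range(8)}
TOKENS = sorted(RING)

def key_token(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64

def replicas(key: str, rf: int = 3) -> list[str]:
    # Hash the key onto the ring, then walk clockwise until rf distinct
    # nodes have been encountered.
    start = bisect.bisect_left(TOKENS, key_token(key)) % len(TOKENS)
    found: list[str] = []
    for i in range(len(TOKENS)):
        node = RING[TOKENS[(start + i) % len(TOKENS)]]
        if node not in found:
            found.append(node)
        if len(found) == rf:
            break
    return found

print(replicas("alice"))  # three consecutive ring neighbors
----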
| |
| === Multiple Tokens per Physical Node (vnodes) |
| |
| Simple single token consistent hashing works well if you have many |
| physical nodes to spread data over, but with evenly spaced tokens and a |
| small number of physical nodes, incremental scaling (adding just a few |
| nodes of capacity) is difficult because there are no token selections |
| for new nodes that can leave the ring balanced. Cassandra seeks to avoid |
| token imbalance because uneven token ranges lead to uneven request load. |
| For example, in the previous example there is no way to add a ninth |
| token without causing imbalance; instead we would have to insert `8` |
| tokens in the midpoints of the existing ranges. |
| |
| The Dynamo paper advocates for the use of "virtual nodes" to solve this |
| imbalance problem. Virtual nodes solve the problem by assigning multiple |
| tokens in the token ring to each physical node. By allowing a single |
physical node to take multiple positions in the ring, we can make small
clusters look larger. Adding even a single physical node then behaves as
if many nodes had joined, with the new node taking many smaller pieces of
data from many more ring neighbors.
| |
| Cassandra introduces some nomenclature to handle these concepts: |
| |
* *Token*: A single position on the Dynamo-style hash
ring.
* *Endpoint*: A single physical IP and port on the network.
* *Host ID*: A unique identifier for a single "physical" node, usually
present at one Endpoint and containing one or more
Tokens.
* *Virtual Node* (or *vnode*): A Token on the hash ring
owned by the same physical node, one with the same Host
ID.
| |
| The mapping of *Tokens* to *Endpoints* gives rise to the *Token Map* |
| where Cassandra keeps track of what ring positions map to which physical |
| endpoints. For example, in the following figure we can represent an |
| eight node cluster using only four physical nodes by assigning two |
| tokens to every node: |
| |
| image::vnodes.svg[image] |
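
A token map of this kind is easy to picture in code; the node names and
random token choices below are purely illustrative:

[source,python]
----
import random

random.seed(7)
nodes = ["node-1", "node-2", "node-3", "node-4"]

# Token Map sketch: ring position -> physical endpoint, two tokens per node.
token_map = {random.randrange(2**64): node
             for node in nodes for _ in range(2)}

for token in sorted(token_map):
    print(f"{token:>20}  owned by {token_map[token]}")
----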
| |
| Multiple tokens per physical node provide the following benefits: |
| |
| [arabic] |
| . When a new node is added it accepts approximately equal amounts of |
| data from other nodes in the ring, resulting in equal distribution of |
| data across the cluster. |
| . When a node is decommissioned, it loses data roughly equally to other |
| members of the ring, again keeping equal distribution of data across the |
| cluster. |
. If a node becomes unavailable, query load (especially token-aware
query load) is evenly distributed across many other nodes.
| |
| Multiple tokens, however, can also have disadvantages: |
| |
| [arabic] |
| . Every token introduces up to `2 * (RF - 1)` additional neighbors on |
| the token ring, which means that there are more combinations of node |
| failures where we lose availability for a portion of the token ring. The |
| more tokens you have, |
| https://jolynch.github.io/pdf/cassandra-availability-virtual.pdf[the |
| higher the probability of an outage]. |
| . Cluster-wide maintenance operations are often slowed. For example, as |
| the number of tokens per node is increased, the number of discrete |
| repair operations the cluster must do also increases. |
| . Performance of operations that span token ranges could be affected. |
| |
| Note that in Cassandra `2.x`, the only token allocation algorithm |
| available was picking random tokens, which meant that to keep balance |
| the default number of tokens per node had to be quite high, at `256`. |
| This had the effect of coupling many physical endpoints together, |
increasing the risk of unavailability. That is why in `3.x+` the new
| deterministic token allocator was added which intelligently picks tokens |
| such that the ring is optimally balanced while requiring a much lower |
| number of tokens per physical node. |
| |
| == Multi-master Replication: Versioned Data and Tunable Consistency |
| |
| Cassandra replicates every partition of data to many nodes across the |
| cluster to maintain high availability and durability. When a mutation |
| occurs, the coordinator hashes the partition key to determine the token |
| range the data belongs to and then replicates the mutation to the |
| replicas of that data according to the |
| `Replication Strategy`. |
| |
| All replication strategies have the notion of a *replication factor* |
| (`RF`), which indicates to Cassandra how many copies of the partition |
should exist. For example, with an `RF=3` keyspace, the data will be
| written to three distinct *replicas*. Replicas are always chosen such |
that they are distinct physical nodes, which is achieved by skipping
| virtual nodes if needed. Replication strategies may also choose to skip |
| nodes present in the same failure domain such as racks or datacenters so |
| that Cassandra clusters can tolerate failures of whole racks and even |
| datacenters of nodes. |
| |
| === Replication Strategy |
| |
| Cassandra supports pluggable *replication strategies*, which determine |
| which physical nodes act as replicas for a given token range. Every |
| keyspace of data has its own replication strategy. All production |
| deployments should use the `NetworkTopologyStrategy` while the |
| `SimpleStrategy` replication strategy is useful only for testing |
| clusters where you do not yet know the datacenter layout of the cluster. |
| |
| [[network-topology-strategy]] |
| ==== `NetworkTopologyStrategy` |
| |
| `NetworkTopologyStrategy` requires a specified replication factor |
| for each datacenter in the cluster. Even if your cluster only uses a |
| single datacenter, `NetworkTopologyStrategy` is recommended over |
| `SimpleStrategy` to make it easier to add new physical or virtual |
| datacenters to the cluster later, if required. |
| |
| In addition to allowing the replication factor to be specified |
| individually by datacenter, `NetworkTopologyStrategy` also attempts to |
| choose replicas within a datacenter from different racks as specified by |
| the `Snitch`. If the number of racks is greater than or equal |
| to the replication factor for the datacenter, each replica is guaranteed |
| to be chosen from a different rack. Otherwise, each rack will hold at |
| least one replica, but some racks may hold more than one. Note that this |
| rack-aware behavior has some potentially |
| https://issues.apache.org/jira/browse/CASSANDRA-3810[surprising |
implications]. For example, if there is not an equal number of nodes in
| each rack, the data load on the smallest rack may be much higher. |
| Similarly, if a single node is bootstrapped into a brand new rack, it |
| will be considered a replica for the entire ring. For this reason, many |
| operators choose to configure all nodes in a single availability zone or |
| similar failure domain as a single "rack". |
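
The rack-skipping behavior can be sketched as follows. This is a
simplification of what `NetworkTopologyStrategy` actually does (node and
rack names are hypothetical), but it shows both the one-replica-per-rack
preference and the fallback when racks run out:

[source,python]
----
def rack_aware_replicas(ring_order, racks, rf):
    # ring_order: nodes in clockwise ring order starting at the key's token.
    # racks: node -> rack name.
    chosen, seen_racks, skipped = [], set(), []
    for node in ring_order:
        if node in chosen:
            continue
        if racks[node] in seen_racks:
            skipped.append(node)  # revisit only if we run out of racks
            continue
        chosen.append(node)
        seen_racks.add(racks[node])
        if len(chosen) == rf:
            return chosen
    for node in skipped:          # fewer racks than rf: racks repeat
        chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen

racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
print(rack_aware_replicas(["n1", "n2", "n3", "n4"], racks, rf=3))
# ['n1', 'n3', 'n4'] -- n2 is skipped because rack r1 already has a replica
----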
| |
| [[simple-strategy]] |
| ==== `SimpleStrategy` |
| |
| `SimpleStrategy` allows a single integer `replication_factor` to be |
| defined. This determines the number of nodes that should contain a copy |
| of each row. For example, if `replication_factor` is 3, then three |
| different nodes should store a copy of each row. |
| |
| `SimpleStrategy` treats all nodes identically, ignoring any configured |
| datacenters or racks. To determine the replicas for a token range, |
| Cassandra iterates through the tokens in the ring, starting with the |
| token range of interest. For each token, it checks whether the owning |
| node has been added to the set of replicas, and if it has not, it is |
| added to the set. This process continues until `replication_factor` |
| distinct nodes have been added to the set of replicas. |
| |
| ==== Transient Replication |
| |
| Transient replication is an experimental feature in Cassandra {40_version} not |
| present in the original Dynamo paper. This feature allows configuration of a |
| subset of replicas to replicate only data that hasn't been incrementally |
| repaired. This configuration decouples data redundancy from availability. |
| For instance, if you have a keyspace replicated at RF=3, and alter it to |
| RF=5 with two transient replicas, you go from tolerating one |
failed replica to tolerating two, without a corresponding
increase in storage usage. Now, three nodes will replicate all
| the data for a given token range, and the other two will only replicate |
| data that hasn't been incrementally repaired. |
| |
| To use transient replication, first enable the option in |
| `cassandra.yaml`. Once enabled, both `SimpleStrategy` and |
| `NetworkTopologyStrategy` can be configured to transiently replicate |
data. Configure it by specifying the replication factor as
`<total_replicas>/<transient_replicas>`.
| |
| Transiently replicated keyspaces only support tables created with |
| `read_repair` set to `NONE`; monotonic reads are not currently |
| supported. You also can't use `LWT`, logged batches, or counters in {40_version}. |
| You will possibly never be able to use materialized views with |
| transiently replicated keyspaces and probably never be able to use |
| secondary indices with them. |
| |
| Transient replication is an experimental feature that is not ready |
| for production use. The expected audience is experienced users of |
| Cassandra capable of fully validating a deployment of their particular |
application. That means being able to check that operations like reads,
| writes, decommission, remove, rebuild, repair, and replace all work with |
| your queries, data, configuration, operational practices, and |
| availability requirements. |
| |
| Anticipated additional features in `4.next` are support for monotonic reads with |
| transient replication, as well as LWT, logged batches, and counters. |
| |
| === Data Versioning |
| |
| Cassandra uses mutation timestamp versioning to guarantee eventual |
| consistency of data. Specifically all mutations that enter the system do |
| so with a timestamp provided either from a client clock or, absent a |
| client provided timestamp, from the coordinator node's clock. Updates |
| resolve according to the conflict resolution rule of last write wins. |
| Cassandra's correctness does depend on these clocks, so make sure a |
proper time synchronization process, such as NTP, is running.
| |
| Cassandra applies separate mutation timestamps to every column of every |
| row within a CQL partition. Rows are guaranteed to be unique by primary |
key, and each column in a row resolves concurrent mutations according to
| last-write-wins conflict resolution. This means that updates to |
| different primary keys within a partition can actually resolve without |
| conflict! Furthermore the CQL collection types such as maps and sets use |
| this same conflict free mechanism, meaning that concurrent updates to |
| maps and sets are guaranteed to resolve as well. |
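
As a toy illustration of column-level reconciliation (not Cassandra's
actual storage format), merging two replica versions of the same row
column by column:

[source,python]
----
def merge_rows(a: dict, b: dict) -> dict:
    # Each column maps to a (timestamp, value) pair; per column, the
    # latest timestamp wins, so disjoint column updates never conflict.
    merged = dict(a)
    for column, (ts, value) in b.items():
        if column not in merged or ts > merged[column][0]:
            merged[column] = (ts, value)
    return merged

replica_a = {"name": (10, "alice"), "email": (12, "a@new.example")}
replica_b = {"name": (11, "Alice"), "city": (9, "Paris")}
print(merge_rows(replica_a, replica_b))
# {'name': (11, 'Alice'), 'email': (12, 'a@new.example'), 'city': (9, 'Paris')}
----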
| |
| ==== Replica Synchronization |
| |
| As replicas in Cassandra can accept mutations independently, it is |
| possible for some replicas to have newer data than others. Cassandra has |
| many best-effort techniques to drive convergence of replicas including |
| `Replica read repair <read-repair>` in the read path and |
| `Hinted handoff <hints>` in the write path. |
| |
| These techniques are only best-effort, however, and to guarantee |
| eventual consistency Cassandra implements `anti-entropy |
| repair <repair>` where replicas calculate hierarchical hash-trees over |
| their datasets called https://en.wikipedia.org/wiki/Merkle_tree[Merkle |
| trees] that can then be compared across replicas to identify mismatched |
data. Like the original Dynamo paper, Cassandra supports full repairs
| where replicas hash their entire dataset, create Merkle trees, send them |
| to each other and sync any ranges that don't match. |
| |
| Unlike the original Dynamo paper, Cassandra also implements sub-range |
| repair and incremental repair. Sub-range repair allows Cassandra to |
| increase the resolution of the hash trees (potentially down to the |
| single partition level) by creating a larger number of trees that span |
| only a portion of the data range. Incremental repair allows Cassandra to |
| only repair the partitions that have changed since the last repair. |
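
The value of the Merkle tree structure is that replicas can find
mismatched ranges while exchanging only hashes. A minimal sketch, assuming
a power-of-two number of leaf ranges and hypothetical range labels:

[source,python]
----
import hashlib

def build_tree(leaves):
    # leaves: list of (range_label, data); assumes a power-of-two count.
    nodes = [(hashlib.sha256(data).digest(), label) for label, data in leaves]
    while len(nodes) > 1:
        nodes = [(hashlib.sha256(a[0] + b[0]).digest(), (a, b))
                 for a, b in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

def diff(a, b, out):
    if a[0] == b[0]:
        return                    # identical subtree: skip it entirely
    if isinstance(a[1], str):
        out.append(a[1])          # mismatched leaf: this range needs syncing
        return
    diff(a[1][0], b[1][0], out)
    diff(a[1][1], b[1][1], out)

replica_1 = build_tree([("r1", b"a"), ("r2", b"b"), ("r3", b"c"), ("r4", b"d")])
replica_2 = build_tree([("r1", b"a"), ("r2", b"B"), ("r3", b"c"), ("r4", b"d")])
mismatched = []
diff(replica_1, replica_2, mismatched)
print(mismatched)                 # ['r2'] -- only this range is streamed
----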
| |
| === Tunable Consistency |
| |
| Cassandra supports a per-operation tradeoff between consistency and |
| availability through *Consistency Levels*. Cassandra's consistency |
| levels are a version of Dynamo's `R + W > N` consistency mechanism where |
| operators could configure the number of nodes that must participate in |
| reads (`R`) and writes (`W`) to be larger than the replication factor |
| (`N`). In Cassandra, you instead choose from a menu of common |
| consistency levels which allow the operator to pick `R` and `W` behavior |
| without knowing the replication factor. Generally writes will be visible |
| to subsequent reads when the read consistency level contains enough |
| nodes to guarantee a quorum intersection with the write consistency |
| level. |
| |
| The following consistency levels are available: |
| |
| `ONE`:: |
| Only a single replica must respond. |
| `TWO`:: |
| Two replicas must respond. |
| `THREE`:: |
| Three replicas must respond. |
| `QUORUM`:: |
| A majority (n/2 + 1) of the replicas must respond. |
| `ALL`:: |
| All of the replicas must respond. |
| `LOCAL_QUORUM`:: |
| A majority of the replicas in the local datacenter (whichever |
| datacenter the coordinator is in) must respond. |
| `EACH_QUORUM`:: |
| A majority of the replicas in each datacenter must respond. |
| `LOCAL_ONE`:: |
| Only a single replica must respond. In a multi-datacenter cluster, |
this also guarantees that read requests are not sent to replicas in a
| remote datacenter. |
| `ANY`:: |
| A single replica may respond, or the coordinator may store a hint. If |
| a hint is stored, the coordinator will later attempt to replay the |
| hint and deliver the mutation to the replicas. This consistency level |
| is only accepted for write operations. |
| |
| Write operations *are always sent to all replicas*, regardless of |
| consistency level. The consistency level simply controls how many |
| responses the coordinator waits for before responding to the client. |
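
A simplified, single-datacenter sketch of how a consistency level
translates into the number of awaited responses (the helper name is
hypothetical):

[source,python]
----
def block_for(level: str, rf: int) -> int:
    # Responses the coordinator waits for; quorum is a majority,
    # floor(rf / 2) + 1. Single-datacenter simplification.
    quorum = rf // 2 + 1
    return {"ONE": 1, "TWO": 2, "THREE": 3,
            "QUORUM": quorum, "LOCAL_QUORUM": quorum,
            "ALL": rf}[level]

print(block_for("QUORUM", rf=3))  # 2: returns after two of three replicas ack
print(block_for("ALL", rf=5))     # 5
----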
| |
For read operations, the coordinator generally only issues read commands
to enough replicas to satisfy the consistency level. The one exception
is speculative retry, which may issue a redundant read request to an
extra replica if the original replicas have not responded within a
specified time window.
| |
| ==== Picking Consistency Levels |
| |
| It is common to pick read and write consistency levels such that the |
| replica sets overlap, resulting in all acknowledged writes being visible |
| to subsequent reads. This is typically expressed in the same terms |
| Dynamo does, in that `W + R > RF`, where `W` is the write consistency |
| level, `R` is the read consistency level, and `RF` is the replication |
| factor. For example, if `RF = 3`, a `QUORUM` request will require |
responses from at least two of the three replicas. If `QUORUM` is used for both
| writes and reads, at least one of the replicas is guaranteed to |
| participate in _both_ the write and the read request, which in turn |
| guarantees that the quorums will overlap and the write will be visible |
| to the read. |
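
The overlap argument is simple arithmetic:

[source,python]
----
RF = 3
QUORUM = RF // 2 + 1         # 2 when RF = 3

# QUORUM writes plus QUORUM reads must share at least one replica:
assert QUORUM + QUORUM > RF  # 2 + 2 > 3, so the quorums intersect

# W = R = ONE gives no such guarantee:
assert not (1 + 1 > RF)
----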
| |
| In a multi-datacenter environment, `LOCAL_QUORUM` can be used to provide |
| a weaker but still useful guarantee: reads are guaranteed to see the |
| latest write from within the same datacenter. This is often sufficient |
| as clients homed to a single datacenter will read their own writes. |
| |
| If this type of strong consistency isn't required, lower consistency |
| levels like `LOCAL_ONE` or `ONE` may be used to improve throughput, |
| latency, and availability. With replication spanning multiple |
| datacenters, `LOCAL_ONE` is typically less available than `ONE` but is |
| faster as a rule. Indeed `ONE` will succeed if a single replica is |
| available in any datacenter. |
| |
| == Distributed Cluster Membership and Failure Detection |
| |
| The replication protocols and dataset partitioning rely on knowing which |
| nodes are alive and dead in the cluster so that write and read |
| operations can be optimally routed. In Cassandra liveness information is |
| shared in a distributed fashion through a failure detection mechanism |
| based on a gossip protocol. |
| |
| === Gossip |
| |
| Gossip is how Cassandra propagates basic cluster bootstrapping |
| information such as endpoint membership and internode network protocol |
| versions. In Cassandra's gossip system, nodes exchange state information |
| not only about themselves but also about other nodes they know about. |
| This information is versioned with a vector clock of |
| `(generation, version)` tuples, where the generation is a monotonic |
timestamp and the version is a logical clock that increments roughly every
| second. These logical clocks allow Cassandra gossip to ignore old |
| versions of cluster state just by inspecting the logical clocks |
| presented with gossip messages. |
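
Because the pairs are compared generation first, a plain tuple comparison
captures the rule; a tiny sketch:

[source,python]
----
# Gossip state versioning sketch: (generation, version) pairs compared
# lexicographically, so a restarted node (new generation) always looks
# newer than any state from its previous life.
def newer(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a > b  # Python tuples compare generation first, then version

old_state = (1_700_000_000, 57)  # previous incarnation, high version
restarted = (1_700_001_000, 3)   # new generation, version reset
print(newer(restarted, old_state))  # True: the newer generation wins
----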
| |
| Every node in the Cassandra cluster runs the gossip task independently |
and periodically. Every second, every node in the cluster (see the sketch
after this list):
| |
| [arabic] |
| . Updates the local node's heartbeat state (the version) and constructs |
| the node's local view of the cluster gossip endpoint state. |
| . Picks a random other node in the cluster to exchange gossip endpoint |
| state with. |
. Probabilistically attempts to gossip with any unreachable nodes (if
any exist).
| . Gossips with a seed node if that didn't happen in step 2. |
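
A minimal sketch of this per-second round, following the four numbered
steps (the classes, the message exchange, and the step-3 probability are
all placeholders):

[source,python]
----
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.heartbeat = 0
    def exchange(self, digest):
        print(f"gossip with {self.name}: {digest}")  # stands in for the real exchange

def gossip_round(node, peers, seeds, unreachable):
    node.heartbeat += 1                        # step 1: bump heartbeat version
    digest = (node.name, node.heartbeat)
    target = random.choice(peers)              # step 2: random live peer
    target.exchange(digest)
    if unreachable and random.random() < 0.5:  # step 3: sometimes probe a dead node
        random.choice(unreachable).exchange(digest)
    if seeds and target not in seeds:          # step 4: fall back to a seed
        random.choice(seeds).exchange(digest)

n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
gossip_round(n1, peers=[n2, n3], seeds=[n2], unreachable=[])
----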
| |
When an operator first bootstraps a Cassandra cluster, they designate
| certain nodes as seed nodes. Any node can be a seed node and the only |
| difference between seed and non-seed nodes is seed nodes are allowed to |
| bootstrap into the ring without seeing any other seed nodes. |
| Furthermore, once a cluster is bootstrapped, seed nodes become |
| hotspots for gossip due to step 4 above. |
| |
| As non-seed nodes must be able to contact at least one seed node in |
| order to bootstrap into the cluster, it is common to include multiple |
| seed nodes, often one for each rack or datacenter. Seed nodes are often |
| chosen using existing off-the-shelf service discovery mechanisms. |
| |
| [NOTE] |
| .Note |
| ==== |
| Nodes do not have to agree on the seed nodes, and indeed once a cluster |
| is bootstrapped, newly launched nodes can be configured to use any |
| existing nodes as seeds. The only advantage to picking the same nodes |
as seeds is it increases their usefulness as gossip hotspots.
| ==== |
| |
| Currently, gossip also propagates token metadata and schema |
| _version_ information. This information forms the control plane for |
| scheduling data movements and schema pulls. For example, if a node sees |
| a mismatch in schema version in gossip state, it will schedule a schema |
| sync task with the other nodes. As token information propagates via |
| gossip it is also the control plane for teaching nodes which endpoints |
| own what data. |
| |
| === Ring Membership and Failure Detection |
| |
| Gossip forms the basis of ring membership, but the *failure detector* |
ultimately makes decisions about whether nodes are `UP` or `DOWN`. Every node
| in Cassandra runs a variant of the |
| https://www.computer.org/csdl/proceedings-article/srds/2004/22390066/12OmNvT2phv[Phi |
Accrual Failure Detector], in which every node independently and
continuously decides whether its peer nodes are available. This
| decision is primarily based on received heartbeat state. For example, if |
| a node does not see an increasing heartbeat from a node for a certain |
| amount of time, the failure detector "convicts" that node, at which |
| point Cassandra will stop routing reads to it (writes will typically be |
| written to hints). If/when the node starts heartbeating again, Cassandra |
| will try to reach out and connect, and if it can open communication |
| channels it will mark that node as available. |
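
A compact sketch of the idea, assuming exponentially distributed heartbeat
intervals (Cassandra's detector is more sophisticated, but the shape is
the same; the threshold of 8 mirrors the default `phi_convict_threshold`):

[source,python]
----
import math
from collections import deque

class PhiAccrualDetector:
    def __init__(self, threshold=8.0, window=1000):
        self.threshold = threshold
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last = None

    def heartbeat(self, now: float):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now: float) -> float:
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        # Under an exponential model, P(silence of length t) = exp(-t / mean),
        # and phi is -log10 of that probability.
        return (now - self.last) / mean * math.log10(math.e)

    def is_up(self, now: float) -> bool:
        return self.phi(now) < self.threshold

d = PhiAccrualDetector()
for t in range(10):          # steady heartbeats, one per second
    d.heartbeat(float(t))
print(d.is_up(10.5))         # True: phi is still tiny
print(d.is_up(40.0))         # False: long silence convicts the node
----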
| |
| [NOTE] |
| .Note |
| ==== |
| `UP` and `DOWN` state are local node decisions and are not propagated with |
| gossip. Heartbeat state is propagated with gossip, but nodes will not |
| consider each other as `UP` until they can successfully message each |
| other over an actual network channel. |
| ==== |
| |
| Cassandra will never remove a node from gossip state without |
| explicit instruction from an operator via a decommission operation or a |
| new node bootstrapping with a `replace_address_first_boot` option. This |
| choice is intentional to allow Cassandra nodes to temporarily fail |
| without causing data to needlessly re-balance. This also helps to |
| prevent simultaneous range movements, where multiple replicas of a token |
| range are moving at the same time, which can violate monotonic |
| consistency and can even cause data loss. |
| |
| == Incremental Scale-out on Commodity Hardware |
| |
Cassandra scales out to meet the requirements of growth in data size and
request rates. Scaling out means adding additional nodes to the ring,
and every additional node brings linear improvements in compute and
storage. In contrast, scaling up means adding more capacity to the
existing database nodes. Cassandra is also capable of scaling up, and in
certain environments it may be preferable depending on the deployment.
Cassandra gives operators the flexibility to choose either scale-out or
scale-up.
| |
| One key aspect of Dynamo that Cassandra follows is to attempt to run on |
| commodity hardware, and many engineering choices are made under this |
| assumption. For example, Cassandra assumes nodes can fail at any time, |
auto-tunes to make the best use of CPU and memory resources available,
| and makes heavy use of advanced compression and caching techniques to |
| get the most storage out of limited memory and storage capabilities. |
| |
| === Simple Query Model |
| |
| Cassandra, like Dynamo, chooses not to provide cross-partition |
| transactions that are common in SQL Relational Database Management |
| Systems (RDBMS). This both gives the programmer a simpler read and write |
| API, and allows Cassandra to more easily scale horizontally since |
| multi-partition transactions spanning multiple nodes are notoriously |
difficult to implement and typically incur high latency.
| |
Instead, Cassandra chooses to offer fast, consistent latency at any
| scale for single partition operations, allowing retrieval of entire |
| partitions or only subsets of partitions based on primary key filters. |
| Furthermore, Cassandra does support single partition compare and swap |
| functionality via the lightweight transaction CQL API. |
| |
| === Simple Interface for Storing Records |
| |
| Cassandra, in a slight departure from Dynamo, chooses a storage |
interface that is more sophisticated than "simple key value" stores but
| significantly less complex than SQL relational data models. Cassandra |
| presents a wide-column store interface, where partitions of data contain |
| multiple rows, each of which contains a flexible set of individually |
| typed columns. Every row is uniquely identified by the partition key and |
| one or more clustering keys, and every row can have as many columns as |
| needed. |
| |
| This allows users to flexibly add new columns to existing datasets as |
| new requirements surface. Schema changes involve only metadata changes |
| and run fully concurrently with live workloads. Therefore, users can |
| safely add columns to existing Cassandra databases while remaining |
| confident that query performance will not degrade. |