| = Guarantees |
| |
| Apache Cassandra is a highly scalable and reliable database. It is |
| used in web-based applications that serve a large number of clients and |
| process data at web scale (petabytes). Cassandra makes some guarantees |
| about its scalability, availability and reliability. Because a storage |
| system must be designed with the expectation that network partitions |
| will occur, its inherent limitations are easiest to understand after a |
| brief introduction to the CAP theorem. |
| |
| == What is CAP? |
| |
| According to the CAP theorem, it is not possible for a distributed data |
| store to provide more than two of the following guarantees |
| simultaneously: |
| |
| * Consistency: every read receives the most recent write or returns an |
| error. |
| * Availability: every request receives a response, but the response is |
| not guaranteed to contain the most recent write. |
| * Partition tolerance: the system continues to operate even if a |
| network partition causes some messages between nodes to be dropped or |
| delayed. |
| |
| The CAP theorem implies that when a network partition occurs, a |
| system has to choose between consistency and availability; both cannot |
| be guaranteed at the same time. The CAP theorem is illustrated in |
| Figure 1. |
| |
| image::Figure_1_guarantees.jpg[image] |
| |
| Figure 1. CAP Theorem |
| |
| High availability is a priority in web-based applications. To meet |
| this objective, Cassandra chooses Availability and Partition Tolerance |
| from the CAP guarantees, compromising on data Consistency to some |
| extent. |
| |
| Cassandra makes the following guarantees. |
| |
| * High Scalability |
| * High Availability |
| * Durability |
| * Eventual Consistency of writes to a single table |
| * Lightweight transactions with linearizable consistency |
| * Batched writes across multiple tables are guaranteed to succeed |
| completely or not at all |
| * Secondary indexes are guaranteed to be consistent with their local |
| replicas' data |
| |
| == High Scalability |
| |
| Cassandra is a highly scalable storage system in which nodes may be |
| added or removed as needed. Using a gossip-based protocol, a unified and |
| consistent membership list is kept at each node. |
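
As a rough illustration of how gossip can bring every node to the same membership list, here is a minimal Python sketch. It is not Cassandra's implementation: the `View` type, the heartbeat versions, and the fixed "talk to the next node" peer choice are all assumptions made to keep the example deterministic (real gossip protocols typically pick peers at random).

```python
from typing import Dict, List

# Each node's view maps a peer name to the highest heartbeat version seen for it.
# (Hypothetical representation, chosen only for this sketch.)
View = Dict[str, int]


def merge(a: View, b: View) -> View:
    """Combine two views, keeping the highest heartbeat version per node."""
    return {name: max(a.get(name, 0), b.get(name, 0)) for name in a.keys() | b.keys()}


def gossip_round(views: Dict[str, View], order: List[str]) -> None:
    """Each node exchanges its view with the next node in `order`.

    Both sides end up with the merged view (a push-pull exchange).
    """
    for i, name in enumerate(order):
        peer = order[(i + 1) % len(order)]
        merged = merge(views[name], views[peer])
        views[name] = dict(merged)
        views[peer] = dict(merged)


def converge(node_names: List[str]) -> Dict[str, View]:
    """Start with each node knowing only itself, then gossip for len(nodes) rounds."""
    views: Dict[str, View] = {name: {name: 1} for name in node_names}
    for _ in range(len(node_names)):
        gossip_round(views, node_names)
    return views
```

Calling `converge(["n1", "n2", "n3"])` returns one view per node; after the rounds complete, every view lists all three nodes.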
| |
| == High Availability |
| |
| Cassandra guarantees high availability of data by implementing a |
| fault-tolerant storage system. Node failure is detected using a |
| gossip-based protocol. |
| |
| == Durability |
| |
| Cassandra guarantees data durability by using replicas. Replicas are |
| multiple copies of the same data stored on different nodes in a cluster. |
| In a multi-datacenter environment the replicas may be stored in |
| different datacenters. If one replica is lost due to an unrecoverable |
| node or datacenter failure, the data is not completely lost because the |
| other replicas are still available. |
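
The effect of replication on durability can be sketched with a toy in-memory cluster. Everything here is hypothetical: the `Cluster` class, the byte-sum placement rule, and the `lose_node` helper exist only for illustration and do not reflect how Cassandra actually places replicas.

```python
from typing import Dict, List, Optional


class Cluster:
    """Toy cluster that stores each key on `replication_factor` nodes."""

    def __init__(self, node_names: List[str], replication_factor: int) -> None:
        self.order = list(node_names)
        self.nodes: Dict[str, Dict[str, str]] = {name: {} for name in node_names}
        self.rf = replication_factor

    def replicas_for(self, key: str) -> List[str]:
        # Toy placement (an assumption): start at a position derived from the
        # key's bytes and take the next `rf` nodes in order.
        start = sum(key.encode()) % len(self.order)
        return [self.order[(start + i) % len(self.order)] for i in range(self.rf)]

    def write(self, key: str, value: str) -> None:
        """Store the value on every replica node that is still alive."""
        for name in self.replicas_for(key):
            if name in self.nodes:
                self.nodes[name][key] = value

    def lose_node(self, name: str) -> None:
        """Simulate an unrecoverable node failure: its data is gone."""
        self.nodes.pop(name, None)

    def read(self, key: str) -> Optional[str]:
        """Return the value from the first surviving replica that has it."""
        for name in self.replicas_for(key):
            store = self.nodes.get(name)
            if store is not None and key in store:
                return store[key]
        return None
```

With a replication factor of 3, losing one of the nodes that holds a key still leaves two copies to read from; only when every replica of the key is lost does `read` return `None`.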
| |
| == Eventual Consistency |
| |
| To meet the requirements of performance, reliability, scalability and |
| high availability in production, Cassandra is an eventually consistent |
| storage system. Eventual consistency means that all updates reach all |
| replicas eventually. Divergent versions of the same data may exist |
| temporarily, but they are eventually reconciled to a consistent state. |
| Eventual consistency is a tradeoff made to achieve high availability, |
| and it involves some read and write latencies. |
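
To make "divergent versions are eventually reconciled" concrete, the following Python sketch reconciles replicas under an assumed last-write-wins rule based on write timestamps. This section does not spell out the reconciliation rule, so the rule, the `Replica` type, and the `reconcile` function are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# A replica maps each key to a (timestamp, value) pair. (Hypothetical layout.)
Replica = Dict[str, Tuple[int, str]]


def reconcile(replicas: List[Replica]) -> None:
    """Bring all replicas to the newest (timestamp, value) seen for each key."""
    newest: Replica = {}
    # First pass: find the version with the highest timestamp for every key.
    for replica in replicas:
        for key, versioned in replica.items():
            if key not in newest or versioned[0] > newest[key][0]:
                newest[key] = versioned
    # Second pass: copy the winning versions to every replica.
    for replica in replicas:
        replica.update(newest)
```

For example, if one replica holds `{"k": (1, "old")}`, another holds `{"k": (2, "new")}`, and a third is empty, all three hold the timestamp-2 version after `reconcile` runs.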
| |
| == Lightweight transactions with linearizable consistency |
| |
| Data must be read and written in a sequential order. Cassandra uses the |
| Paxos consensus protocol to implement lightweight transactions, which |
| handle concurrent operations with linearizable consistency. |
| Linearizable consistency is sequential consistency with real-time |
| constraints, and it ensures transaction isolation through compare-and-set |
| (CAS) transactions. With CAS, replica data is compared, and data that is |
| found to be out of date is set to the most consistent value. Reads with |
| linearizable consistency return the current state of the data, which |
| may possibly be uncommitted, without making a new addition or update. |
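
The sketch below shows one common reading of compare-and-set: a write is applied only if the current value matches what the caller expects. It is a single-process illustration in which a lock stands in for the Paxos rounds, so it demonstrates only the conditional-write semantics, not the consensus protocol. The `Register` class and its methods are hypothetical names.

```python
import threading
from typing import Dict, Optional


class Register:
    """A single-process stand-in for a row that supports compare-and-set."""

    def __init__(self) -> None:
        self._values: Dict[str, str] = {}
        self._lock = threading.Lock()  # stands in for consensus; NOT Paxos

    def compare_and_set(self, key: str, expected: Optional[str], new: str) -> bool:
        """Write `new` only if the current value equals `expected`.

        `expected=None` means "only if the key does not exist yet".
        Returns True if the write was applied, False otherwise.
        """
        with self._lock:
            if self._values.get(key) != expected:
                return False
            self._values[key] = new
            return True

    def read(self, key: str) -> Optional[str]:
        """Return the current value, or None if the key is absent."""
        with self._lock:
            return self._values.get(key)
```

Two clients that both try to create the same key with `expected=None` cannot both succeed: the first call returns `True` and the second returns `False`, leaving the first value in place.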
| |
| == Batched Writes |
| |
| The guarantee for batched writes across multiple tables is that either |
| all of them will eventually succeed or none will. Batch data is first |
| written to the batchlog system data, and once the batch data has been |
| successfully stored in the cluster, the batchlog data is removed. The |
| batch is replicated to another node to ensure that the full batch |
| completes if the coordinator node fails. |
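
The log-then-apply-then-clear sequence described above can be sketched as follows. This is a simplified model under stated assumptions: a single `BatchCoordinator` object stands in for both the coordinator and the node holding the batchlog copy, `fail_after` is a hypothetical knob that simulates a crash, and mutations are plain idempotent key/value writes.

```python
from typing import Dict, List, Tuple

Mutation = Tuple[str, str, str]  # (table, key, value)


class BatchCoordinator:
    """Toy model of a batchlog: log the batch, apply it, then clear the log."""

    def __init__(self) -> None:
        self.tables: Dict[str, Dict[str, str]] = {}
        self.batchlog: Dict[int, List[Mutation]] = {}
        self._next_id = 0

    def execute(self, mutations: List[Mutation], fail_after: int = -1) -> int:
        """Run a batch; `fail_after=n` simulates a crash after n mutations."""
        batch_id = self._next_id
        self._next_id += 1
        self.batchlog[batch_id] = list(mutations)  # 1. write the batchlog first
        for applied, mutation in enumerate(mutations):
            if applied == fail_after:
                return batch_id  # simulated crash: the log entry is left behind
            self._apply(mutation)  # 2. apply each mutation
        del self.batchlog[batch_id]  # 3. remove the log entry once all applied
        return batch_id

    def replay(self) -> None:
        """Re-apply every batch still in the log, then clear it."""
        for batch_id in list(self.batchlog):
            for mutation in self.batchlog[batch_id]:
                self._apply(mutation)  # safe to repeat: writes are idempotent here
            del self.batchlog[batch_id]

    def _apply(self, mutation: Mutation) -> None:
        table, key, value = mutation
        self.tables.setdefault(table, {})[key] = value
```

If `execute` stops partway through, the batch remains in `batchlog`, and a later `replay` finishes the remaining writes so the batch ends up fully applied.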
| |
| == Secondary Indexes |
| |
| A secondary index is an index on a column, used to query a table by a |
| column that is not normally queryable. Once built, secondary indexes are |
| guaranteed to be consistent with their local replicas. |
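
One way to picture an index that stays consistent with its local replica is to update the index in the same step as the row it describes. The sketch below does this for a single in-memory replica; the `ReplicaWithIndex` class and its methods are hypothetical and are not Cassandra's index implementation.

```python
from collections import defaultdict
from typing import Dict, Set


class ReplicaWithIndex:
    """One replica's rows plus a local index on a single column."""

    def __init__(self, indexed_column: str) -> None:
        self.indexed_column = indexed_column
        self.rows: Dict[str, Dict[str, str]] = {}
        self.index: Dict[str, Set[str]] = defaultdict(set)

    def upsert(self, key: str, row: Dict[str, str]) -> None:
        """Write a row and update the local index in the same step."""
        old = self.rows.get(key)
        if old is not None and self.indexed_column in old:
            # Remove the stale index entry for the previous column value.
            self.index[old[self.indexed_column]].discard(key)
        self.rows[key] = dict(row)
        if self.indexed_column in row:
            self.index[row[self.indexed_column]].add(key)

    def find(self, value: str) -> Set[str]:
        """Return the keys of local rows whose indexed column equals `value`."""
        return set(self.index.get(value, set()))
```

Because the index only ever covers rows stored on the same replica, `find` reflects exactly what that replica holds, which is the sense of "consistent with their local replicas" used here.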