| = Read repair |
| |
| Read repair is the process of repairing data replicas during a read |
| request. If all replicas involved in a read request at the given read |
| consistency level are consistent, the data is returned to the client |
| and no read repair is needed. If they are not consistent, a read |
| repair is performed to make the replicas involved in the read request |
| consistent, and the most up-to-date data is returned to the client. |
| Read repair runs in the foreground and is blocking: a response is not |
| returned to the client until the read repair has completed and the |
| up-to-date data has been constructed. |
| |
| == Expectation of Monotonic Quorum Reads |
| |
| Cassandra uses a blocking read repair to meet the expectation of |
| "monotonic quorum reads": in two successive quorum reads, the second |
| is guaranteed not to return something older than the first, even if a |
| failed quorum write delivered the most up-to-date value to only a |
| minority of replicas. "Quorum" means a majority of the replica nodes. |
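| |
| The majority arithmetic behind this expectation can be sketched in a few |
| lines of Python. The helper names `quorum` and `quorums_always_overlap` |
| are hypothetical and are not Cassandra APIs; the sketch only assumes the |
| usual definition of a majority as `replication_factor // 2 + 1`. |
| |
```python
from itertools import combinations


def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas (hypothetical helper, not a Cassandra API)."""
    return replication_factor // 2 + 1


def quorums_always_overlap(replication_factor: int) -> bool:
    """Return True if every pair of quorum-sized replica sets shares a replica."""
    size = quorum(replication_factor)
    groups = [set(g) for g in combinations(range(replication_factor), size)]
    return all(a & b for a in groups for b in groups)


if __name__ == "__main__":
    for rf in (3, 5):
        print(f"RF={rf}: quorum={quorum(rf)}, overlap={quorums_always_overlap(rf)}")
```
| |
| Overlap alone does not cover a failed write that reached only a minority |
| of replicas, which is why the blocking read repair is needed in addition. |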
| |
| == Table level configuration of monotonic reads |
| |
| Cassandra 4.0 adds support for table level configuration of monotonic |
| reads |
| (https://issues.apache.org/jira/browse/CASSANDRA-14635[CASSANDRA-14635]). |
| The `read_repair` table option has been added to the table schema, with |
| the values `BLOCKING` (default) and `NONE`. |
| |
| The `read_repair` option configures the read repair behavior to allow |
| tuning for various performance and consistency behaviors. Two |
| consistency properties are affected by read repair behavior. |
| |
| * Monotonic Quorum Reads: Provided by `BLOCKING`. Monotonic quorum reads |
| prevent reads from appearing to go back in time in some circumstances. |
| When monotonic quorum reads are not provided and a write fails to reach |
| a quorum of replicas, it may be visible in one read, and then disappear |
| in a subsequent read. |
| * Write Atomicity: Provided by `NONE`. Write atomicity prevents reads |
| from returning partially applied writes. Cassandra attempts to provide |
| partition level write atomicity, but since only the data covered by a |
| `SELECT` statement is repaired by a read repair, read repair can break |
| write atomicity when data is read at a more granular level than it is |
| written. For example, read repair can break write atomicity if you write |
| multiple rows to a clustered partition in a batch, but then select a |
| single row by specifying the clustering column in a `SELECT` statement. |
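| |
| The write atomicity case can be illustrated with a small in-memory |
| model. This is a sketch under simplifying assumptions, not Cassandra |
| code: each replica is a plain dict from clustering key to a |
| `(value, write_timestamp)` pair, and `read_row_with_repair` is a |
| hypothetical helper. |
| |
```python
def read_row_with_repair(replicas, clustering_key):
    """Read one row from every given replica, repair stale copies, return the newest.

    Each replica is a dict mapping clustering key -> (value, write_timestamp).
    """
    versions = [r[clustering_key] for r in replicas if clustering_key in r]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell[1])
    for replica in replicas:
        if replica.get(clustering_key) != newest:
            # Only the selected row is repaired; other rows are left untouched.
            replica[clustering_key] = newest
    return newest


# Both replicas start with the same two rows in one partition.
replica_a = {1: ("old", 1), 2: ("old", 1)}
replica_b = {1: ("old", 1), 2: ("old", 1)}

# A batch rewrites rows 1 and 2 at timestamp 2 but reaches only replica_a.
replica_a[1] = ("new", 2)
replica_a[2] = ("new", 2)

# Selecting the single row with clustering key 1 repairs that row only.
result = read_row_with_repair([replica_a, replica_b], 1)

# replica_b now holds half of the batch: row 1 is new, row 2 is still old.
```
| |
| A later read of the whole partition served by `replica_b` alone would |
| see the batch partially applied, which is the behavior `NONE` avoids. |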
| |
| The available read repair settings are: |
| |
| === Blocking |
| |
| The default setting. When `read_repair` is set to `BLOCKING` and a read |
| repair is started, the read blocks on the writes sent to other replicas |
| until those writes satisfy the consistency level. Provides monotonic |
| quorum reads, but not partition level write atomicity. |
| |
| === None |
| |
| When `read_repair` is set to `NONE`, the coordinator will reconcile any |
| differences between replicas, but will not attempt to repair them. |
| Provides partition level write atomicity, but not monotonic quorum |
| reads. |
| |
| An example of using the `NONE` setting for the `read_repair` option is |
| as follows: |
| |
| [source,none] |
| ---- |
| CREATE TABLE ks.tbl (k INT, c INT, v INT, PRIMARY KEY (k, c)) WITH read_repair = 'NONE'; |
| ---- |
| |
| == Read Repair Example |
| |
| To illustrate read repair with an example, consider that a client sends |
| a read request with read consistency level `TWO` to a 5-node cluster as |
| illustrated in Figure 1. Read consistency level determines how many |
| replica nodes must return a response before the read request is |
| considered successful. |
| |
| image::Figure_1_read_repair.jpg[image] |
| |
| Figure 1. Client sends read request to a 5-node Cluster |
| |
| Three nodes host replicas for the requested data as illustrated in |
| Figure 2. With a read consistency level of `TWO`, two replica nodes must |
| return a response for the read request to be considered successful. The |
| node that receives the client request acts as the coordinator. If the |
| coordinator itself hosts a replica of the requested data, a read request |
| needs to be sent to only one other replica node. If it does not host a |
| replica, it forwards the read request to nodes that do. A direct read |
| request is forwarded to the fastest node (as determined by the dynamic |
| snitch) as shown in Figure 2. A direct read request is a full read and |
| returns the requested data. |
| |
| image::Figure_2_read_repair.jpg[image] |
| |
| Figure 2. Direct Read Request sent to Fastest Replica Node |
| |
| Next, the coordinator node sends the requisite number of additional |
| requests to satisfy the consistency level, which is `TWO`. The |
| coordinator node needs to send one more read request for a total of two. |
| All read requests additional to the first direct read request are digest |
| read requests. A digest read request is not a full read and only returns |
| the hash value of the data. Only a hash value is returned to reduce the |
| network data traffic. In the example being discussed the coordinator |
| node sends one digest read request to a node hosting a replica as |
| illustrated in Figure 3. |
| |
| image::Figure_3_read_repair.jpg[image] |
| |
| Figure 3. Coordinator Sends a Digest Read Request |
| |
| The coordinator node has received a full copy of the data from one node |
| and a hash value for the data from another node. To compare the two |
| responses, a hash value is calculated for the full copy of the data and |
| compared with the digest. If the hash values are the same, no read |
| repair is needed and the full copy of the requested data is returned to |
| the client. The coordinator node performed a total of only two replica |
| read requests because the read consistency level is `TWO` in the |
| example. If the consistency level were higher, such as `THREE`, three |
| replica nodes would need to respond, and the data would be returned to |
| the client without a read repair only if every digest matched the hash |
| value of the full copy of the data. |
| |
| If the hash values from the digest read requests are not the same as |
| the hash value of the data from the full read request to the first |
| replica node, the replicas are inconsistent. To fix the inconsistency, a |
| read repair is performed. |
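| |
| The digest comparison can be sketched as follows. This is an |
| illustration only: the `digest` and `needs_read_repair` helpers are |
| hypothetical, and the use of MD5 over a `repr` of the sorted cells is an |
| assumption of the sketch, not a description of Cassandra's wire format. |
| |
```python
import hashlib


def digest(rows) -> bytes:
    """Hash a canonical encoding of the rows.

    MD5 and the repr-based encoding are assumptions made for this sketch.
    """
    canonical = repr(sorted(rows.items())).encode("utf-8")
    return hashlib.md5(canonical).digest()


def needs_read_repair(full_data, digests) -> bool:
    """Compare the hash of the full read with each digest response."""
    expected = digest(full_data)
    return any(d != expected for d in digests)


# Full data response from the first replica: column -> (value, timestamp).
full_response = {"v": (42, 10)}

# Digest responses from two other replicas, one in sync and one stale.
in_sync_digest = digest({"v": (42, 10)})
stale_digest = digest({"v": (41, 9)})

matching = needs_read_repair(full_response, [in_sync_digest])
mismatch = needs_read_repair(full_response, [in_sync_digest, stale_digest])
```
| |
| Here `matching` is `False` and `mismatch` is `True`: a single differing |
| digest is enough to signal that the replicas disagree. |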
| |
| For example, consider that the digest request returns a hash value that |
| is not the same as the hash value of the data from the direct full read |
| request. To make the replicas consistent, the coordinator node sends a |
| direct (full) read request to the replica node that it earlier sent a |
| digest read request to, as illustrated in Figure 4. |
| |
| image::Figure_4_read_repair.jpg[image] |
| |
| Figure 4. Coordinator sends Direct Read Request to Replica Node it had |
| sent Digest Read Request to |
| |
| After receiving the data from the second replica node, the coordinator |
| has data from two of the replica nodes. It needs only two replicas |
| because the read consistency level is `TWO` in the example. Data from |
| the two replicas is compared and, based on the timestamps, the most |
| recent data is selected. Data may need to be merged to construct an |
| up-to-date copy if one replica has newer data for only some of the |
| columns. In the example, the data from the first direct read request |
| (Replica 2) is found to be outdated and the data from the second full |
| read request (Replica 3) is the latest, so read repair needs to be |
| performed on Replica 2, as illustrated in Figure 5. If instead the |
| up-to-date data had to be constructed by merging the two replicas, a |
| read repair would be needed on both of the replicas involved. |
| |
| image::Figure_5_read_repair.jpg[image] |
| |
| Figure 5. Coordinator performs Read Repair |
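| |
| The timestamp-based reconciliation and the resulting repair writes can |
| be sketched as below. The `merge` and `repair_mutations` helpers and the |
| sample column values are hypothetical. The sketch keeps the first cell |
| seen when timestamps tie, which is a simplification. |
| |
```python
def merge(replica_rows):
    """Per column, keep the (value, timestamp) cell with the highest timestamp."""
    merged = {}
    for rows in replica_rows:
        for column, cell in rows.items():
            if column not in merged or cell[1] > merged[column][1]:
                merged[column] = cell
    return merged


def repair_mutations(replica_rows, merged):
    """For each replica, the cells it is missing or holds in an older version."""
    return [
        {column: cell for column, cell in merged.items() if rows.get(column) != cell}
        for rows in replica_rows
    ]


# Replica 2 is outdated in one column; Replica 3 holds the latest data.
replica_2 = {"name": ("alice", 1), "email": ("a@old.example", 1)}
replica_3 = {"name": ("alice", 1), "email": ("a@new.example", 2)}

merged = merge([replica_2, replica_3])
mutations = repair_mutations([replica_2, replica_3], merged)
# mutations[0] carries the newer email cell for Replica 2; mutations[1] is empty.
```
| |
| When each replica holds the newest cell for a different column, both |
| entries in `mutations` are non-empty, which corresponds to the case in |
| which both replicas involved need a repair. |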
| |
| The most up-to-date data is returned to the client as illustrated in |
| Figure 6. Of the three replicas, Replica 1 is never read and thus not |
| repaired. Replica 2 is repaired. Replica 3 holds the most up-to-date |
| data, which is returned to the client. |
| |
| image::Figure_6_read_repair.jpg[image] |
| |
| Figure 6. Most up-to-date Data returned to Client |
| |
| == Read Consistency Level and Read Repair |
| |
| The read consistency level largely determines whether a read repair can |
| be performed. As summarized in Table 1, a read repair is not performed |
| at every consistency level. |
| |
| Table 1. Read Repair based on Read Consistency Level |
| |
| [width="93%",cols="35%,65%",] |
| |=== |
| |Read Consistency Level |Description |
| |
| |ONE |Read repair is not performed as the data from the first direct |
| read request satisfies the consistency level ONE. No digest read |
| requests are involved for finding mismatches in data. |
| |
| |TWO |Read repair is performed if inconsistencies in data are found as |
| determined by the direct and digest read requests. |
| |
| |THREE |Read repair is performed if inconsistencies in data are found as |
| determined by the direct and digest read requests. |
| |
| |LOCAL_ONE |Read repair is not performed as the data from the direct |
| read request from the closest replica satisfies the consistency level |
| LOCAL_ONE. No digest read requests are involved for finding mismatches in |
| data. |
| |
| |LOCAL_QUORUM |Read repair is performed if inconsistencies in data are |
| found as determined by the direct and digest read requests. |
| |
| |QUORUM |Read repair is performed if inconsistencies in data are found |
| as determined by the direct and digest read requests. |
| |=== |
| |
| If read repair is performed, it is applied only to the replicas that are |
| involved in the read request and are not up-to-date. The number of |
| replicas involved in a read request is determined by the read |
| consistency level; in the example it is two. |
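| |
| Table 1 can be restated as a small lookup. The function names |
| `block_for` and `can_detect_mismatch` are hypothetical, and the sketch |
| assumes that `replication_factor` is the replication factor that applies |
| to the consistency level (for `LOCAL_QUORUM`, that of the local |
| datacenter). |
| |
```python
def block_for(consistency_level: str, replication_factor: int) -> int:
    """Number of replica responses the read waits for (hypothetical helper)."""
    majority = replication_factor // 2 + 1
    required = {
        "ONE": 1,
        "LOCAL_ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": majority,
        "LOCAL_QUORUM": majority,
    }
    return required[consistency_level]


def can_detect_mismatch(consistency_level: str, replication_factor: int) -> bool:
    """A mismatch can only be seen when more than one replica is compared."""
    return block_for(consistency_level, replication_factor) > 1


if __name__ == "__main__":
    for level in ("ONE", "TWO", "THREE", "LOCAL_ONE", "LOCAL_QUORUM", "QUORUM"):
        print(level, block_for(level, 3), can_detect_mismatch(level, 3))
```
| |
| With a single response there is nothing to compare against, which |
| matches the `ONE` and `LOCAL_ONE` rows of the table. |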
| |
| == Improved Read Repair Blocking Behavior in Cassandra 4.0 |
| |
| Cassandra 4.0 makes two improvements to read repair blocking behavior |
| (https://issues.apache.org/jira/browse/CASSANDRA-10726[CASSANDRA-10726]). |
| |
| [arabic] |
| . Speculative Retry of Full Data Read Requests. Cassandra 4.0 makes use |
| of speculative retry in sending read requests (full, not digest) to |
| replicas if a full data response is not received, whether in the initial |
| full read request or a full data read request during read repair. With |
| speculative retry, if it looks like a response may not be received from |
| the initial set of replicas that Cassandra contacted to satisfy the |
| consistency level, it speculatively sends additional read requests to |
| replicas that have not yet been contacted. Cassandra 4.0 will also |
| speculatively send a repair mutation to a minority of nodes not involved |
| in the read repair data read / write cycle, with the combined contents |
| of all un-acknowledged mutations, if it looks like one may not respond. |
| Cassandra accepts acks from them in lieu of acks from the initial |
| mutations sent out, so long as it receives the same number of acks as |
| repair mutations transmitted. |
| . Only Blocks on Full Data Responses to Satisfy the Consistency Level. |
| Cassandra 4.0 blocks only for what is needed to resolve the digest |
| mismatch and waits for enough full data responses to meet the |
| consistency level, regardless of whether speculative retry or read |
| repair chance is involved. As an example, if it looks like Cassandra |
| might not receive full data responses from everyone in time, it sends |
| additional requests to replicas not contacted in the initial full data |
| read. If the nodes that end up responding in time agree on the data, the |
| response from the disagreeing replica that started the read repair is |
| not considered and won't be included in the response to the client, |
| preserving the expectation of monotonic quorum reads. |
| |
| == Diagnostic Events for Read Repairs |
| |
| Cassandra 4.0 adds diagnostic events for read repair |
| (https://issues.apache.org/jira/browse/CASSANDRA-14668[CASSANDRA-14668]) |
| that can be used for exposing information such as: |
| |
| * Contacted endpoints |
| * Digest responses by endpoint |
| * Affected partition keys |
| * Speculated reads / writes |
| * Update oversized |
| |
| == Background Read Repair |
| |
| Background read repair, which was configured using the |
| `read_repair_chance` and `dclocal_read_repair_chance` table options, is |
| removed in Cassandra 4.0 |
| (https://issues.apache.org/jira/browse/CASSANDRA-13910[CASSANDRA-13910]). |
| |
| Read repair is not an alternative to other kinds of repair, such as full |
| repairs, or to replacing a node that keeps failing. Even after a read |
| repair has been performed, the data returned may not be the most |
| up-to-date data unless the consistency level requires a response from |
| all replicas. |