| = Transient Replication |
| |
| *Note*: |
| |
| Transient Replication |
| (https://issues.apache.org/jira/browse/CASSANDRA-14404[CASSANDRA-14404]) |
| is an experimental feature designed for expert Apache Cassandra users |
| who are able to validate every aspect of the database for their |
| application and deployment. That means being able to check that |
| operations like reads, writes, decommission, remove, rebuild, repair, |
| and replace all work with your queries, data, configuration, operational |
| practices, and availability requirements. Apache Cassandra 4.0 has the |
| initial implementation of transient replication. Future releases of |
| Cassandra will make this feature suitable for a wider audience. It is |
| anticipated that a future version will support monotonic reads with |
| transient replication as well as LWT, logged batches, and counters. |
Being experimental, transient replication is *not* recommended for
production use.
| |
| == Objective |
| |
| The objective of transient replication is to decouple storage |
| requirements from data redundancy (or consensus group size) using |
| incremental repair, in order to reduce storage overhead. Certain nodes |
| act as full replicas (storing all the data for a given token range), and |
| some nodes act as transient replicas, storing only unrepaired data for |
| the same token ranges. |
| |
The optimization that transient replication makes possible is called
"cheap quorums": data redundancy is increased without a corresponding
increase in storage usage.
| |
| Transient replication is useful when sufficient full replicas are |
| unavailable to receive and store all the data. Transient replication |
| allows you to configure a subset of replicas to only replicate data that |
| hasn't been incrementally repaired. As an optimization, we can avoid |
| writing data to a transient replica if we have successfully written data |
| to the full replicas. |
| |
| After incremental repair, transient data stored on transient replicas |
| can be discarded. |
| |
| == Enabling Transient Replication |
| |
Transient replication is not enabled by default. It must be enabled
separately on each node in a cluster by setting the following
configuration property in `cassandra.yaml`:
| |
| .... |
| enable_transient_replication: true |
| .... |
| |
| Transient replication may be configured with both `SimpleStrategy` and |
| `NetworkTopologyStrategy`. Transient replication is configured by |
| setting replication factor as `<total_replicas>/<transient_replicas>`. |
| |
As an example, create a keyspace with a total of 4 replicas, 1 of which
is transient:

....
CREATE KEYSPACE CassandraKeyspaceSimple WITH replication = {'class': 'SimpleStrategy',
'replication_factor' : '4/1'};
....
| |
As another example, the keyspace `some_keyspace` will have 3 replicas in
DC1, 1 of which is transient, and 5 replicas in DC2, 2 of which are
transient:

....
CREATE KEYSPACE some_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1' : '3/1', 'DC2' : '5/2'};
....
| |
| Transiently replicated keyspaces only support tables with `read_repair` |
| set to `NONE`. |
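
For example, a table in a transiently replicated keyspace might be
created as follows (the keyspace, table, and column names here are
illustrative):

....
CREATE TABLE some_keyspace.users (
    id uuid PRIMARY KEY,
    name text
) WITH read_repair = 'NONE';
....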
| |
| Important Restrictions: |
| |
| * RF cannot be altered while some endpoints are not in a normal state |
| (no range movements). |
* You cannot add full replicas while any transient replicas exist. You
must first remove all transient replicas, then change the number of full
replicas, then add back the transient replicas.
* You can only safely increase the number of transient replicas one at a
time, running incremental repair between each change.
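
Following the restrictions above, safely increasing the number of
transient replicas one at a time might look like this (the keyspace name
and DC layout are illustrative):

....
ALTER KEYSPACE some_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1' : '5/1'};

-- run incremental repair before the next increase,
-- e.g. on each node: nodetool repair some_keyspace

ALTER KEYSPACE some_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1' : '5/2'};
....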
| |
| Additionally, transient replication cannot be used for: |
| |
| * Monotonic Reads |
| * Lightweight Transactions (LWTs) |
| * Logged Batches |
| * Counters |
| * Keyspaces using materialized views |
| * Secondary indexes (2i) |
| |
| == Cheap Quorums |
| |
Cheap quorums are a set of write-path optimizations that avoid writing
to transient replicas unless sufficient full replicas are unavailable to
satisfy the requested consistency level. Hints are never written for
transient replicas. Optimizations on the read path prefer reading from
transient replicas. When writing at quorum to a table configured to use
transient replication, the quorum always prefers available full replicas
over transient replicas, so that transient replicas don't have to
process writes. When full replicas are slow or unavailable, tail latency
is reduced by rapid write protection (similar to rapid read protection),
which sends writes to transient replicas. Transient replicas can serve
reads faster because, if they hold no data, they need to do nothing
beyond bloom filter checks. With vnodes and large cluster sizes,
transient replicas will not hold a large quantity of data, even when the
failure of one or more full replicas causes them to serve a steady
amount of write traffic for some of their transiently replicated ranges.
| |
| == Speculative Write Option |
| |
`CREATE TABLE` adds an option, `speculative_write_threshold`, for use
with transient replicas. The option is of type `simple`, with a default
value of `99PERCENTILE`. When replicas are slow or unresponsive,
`speculative_write_threshold` specifies the threshold at which a cheap
quorum write will be upgraded to include transient replicas.
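
For example, to lower the threshold so that writes speculate to
transient replicas sooner (the table name and threshold value are
illustrative):

....
ALTER TABLE some_keyspace.users WITH speculative_write_threshold = '75PERCENTILE';
....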
| |
| == Pending Ranges and Transient Replicas |
| |
Pending ranges refers to the movement of token ranges between transient
replicas. When a transient range is moved, there is a period of time
during which both transient replicas must receive any write intended for
the logical transient replica, so that after the movement takes effect a
read quorum is able to return a response. Nodes are _not_ temporarily
transient replicas during expansion. They stream data like a full
replica for the transient range before they can serve reads. A pending
state is incurred, similar to the pending state for full replicas.
Transient replicas also always receive writes while they are pending.
Pending transient ranges are sent a bit more data, and reading from them
is avoided.
| |
| == Read Repair and Transient Replicas |
| |
Read repair never attempts to repair a transient replica. Reads always
include at least one full replica, and otherwise prefer transient
replicas where possible. Range scans ensure that replica selection for
the entire scanned range satisfies the requirement that every range
scanned includes at least one full replica. During incremental and
validation repair, anti-compaction at transient replicas does not output
any data for transient ranges, as that data will be dropped after
repair; transient replicas never have data streamed to them.
| |
| == Transitioning between Full Replicas and Transient Replicas |
| |
The additional state transitions that transient replication introduces
require streaming and `nodetool cleanup` to behave differently. When
data is streamed, it is always streamed from a full replica, never from
a transient replica.
| |
Transitioning from not replicated to transiently replicated means that a
node must stay pending until the next incremental repair completes, at
which point the data for that range is known to be available at full
replicas.
| |
| Transitioning from transiently replicated to fully replicated requires |
| streaming from a full replica and is identical to how data is streamed |
| when transitioning from not replicated to replicated. The transition is |
| managed so the transient replica is not read from as a full replica |
| until streaming completes. It can be used immediately for a write |
| quorum. |
| |
| Transitioning from fully replicated to transiently replicated requires |
| cleanup to remove repaired data from the transiently replicated range to |
| reclaim space. It can be used immediately for a write quorum. |
| |
| Transitioning from transiently replicated to not replicated requires |
| cleanup to be run to remove the formerly transiently replicated data. |
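
For instance, once a range is no longer transiently replicated on a
node, cleanup can be run on that node to reclaim the space (the keyspace
name is illustrative):

....
nodetool cleanup some_keyspace
....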
| |
| When transient replication is in use ring changes are supported |
| including add/remove node, change RF, add/remove DC. |
| |
| == Transient Replication supports EACH_QUORUM |
| |
https://issues.apache.org/jira/browse/CASSANDRA-14727[CASSANDRA-14727]
adds Transient Replication support for `EACH_QUORUM`. Per
https://issues.apache.org/jira/browse/CASSANDRA-14768[CASSANDRA-14768],
we ensure that we write to at least a `QUORUM` of nodes in every DC,
regardless of how many responses we need to wait for and our requested
consistency level. This is to minimize surprise for users of transient
replication: with normal writes, we soft-ensure that we reach `QUORUM`
in all DCs we are able to by writing to every node; even if we don't
wait for ACKs, in both cases we have sent sufficient messages.