= Transient Replication
*Note*:
Transient Replication
(https://issues.apache.org/jira/browse/CASSANDRA-14404[CASSANDRA-14404])
is an experimental feature designed for expert Apache Cassandra users
who are able to validate every aspect of the database for their
application and deployment. That means being able to check that
operations like reads, writes, decommission, remove, rebuild, repair,
and replace all work with your queries, data, configuration, operational
practices, and availability requirements. Apache Cassandra 4.0 has the
initial implementation of transient replication. Future releases of
Cassandra will make this feature suitable for a wider audience. It is
anticipated that a future version will support monotonic reads with
transient replication as well as LWT, logged batches, and counters.
Being experimental, transient replication is *not* recommended for
production use.
== Objective
The objective of transient replication is to decouple storage
requirements from data redundancy (or consensus group size) using
incremental repair, in order to reduce storage overhead. Certain nodes
act as full replicas (storing all the data for a given token range), and
some nodes act as transient replicas, storing only unrepaired data for
the same token ranges.
The optimization that transient replication makes possible is called
"Cheap quorums": data redundancy can be increased without a
corresponding increase in storage usage.
Transient replication is useful when sufficient full replicas are
unavailable to receive and store all the data. Transient replication
allows you to configure a subset of replicas to only replicate data that
hasn't been incrementally repaired. As an optimization, we can avoid
writing data to a transient replica if we have successfully written data
to the full replicas.
After incremental repair, transient data stored on transient replicas
can be discarded.
== Enabling Transient Replication
Transient replication is not enabled by default. Transient replication
must be enabled on each node in a cluster separately by setting the
following configuration property in `cassandra.yaml`.
....
enable_transient_replication: true
....
Transient replication may be configured with both `SimpleStrategy` and
`NetworkTopologyStrategy`. Transient replication is configured by
setting the replication factor in the format
`<total_replicas>/<transient_replicas>`.
As an example, create a keyspace with a total of 4 replicas, 1 of which
is transient:
....
CREATE KEYSPACE CassandraKeyspaceSimple WITH replication = {'class': 'SimpleStrategy',
'replication_factor' : '4/1'};
....
As another example, the `some_keyspace` keyspace will have 3 replicas
in DC1, 1 of which is transient, and 5 replicas in DC2, 2 of which are
transient:
....
CREATE KEYSPACE some_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1' : '3/1', 'DC2' : '5/2'};
....
Transiently replicated keyspaces only support tables with `read_repair`
set to `NONE`.
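For example, a table in the transiently replicated keyspace created
above could be defined as follows; the table name and columns are
hypothetical, and only the `read_repair` setting is required:

....
CREATE TABLE some_keyspace.users (
    id uuid PRIMARY KEY,
    name text
) WITH read_repair = 'NONE';
....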
Important Restrictions:
* RF cannot be altered while some endpoints are not in a normal state
(no range movements).
* You can't add full replicas if there are any transient replicas. You
must first remove all transient replicas, then change the number of full
replicas, then add back the transient replicas.
* You can only safely increase the number of transient replicas one at a
time, with incremental repair run in between each increase (see the
sketch below).
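A minimal sketch of safely adding one transient replica at a time to the
`some_keyspace` keyspace defined above; the repair invocation is only an
example and depends on your operational tooling:

....
-- Add one transient replica in DC2 ('5/2' -> '6/3'); the number of full
-- replicas in each DC stays the same.
ALTER KEYSPACE some_keyspace WITH replication = {'class': 'NetworkTopologyStrategy',
'DC1' : '3/1', 'DC2' : '6/3'};

-- Run incremental repair (e.g. nodetool repair some_keyspace) and let it
-- complete before increasing the transient replica count again.
....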
Additionally, transient replication cannot be used for:
* Monotonic Reads
* Lightweight Transactions (LWTs)
* Logged Batches
* Counters
* Keyspaces using materialized views
* Secondary indexes (2i)
== Cheap Quorums
Cheap quorums are a set of optimizations on the write path to avoid
writing to transient replicas unless sufficient full replicas are not
available to satisfy the requested consistency level. Hints are never
written for transient replicas. Optimizations on the read path prefer
reading from transient replicas. When writing at quorum to a table
configured to use transient replication, the quorum will always prefer
available full replicas over transient replicas so that transient
replicas don't have to process writes. When full replicas are slow or
unavailable, rapid write protection (similar to rapid read protection)
reduces tail latency by sending writes to transient replicas.
Transient replicas can serve reads faster because, if they have no data,
they don't have to do anything beyond bloom filter checks. With vnodes
and large cluster sizes, transient replicas will not hold a large
quantity of data even when the failure of one or more full replicas
causes them to serve a steady amount of write traffic for some of their
transiently replicated ranges.
== Speculative Write Option
`CREATE TABLE` adds an option, `speculative_write_threshold`, for use
with transient replicas. The option is of type `simple` with a default
value of `99PERCENTILE`. When replicas are slow or unresponsive,
`speculative_write_threshold` specifies the threshold at which a cheap
quorum write will be upgraded to include transient replicas.
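As a sketch only, the threshold could be set on the hypothetical `users`
table from the earlier example; the option name follows this document
and may differ between releases:

....
-- Sketch: once a full replica is slower than the table's 99th percentile
-- write latency, the cheap quorum write is upgraded to include transient
-- replicas.
ALTER TABLE some_keyspace.users WITH speculative_write_threshold = '99PERCENTILE';
....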
== Pending Ranges and Transient Replicas
Pending ranges refers to the movement of token ranges between transient
replicas. When a transient range is moved, there will be a period of
time during which both transient replicas need to receive any write
intended for the logical transient replica, so that after the movement
takes effect a read quorum is able to return a response. Nodes are _not_
temporarily transient replicas during expansion. They stream data like a
full replica for the transient range before they can serve reads. A
pending state is incurred similar to how there is a pending state for
full replicas. Transient replicas also always receive writes when they
are pending. Pending transient ranges are sent a bit more data and
reading from them is avoided.
== Read Repair and Transient Replicas
Read repair never attempts to repair a transient replica. Reads will
always include at least one full replica. They should also prefer
transient replicas where possible. Range scans ensure the entire scanned
range performs replica selection that satisfies the requirement that
every range scanned includes one full replica. During incremental and
validation repair, anti-compaction at transient replicas does not output
any data for transient ranges, as that data will be dropped after
repair; transient replicas also never have data streamed to them.
== Transitioning between Full Replicas and Transient Replicas
The additional state transitions that transient replication introduces
require streaming and `nodetool cleanup` to behave differently. When
data is streamed, it is always streamed from a full replica, never from
a transient replica.
Transitioning from not replicated to transiently replicated means that a
node must stay pending until the next incremental repair completes at
which point the data for that range is known to be available at full
replicas.
Transitioning from transiently replicated to fully replicated requires
streaming from a full replica and is identical to how data is streamed
when transitioning from not replicated to replicated. The transition is
managed so the transient replica is not read from as a full replica
until streaming completes. It can be used immediately for a write
quorum.
Transitioning from fully replicated to transiently replicated requires
cleanup to remove repaired data from the transiently replicated range to
reclaim space. It can be used immediately for a write quorum.
Transitioning from transiently replicated to not replicated requires
cleanup to be run to remove the formerly transiently replicated data.
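For example, after a node's ranges move from fully replicated to
transiently replicated (or stop being replicated entirely), the formerly
repaired data can be reclaimed with cleanup; the keyspace name follows
the earlier examples:

....
# Run on the affected node to remove data it no longer fully replicates.
nodetool cleanup some_keyspace
....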
When transient replication is in use, ring changes are supported,
including adding/removing a node, changing the RF, and adding/removing a
DC.
== Transient Replication supports EACH_QUORUM
https://issues.apache.org/jira/browse/CASSANDRA-14727[CASSANDRA-14727]
adds Transient Replication support for `EACH_QUORUM`. Per
(https://issues.apache.org/jira/browse/CASSANDRA-14768[CASSANDRA-14768]),
we ensure we write to at least a `QUORUM` of nodes in every DC,
regardless of how many responses we need to wait for and our requested
consistency level. This is to avoid surprising users of transient
replication: with normal writes, we soft-ensure that we reach `QUORUM`
in all DCs we are able to by writing to every node; even if we don't
wait for an ACK, in both cases we have sent sufficient messages.
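A minimal cqlsh sketch of a write at `EACH_QUORUM`, reusing the
hypothetical `users` table from the earlier examples:

....
-- Set the consistency level for subsequent requests in this cqlsh session.
CONSISTENCY EACH_QUORUM;

-- The write is sent so that at least a QUORUM of nodes in every DC receive it.
INSERT INTO some_keyspace.users (id, name) VALUES (uuid(), 'alice');
....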