.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
..
..     http://www.apache.org/licenses/LICENSE-2.0
..
.. Unless required by applicable law or agreed to in writing, software
.. distributed under the License is distributed on an "AS IS" BASIS,
.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
.. See the License for the specific language governing permissions and
.. limitations under the License.

Dynamo
------

.. _gossip:

Gossip
^^^^^^

.. todo:: todo

Failure Detection
^^^^^^^^^^^^^^^^^

.. todo:: todo

Token Ring/Ranges
^^^^^^^^^^^^^^^^^

.. todo:: todo

.. _replication-strategy:

Replication
^^^^^^^^^^^

The replication strategy of a keyspace determines which nodes are replicas for a given token range. The two main
replication strategies are :ref:`simple-strategy` and :ref:`network-topology-strategy`.

.. _simple-strategy:

SimpleStrategy
~~~~~~~~~~~~~~

SimpleStrategy allows a single integer ``replication_factor`` to be defined. This determines the number of nodes that
should contain a copy of each row. For example, if ``replication_factor`` is 3, then three different nodes should store
a copy of each row.

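For example, a keyspace that keeps three copies of each row might be defined as follows (the keyspace name
``test_keyspace`` is purely illustrative)::

    CREATE KEYSPACE test_keyspace
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
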
SimpleStrategy treats all nodes identically, ignoring any configured datacenters or racks. To determine the replicas
for a token range, Cassandra iterates through the tokens in the ring, starting with the token range of interest. For
each token, it checks whether the node that owns it is already in the set of replicas; if it is not, that node is
added. This process continues until ``replication_factor`` distinct nodes have been added to the set of replicas. For
example, with a replication factor of 3, the replicas for a range are the node that owns it plus the next two distinct
nodes encountered while walking the ring.

.. _network-topology-strategy:

NetworkTopologyStrategy
~~~~~~~~~~~~~~~~~~~~~~~

NetworkTopologyStrategy allows a replication factor to be specified for each datacenter in the cluster. Even if your
cluster only uses a single datacenter, NetworkTopologyStrategy should be preferred over SimpleStrategy to make it
easier to add new physical or virtual datacenters to the cluster later.

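For example, a keyspace that keeps three replicas in each of two datacenters might be defined as follows (the keyspace
name is illustrative, and the datacenter names must match the names reported by the cluster's snitch)::

    CREATE KEYSPACE test_keyspace
        WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
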
In addition to allowing the replication factor to be specified per-DC, NetworkTopologyStrategy also attempts to choose
replicas within a datacenter from different racks. If the number of racks is greater than or equal to the replication
factor for the DC, each replica will be chosen from a different rack. Otherwise, each rack will hold at least one
replica, but some racks may hold more than one. Note that this rack-aware behavior has some potentially `surprising
implications <https://issues.apache.org/jira/browse/CASSANDRA-3810>`_. For example, if the racks do not contain equal
numbers of nodes, the data load on the smallest rack may be much higher. Similarly, if a single node is bootstrapped
into a new rack, it will be treated as a replica for the entire ring. For these reasons, many operators choose to
configure all nodes on a single "rack".

Tunable Consistency
^^^^^^^^^^^^^^^^^^^

Cassandra supports a per-operation tradeoff between consistency and availability through *Consistency Levels*.
Essentially, an operation's consistency level specifies how many of the replicas need to respond to the coordinator in
order to consider the operation a success.

The following consistency levels are available:

``ONE``
  Only a single replica must respond.

``TWO``
  Two replicas must respond.

``THREE``
  Three replicas must respond.

``QUORUM``
  A majority of the replicas must respond (``n / 2 + 1`` with integer division, where ``n`` is the number of replicas).

``ALL``
  All of the replicas must respond.

``LOCAL_QUORUM``
  A majority of the replicas in the local datacenter (whichever datacenter the coordinator is in) must respond.

``EACH_QUORUM``
  A majority of the replicas in each datacenter must respond.

``LOCAL_ONE``
  Only a single replica must respond. In a multi-datacenter cluster, this also guarantees that read requests are not
  sent to replicas in a remote datacenter.

``ANY``
  A single replica may respond, or the coordinator may store a hint. If a hint is stored, the coordinator will later
  attempt to replay the hint and deliver the mutation to the replicas. This consistency level is only accepted for
  write operations.

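Consistency levels are chosen by the client on a per-operation basis. For example, in ``cqlsh`` the level used for the
current session can be set with the ``CONSISTENCY`` command (the table ``test_keyspace.users`` is illustrative)::

    CONSISTENCY QUORUM;
    SELECT * FROM test_keyspace.users WHERE id = 1;

Client drivers expose the same choice, typically per session or per statement.
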
Write operations are always sent to all replicas, regardless of consistency level. The consistency level simply
controls how many responses the coordinator waits for before responding to the client.

For read operations, the coordinator generally only issues read commands to enough replicas to satisfy the consistency
level. There are two exceptions to this (see the configuration example below):

- Speculative retry may issue a redundant read request to an extra replica if the other replicas have not responded
  within a specified time window.
- Based on ``read_repair_chance`` and ``dclocal_read_repair_chance`` (part of a table's schema), read requests may be
  randomly sent to all replicas in order to repair potentially inconsistent data.

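Both mechanisms are configured per table. As a sketch (the table name is illustrative, and the option values are
examples rather than recommendations)::

    ALTER TABLE test_keyspace.users
        WITH speculative_retry = '99PERCENTILE'
        AND dclocal_read_repair_chance = 0.1
        AND read_repair_chance = 0.0;

Here ``speculative_retry = '99PERCENTILE'`` asks the coordinator to send a redundant request once a replica has been
slower than the table's 99th percentile read latency, while the two chance options set the probability of read repair
within the local datacenter and across all datacenters, respectively.
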
Picking Consistency Levels
~~~~~~~~~~~~~~~~~~~~~~~~~~

It is common to pick read and write consistency levels that are high enough to overlap, resulting in "strong"
consistency. This is typically expressed as ``W + R > RF``, where ``W`` is the number of replicas that must acknowledge
a write, ``R`` is the number of replicas that must respond to a read, and ``RF`` is the replication factor. For
example, if ``RF = 3``, a ``QUORUM`` request will
require responses from at least two of the three replicas. If ``QUORUM`` is used for both writes and reads, at least
one of the replicas is guaranteed to participate in *both* the write and the read request, which in turn guarantees
that the latest write will be read. In a multi-datacenter environment, ``LOCAL_QUORUM`` can be used to provide a
weaker but still useful guarantee: reads are guaranteed to see the latest write from within the same datacenter.

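Continuing the ``RF = 3`` example, writing and reading at ``QUORUM`` satisfies ``W + R > RF``, since ``2 + 2 > 3``. A
minimal ``cqlsh`` sketch (the keyspace and table are illustrative)::

    CONSISTENCY QUORUM;
    INSERT INTO test_keyspace.users (id, name) VALUES (1, 'alice');
    SELECT name FROM test_keyspace.users WHERE id = 1;

Because at least one replica participates in both the write and the read, the ``SELECT`` is guaranteed to observe the
preceding ``INSERT``.
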
If this type of strong consistency isn't required, lower consistency levels like ``ONE`` may be used to improve
throughput, latency, and availability.