Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Kudu Consensus Implementation
=============================
Kudu implements the RAFT consensus protocol, with minor modifications as
described in this document. For the details of the write-ahead-log (WAL) message
format, see the README file in this directory. Kudu maintains one RAFT config
per tablet.
RAFT describes how to ensure that additions to the leader's WAL get replicated
to the rest of the config, and when it's safe to apply (commit) those changes to
the "state machine" (the database).
Overview of the RAFT protocol
-----------------------------
While the RAFT paper [1] remains the primary specification of the protocol, a
brief outline of important points follows here, for quick reference:
- RAFT has a strong leader for each config.
- A majority of nodes in the config must agree for a write to occur or a leader
to be elected. Every time a leader election takes place, the monotonic term
number is incremented (see currentTerm below).
- There are three roles a RAFT node can assume: leader, follower, or candidate.
Kudu explicitly specifies an additional learner role (not yet implemented),
which is a non-voting role.
- The WAL index is an absolute index into the log. If two logs contain an entry
with the same index and term, then the logs are identical in all entries
up through the given index. This is called the Log Matching property
(abbreviated "LMP" in some parts of the code).
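To make the Log Matching property concrete, below is a minimal sketch (in C++,
with illustrative names; Kudu's real log lives in log.cc and is considerably
more involved) of the check a follower performs before accepting a batch of
entries that claims to follow the entry (prev_index, prev_term):

  #include <cstdint>
  #include <map>

  // Hypothetical in-memory stand-in for a WAL: log index -> term of the
  // entry stored at that index.
  struct SketchLog {
    std::map<int64_t, int64_t> term_by_index;

    // True iff this log holds an entry at 'index' written in 'term'. By the
    // Log Matching property, a match implies both logs agree on every entry
    // up through 'index'.
    bool Matches(int64_t index, int64_t term) const {
      auto it = term_by_index.find(index);
      return it != term_by_index.end() && it->second == term;
    }
  };

  // Follower-side LMP check: a batch claiming to follow the entry
  // (prev_index, prev_term) may only be appended if that entry is present.
  bool CheckLogMatching(const SketchLog& log,
                        int64_t prev_index, int64_t prev_term) {
    if (prev_index == 0) return true;  // batch starts at the head of the log
    return log.Matches(prev_index, prev_term);
  }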
The following pieces of state are durable (updated before responding to RPCs):
- The write-ahead log (WAL).
- currentTerm: latest term the server has seen, initialized to 0.
- votedFor: candidate that received this replica's vote for the current term,
or null if none.
The WAL is implemented by log.cc. currentTerm and votedFor are part of the
ConsensusMetadataPB, which is implemented in consensus_meta.cc.
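As an illustration only (the real fields are defined in the protobuf
ConsensusMetadataPB, and the persistence logic lives in consensus_meta.cc), the
durable vote state and its ordering invariant can be sketched as:

  #include <cstdint>
  #include <string>

  // Hypothetical sketch of the durable consensus metadata. What matters is
  // the ordering invariant: any change to these fields must be flushed to
  // stable storage before the replica responds to the RPC that caused it.
  struct ConsensusMetadataSketch {
    int64_t current_term = 0;  // latest term this replica has seen
    std::string voted_for;     // UUID voted for in current_term; empty if none

    // Stand-in for a durable write (an fsync'd file, etc.).
    void Flush() { /* persist current_term and voted_for */ }
  };

  // Update path when casting a vote:
  //   meta.current_term = candidate_term;
  //   meta.voted_for    = candidate_uuid;
  //   meta.Flush();         // must complete before...
  //   RespondToVoteRpc();   // ...the reply leaves this node.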
Appending to the log:
------------------------------
Replicas implement an RPC method called UpdateConsensus, which allows a
leader to replicate a batch of log entries to the follower. Only a
leader may call this RPC method, and a follower will only accept an
UpdateConsensus call with a term equal to or higher than its
currentTerm.
[RAFT calls this RPC AppendEntries(). See RAFT figure 2 for details.]
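A minimal sketch of the follower-side term handling, with illustrative names
(the real handler also performs the Log Matching check sketched earlier, among
other things):

  #include <cstdint>

  // Illustrative subset of an UpdateConsensus exchange.
  struct UpdateRequest  { int64_t caller_term; /* entries, prev index/term... */ };
  struct UpdateResponse { int64_t responder_term; bool accepted; };

  UpdateResponse HandleUpdateConsensus(int64_t* current_term,
                                       const UpdateRequest& req) {
    UpdateResponse resp;
    if (req.caller_term < *current_term) {
      resp.accepted = false;  // stale leader: reject so it steps down
    } else {
      *current_term = req.caller_term;  // adopt the newer term (flush it
                                        // durably before replying)
      resp.accepted = true;  // still subject to the LMP check, elided here
    }
    resp.responder_term = *current_term;
    return resp;
  }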
Leader election:
------------------------------
Replicas also implement an RPC method called RequestConsensusVote, which is
invoked by candidates to gather votes (RAFT sec 5.2). See RAFT figure 2 for
details.
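Sketched with illustrative names, the grant rule from RAFT figure 2 is: grant
the vote only if the caller's term is at least the local currentTerm, this
replica has not already voted for a different candidate in that term, and the
candidate's log is at least as up-to-date as the local log (RAFT sec 5.4.1):

  #include <cstdint>
  #include <string>

  struct VoteRequest {
    int64_t candidate_term;
    std::string candidate_uuid;
    int64_t last_log_index;  // index of the candidate's last log entry
    int64_t last_log_term;   // term of the candidate's last log entry
  };

  // Durable voter state (see the metadata sketch above); any change must be
  // flushed before the response is sent.
  struct VoterState {
    int64_t current_term = 0;
    std::string voted_for;  // empty if no vote cast in current_term
    int64_t last_log_index = 0;
    int64_t last_log_term = 0;
  };

  bool ShouldGrantVote(VoterState* s, const VoteRequest& req) {
    if (req.candidate_term < s->current_term) return false;  // stale candidate
    if (req.candidate_term > s->current_term) {
      s->current_term = req.candidate_term;  // newer term voids any old vote
      s->voted_for.clear();
    }
    if (!s->voted_for.empty() && s->voted_for != req.candidate_uuid) {
      return false;  // already voted for someone else this term
    }
    // "At least as up-to-date": compare last terms, then log lengths.
    bool up_to_date =
        req.last_log_term > s->last_log_term ||
        (req.last_log_term == s->last_log_term &&
         req.last_log_index >= s->last_log_index);
    if (up_to_date) s->voted_for = req.candidate_uuid;
    return up_to_date;
  }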
RaftConfig membership changes:
------------------------------
A two-phase joint consensus protocol is used for making cluster membership
changes, including changing members of the config or the size of the majority.
See RAFT section 6 and figure 11 for details.
NOTE: membership changes are not yet implemented in Kudu.
Leader election modifications (future work)
---------------------------------------------
The Kudu leader election protocol proceeds exactly as specified in the extended
version of the RAFT paper [1], sections 5.2 - 5.3, with the following
exceptions:
- Timestamps issued by config leaders must be disjoint, as specified in the
Google Spanner paper [2] in section 4.1.1 and Appendix A. Therefore, when a
new leader is elected, the leader must do a "commit-wait" style pause before
replicating new writes.
NOTE: this is not yet implemented. See KUDU-430.
- To force acceptance of a particular node as leader (for load-balancing
purposes by the Master), the tablet may be quiesced until the specified node
has an up-to-date WAL, at which time that node becomes a candidate and starts
an election. Since other nodes are prohibited from becoming candidates during
this time, either this target node will win the election or the attempt to
force this configuration will time out and fail.
NOTE: this is not implemented!
- To prevent "flapping" in the case of a flaky connection, an additional step is
added to the leader election protocol. If a node detects that the config
leader failed, instead of immediately starting an election, it will
periodically query the existing nodes in the config to determine whether they
see that the leader has failed as well. If a majority of the config (including
the node itself) responds that the leader appears to have failed, the node
becomes a candidate and starts an election. If not, the node backs off and
retries later, without updating its currentTerm field.
NOTE: not yet implemented (KUDU-562)
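A sketch of that anti-flapping check, assuming peers can be polled for their
view of the leader (all names are illustrative, since this is not yet
implemented):

  #include <cstddef>
  #include <vector>

  // Stand-in for asking a peer whether the leader looks failed; in practice
  // this would be an RPC with a timeout.
  struct PeerView {
    bool leader_appears_failed = false;
  };

  // True if this node should become a candidate: a strict majority of the
  // config, counting this node (which already believes the leader failed),
  // must agree. Otherwise the node backs off without touching currentTerm.
  bool ShouldStartElection(const std::vector<PeerView>& other_peers) {
    size_t config_size = other_peers.size() + 1;  // peers plus this node
    size_t failures_seen = 1;                     // this node's own observation
    for (const PeerView& p : other_peers) {
      if (p.leader_appears_failed) ++failures_seen;
    }
    return failures_seen > config_size / 2;
  }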
Corner cases and other notes
----------------------------
Various parts of RAFT are subtle and worth mentioning explicitly:
- Terms are monotonic and the only way a term may appear in a log is if a leader
won an election and then wrote it there.
- The current term is passed with every RPC request and response, and if any
replica sees a request or response with a term higher than its currentTerm, it
steps down (if it is the leader) and updates its own currentTerm to match. This
is the motivation for the planned anti-flapping modification to the leader
election protocol noted above.
- As mentioned in the RAFT paper (sec 5.4.2), as well as in the CSAIL DSRG blog
[3], there are restrictions on when a log entry from a previous term may be
considered committed. Because of the Log Matching property, an entry written by
a previous leader may be overwritten even after it has been replicated to a
majority. To work around this issue, replicas may only commit log entries from
a previous leader's term once the current leader has successfully replicated a
round of log entries for its own current term.
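The resulting commit rule can be sketched as follows (illustrative names;
match_index holds, for each replica in the config, the highest log index known
to be replicated on it, and log indices start at 1):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Returns the highest index that is safe to commit, or -1 if none.
  // 'term_by_index[i]' is the term of the leader's log entry at index i.
  int64_t SafeCommitIndex(int64_t leader_term,
                          const std::vector<int64_t>& term_by_index,
                          std::vector<int64_t> match_index) {
    if (match_index.empty()) return -1;
    // After sorting, the element at position (n-1)/2 is the highest index
    // replicated on a strict majority of the n replicas.
    std::sort(match_index.begin(), match_index.end());
    int64_t majority_index = match_index[(match_index.size() - 1) / 2];
    // Commit only if that entry carries the leader's own term; an
    // earlier-term entry could still be overwritten (RAFT figure 8).
    if (majority_index >= 1 &&
        majority_index < static_cast<int64_t>(term_by_index.size()) &&
        term_by_index[majority_index] == leader_term) {
      return majority_index;
    }
    return -1;  // nothing new can be committed yet
  }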
References
----------
[1] RAFT: In Search of an Understandable Consensus Algorithm (Extended Version).
Ongaro & Ousterhout. http://ramcloud.stanford.edu/raft.pdf
[2] Spanner: Google’s globally distributed database. Corbett, et al. OSDI '12.
http://research.google.com/archive/spanner-osdi2012.pdf
[3] http://pdos.csail.mit.edu/dsrg/blog/2013/05/23/raft/