Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Kudu Consensus Implementation
=============================
Kudu implements the RAFT consensus protocol, with minor modifications as
described in this document. For the details of the write-ahead-log (WAL) message
format, see the README file in this directory. Kudu maintains one RAFT config
per tablet.
RAFT describes how to ensure that additions to the leader's WAL get replicated
to the rest of the config, and when it's safe to apply (commit) those changes to
the "state machine" (the database).
Overview of the RAFT protocol
-----------------------------
While the RAFT paper [1] remains the primary specification of the protocol, a
brief outline of important points follows here, for quick reference:
- RAFT has a strong leader for each config.
- A majority of nodes in the config must agree for a write to occur or a leader
to be elected. Every time a leader election takes place, the monotonic term
number is incremented (see currentTerm below).
- There are three roles a RAFT node can assume: leader, follower, or candidate.
Kudu explicitly specifies an additional learner role (not yet implemented),
which is a non-voting role.
- The WAL index is an absolute index into the log. If two logs contain an entry
with the same index and term, then the logs are identical in all entries
up through the given index. This is called the Log Matching property
(abbreviated "LMP" in some parts of the code).
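To make the Log Matching property concrete, below is a minimal sketch (in C++,
with illustrative names; Kudu's real log lives in log.cc and is considerably
more involved) of the check a follower performs before accepting a batch of
entries that claims to follow the entry (prev_index, prev_term):

  #include <cstdint>
  #include <map>

  // Hypothetical in-memory stand-in for a WAL: log index -> term of the
  // entry stored at that index.
  struct SketchLog {
    std::map<int64_t, int64_t> term_by_index;

    // True iff this log holds an entry at 'index' written in 'term'. By the
    // Log Matching property, a match implies both logs agree on every entry
    // up through 'index'.
    bool Matches(int64_t index, int64_t term) const {
      auto it = term_by_index.find(index);
      return it != term_by_index.end() && it->second == term;
    }
  };

  // Follower-side LMP check: a batch claiming to follow the entry
  // (prev_index, prev_term) may only be appended if that entry is present.
  bool CheckLogMatching(const SketchLog& log,
                        int64_t prev_index, int64_t prev_term) {
    if (prev_index == 0) return true;  // batch starts at the head of the log
    return log.Matches(prev_index, prev_term);
  }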
The following pieces of state are durable (updated before responding to RPCs):
- The write-ahead log (WAL).
- currentTerm: latest term the server has seen, initialized to 0.
- votedFor: candidate that received this replica's vote for the current term,
or null if none.
The WAL is implemented by log.cc. currentTerm and votedFor are part of the
ConsensusMetadataPB, which is implemented in consensus_meta.cc.
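As an illustration only (the real fields are defined in the protobuf
ConsensusMetadataPB, and the persistence logic lives in consensus_meta.cc), the
durable vote state and its ordering invariant can be sketched as:

  #include <cstdint>
  #include <string>

  // Hypothetical sketch of the durable consensus metadata. What matters is
  // the ordering invariant: any change to these fields must be flushed to
  // stable storage before the replica responds to the RPC that caused it.
  struct ConsensusMetadataSketch {
    int64_t current_term = 0;  // latest term this replica has seen
    std::string voted_for;     // UUID voted for in current_term; empty if none

    // Stand-in for a durable write (an fsync'd file, etc.).
    void Flush() { /* persist current_term and voted_for */ }
  };

  // Update path when casting a vote:
  //   meta.current_term = candidate_term;
  //   meta.voted_for    = candidate_uuid;
  //   meta.Flush();         // must complete before...
  //   RespondToVoteRpc();   // ...the reply leaves this node.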
Appending to the log:
------------------------------
Replicas implement an RPC method called UpdateConsensus, which allows a
leader to replicate a batch of log entries to the follower. Only a
leader may call this RPC method, and a follower will only accept an
UpdateConsensus call with a term equal to or higher than its
currentTerm.
[RAFT calls this RPC AppendEntries(). See RAFT figure 2 for details.]
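A minimal sketch of the follower-side term handling, with illustrative names
(the real handler also performs the Log Matching check sketched earlier, among
other things):

  #include <cstdint>

  // Illustrative subset of an UpdateConsensus exchange.
  struct UpdateRequest  { int64_t caller_term; /* entries, prev index/term... */ };
  struct UpdateResponse { int64_t responder_term; bool accepted; };

  UpdateResponse HandleUpdateConsensus(int64_t* current_term,
                                       const UpdateRequest& req) {
    UpdateResponse resp;
    if (req.caller_term < *current_term) {
      resp.accepted = false;  // stale leader: reject so it steps down
    } else {
      *current_term = req.caller_term;  // adopt the newer term (flush it
                                        // durably before replying)
      resp.accepted = true;  // still subject to the LMP check, elided here
    }
    resp.responder_term = *current_term;
    return resp;
  }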
Leader election:
------------------------------
Replicas also implement an RPC method called RequestConsensusVote, which is
invoked by candidates to gather votes (RAFT sec 5.2). See RAFT figure 2 for
details.
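Sketched with illustrative names, the grant rule from RAFT figure 2 is: grant
the vote only if the caller's term is at least the local currentTerm, this
replica has not already voted for a different candidate in that term, and the
candidate's log is at least as up-to-date as the local log (RAFT sec 5.4.1):

  #include <cstdint>
  #include <string>

  struct VoteRequest {
    int64_t candidate_term;
    std::string candidate_uuid;
    int64_t last_log_index;  // index of the candidate's last log entry
    int64_t last_log_term;   // term of the candidate's last log entry
  };

  // Durable voter state (see the metadata sketch above); any change must be
  // flushed before the response is sent.
  struct VoterState {
    int64_t current_term = 0;
    std::string voted_for;  // empty if no vote cast in current_term
    int64_t last_log_index = 0;
    int64_t last_log_term = 0;
  };

  bool ShouldGrantVote(VoterState* s, const VoteRequest& req) {
    if (req.candidate_term < s->current_term) return false;  // stale candidate
    if (req.candidate_term > s->current_term) {
      s->current_term = req.candidate_term;  // newer term voids any old vote
      s->voted_for.clear();
    }
    if (!s->voted_for.empty() && s->voted_for != req.candidate_uuid) {
      return false;  // already voted for someone else this term
    }
    // "At least as up-to-date": compare last terms, then log lengths.
    bool up_to_date =
        req.last_log_term > s->last_log_term ||
        (req.last_log_term == s->last_log_term &&
         req.last_log_index >= s->last_log_index);
    if (up_to_date) s->voted_for = req.candidate_uuid;
    return up_to_date;
  }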
RaftConfig membership changes:
------------------------------
A two-phase joint consensus protocol is used for making cluster membership
changes, including changing members of the config or the size of the majority.
See RAFT section 6 and figure 11 for details.
NOTE: membership changes are not yet implemented in Kudu.
Leader election modifications (future work)
---------------------------------------------
The Kudu leader election protocol proceeds exactly as specified in the extended
version of the RAFT paper [1], sections 5.2 - 5.3, with the following
exceptions:
- Timestamps issued by config leaders must be disjoint, as specified in the
Google Spanner paper [2] in section 4.1.1 and Appendix A. Therefore, when a
new leader is elected, the leader must do a "commit-wait" style pause before
replicating new writes.
NOTE: this is not yet implemented. See KUDU-430.
- To force acceptance of a particular node as leader (for load-balancing
purposes by the Master), the tablet may be quiesced until the specified node
has an up-to-date WAL, at which time that node becomes a candidate and starts
an election. Since other nodes are prohibited from becoming candidates during
this time, either this target node will win the election or the attempt to
force this configuration will time out and fail.
NOTE: this is not implemented!
- To prevent "flapping" in the case of a flaky connection, an additional step is
added to the leader election protocol. If a node detects that the config
leader failed, instead of immediately starting an election, it will
periodically query the existing nodes in the config to determine whether they
see that the leader has failed as well. If a majority of the config (including
the node itself) responds that the leader appears to have failed, the node
becomes a candidate and starts an election. If not, the node backs off and
retries later, without updating its currentTerm field.
NOTE: not yet implemented (KUDU-562)
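A sketch of that anti-flapping check, assuming peers can be polled for their
view of the leader (all names are illustrative, since this is not yet
implemented):

  #include <cstddef>
  #include <vector>

  // Stand-in for asking a peer whether the leader looks failed; in practice
  // this would be an RPC with a timeout.
  struct PeerView {
    bool leader_appears_failed = false;
  };

  // True if this node should become a candidate: a strict majority of the
  // config, counting this node (which already believes the leader failed),
  // must agree. Otherwise the node backs off without touching currentTerm.
  bool ShouldStartElection(const std::vector<PeerView>& other_peers) {
    size_t config_size = other_peers.size() + 1;  // peers plus this node
    size_t failures_seen = 1;                     // this node's own observation
    for (const PeerView& p : other_peers) {
      if (p.leader_appears_failed) ++failures_seen;
    }
    return failures_seen > config_size / 2;
  }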
Corner cases and other notes
----------------------------
Various parts of RAFT are subtle and worth mentioning explicitly:
- Terms are monotonic and the only way a term may appear in a log is if a leader
won an election and then wrote it there.
- The current term is passed with every RPC request and response, and if any
replica sees a request or response with a term higher than its currentTerm, it
steps down (if it is the leader) and updates its own currentTerm to match. This
is the motivation for the planned anti-flapping modification to the leader
election protocol noted above.
- As mentioned in the RAFT paper (sec 5.4.2), as well as in the CSAIL DSRG blog
[3], there are restrictions on when a log entry from a previous term may be
considered committed. Because of the Log Matching property, an entry written by
a previous leader may be overwritten even after it has been replicated to a
majority. To work around this issue, replicas may only commit log entries from
a previous leader's term once the current leader has successfully replicated a
round of log entries for its own current term.
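The resulting commit rule can be sketched as follows (illustrative names;
match_index holds, for each replica in the config, the highest log index known
to be replicated on it, and log indices start at 1):

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Returns the highest index that is safe to commit, or -1 if none.
  // 'term_by_index[i]' is the term of the leader's log entry at index i.
  int64_t SafeCommitIndex(int64_t leader_term,
                          const std::vector<int64_t>& term_by_index,
                          std::vector<int64_t> match_index) {
    if (match_index.empty()) return -1;
    // After sorting, the element at position (n-1)/2 is the highest index
    // replicated on a strict majority of the n replicas.
    std::sort(match_index.begin(), match_index.end());
    int64_t majority_index = match_index[(match_index.size() - 1) / 2];
    // Commit only if that entry carries the leader's own term; an
    // earlier-term entry could still be overwritten (RAFT figure 8).
    if (majority_index >= 1 &&
        majority_index < static_cast<int64_t>(term_by_index.size()) &&
        term_by_index[majority_index] == leader_term) {
      return majority_index;
    }
    return -1;  // nothing new can be committed yet
  }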
References
----------
[1] RAFT: In Search of an Understandable Consensus Algorithm (Extended Version).
Ongaro & Ousterhout. http://ramcloud.stanford.edu/raft.pdf
[2] Spanner: Google’s globally distributed database. Corbett, et al. OSDI '12.
http://research.google.com/archive/spanner-osdi2012.pdf
[3] http://pdos.csail.mit.edu/dsrg/blog/2013/05/23/raft/