Master & Tablet Copy integration with configuration change
This document contains information on implementing Tablet Copy in the context of Kudu's Raft implementation. Details on Raft config change in Kudu can be found in the Raft config change design doc.
The Master will expose the following operations to admin users:
AddReplica(tablet_id, TS to_add, Role role)
RemoveReplica(tablet_id, TS to_remove)
MoveReplica(tablet_id, TS from, TS to)
Tablet Copy allows a tablet snapshot to be moved to a new server. A StartTabletCopy() RPC call will be available on each tablet server. When a leader determines that a follower needs log entries prior to what is available on the leader side, or when it detects that a follower does not host a given tablet, it sends the follower an RPC instructing it to initiate Tablet Copy. Optionally, this call is made idempotent by passing the latest OpId in the follower's log as an argument.
Since copying a tablet may involve transferring many GB of data, we likely need to support operational visibility into ongoing Tablet Copy jobs, run them on their own thread pool, support cancellation, etc. TBD to enumerate all of this in detail.
A leader could cause a tablet to auto-create / auto-vivify itself if it doesn't already exist by sending a StartTabletCopy RPC to the tablet server. Not requiring the Master to explicitly invoke a CreateTablet() RPC before adding a replica to a consensus config makes adding a new peer much simpler to implement on the Master side.
In addition, allowing the leader to promote a "pre-follower" (PRE_VOTER) to follower (VOTER) once its log is sufficiently caught up with the leader's provides availability benefits, similar to the approach specified in the Raft dissertation (see Section 4.2.1).
With such a design, to add a new replica to a tablet's consensus config, a Master server only needs to send an RPC to the leader telling it to add the given node. The leader then takes care of detecting whether the tablet is out of date or does not exist, in which case it must be copied, or whether it can be caught up normally.
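The leader-side decision can be sketched as follows. This is an illustrative model, not Kudu's actual implementation; the function name and the "first retained index" parameter are hypothetical:

```python
# Hypothetical sketch of the leader's decision after adding a peer.
# follower_next_index = None models a follower that reported the tablet
# as DOES_NOT_EXIST or DELETED in its heartbeat response.
def needs_tablet_copy(follower_next_index, leader_first_retained_index):
    """Return True if the follower must be sent StartTabletCopy."""
    if follower_next_index is None:
        return True  # Tablet missing on follower: auto-vivify via Tablet Copy.
    # The follower needs entries the leader has already GC'd from its log.
    return follower_next_index < leader_first_retained_index

# Example: leader retains log entries from index 100 onward.
assert needs_tablet_copy(None, 100)     # tablet absent: copy
assert needs_tablet_copy(50, 100)       # too far behind: copy
assert not needs_tablet_copy(150, 100)  # can catch up via normal replication
```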
If a replica is subsequently removed from a tablet server, it must be tombstoned and retain its (now static) persistent Raft state forever. Not doing this can cause serious split-brain problems due to amnesia, which we define as losing data that was guaranteed to be persisted according to the consensus protocol. That situation is described in detail in this raft-dev mailing list thread.
To safely support deletion and Tablet Copy, a tablet's data can be in one of 4 states:
- DOES_NOT_EXIST (implicit; just the non-existence of state) means a tablet with that name has never been hosted on the server;
- DELETED means it's tombstoned;
- COPYING means it's in the process of Tablet Copy; and
- READY means it's in a normal, consistent state.
More details about tablet deletion are in a later section.
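The four states and the transitions implied by the protocol can be modeled as follows. The transition set is illustrative, inferred from the sections below, and is not an exhaustive statement of Kudu's behavior:

```python
from enum import Enum

# Model of the four tablet data states described above.
class TabletDataState(Enum):
    DOES_NOT_EXIST = 0  # Implicit: no superblock on disk.
    DELETED = 1         # Tombstoned; Raft metadata retained forever.
    COPYING = 2         # Tablet Copy in progress.
    READY = 3           # Normal, consistent state.

# Transitions implied by the protocol (illustrative, not exhaustive).
LEGAL_TRANSITIONS = {
    (TabletDataState.DOES_NOT_EXIST, TabletDataState.COPYING),  # auto-vivify
    (TabletDataState.DELETED, TabletDataState.COPYING),         # re-copy
    (TabletDataState.COPYING, TabletDataState.READY),           # copy finished
    (TabletDataState.COPYING, TabletDataState.DELETED),         # crash mid-copy
    (TabletDataState.READY, TabletDataState.DELETED),           # tombstone
}

def can_transition(frm, to):
    return (frm, to) in LEGAL_TRANSITIONS
```

Note that DELETED can never go directly back to READY; a tombstoned replica must pass through COPYING again.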
The Tablet Copy protocol between the leader and follower is as follows. The leader always starts by attempting to heartbeat to the follower:
```
Leader -> Follower:
  AppendEntries(from, to, term, prev_idx, ops)
```
The follower receives the request and responds normally to the heartbeat if the tablet is running, and with special responses if the tablet DOES_NOT_EXIST or is DELETED:

```
AppendEntries(from, to, term, prev_idx, ops):
  if (to != self.uuid): return ERR_INVALID_NAME
  if (term < self.current_term): return ERR_INVALID_TERM
  if (self.state == DELETED): return DELETED
  # Otherwise: normal consensus update / AppendEntries logic
```
If the leader gets back a DELETED tablet status, it will repeatedly attempt to "auto-vivify" the tablet on the follower by sending a StartTabletCopy RPC to the follower. On the follower, the StartTabletCopy RPC is idempotent w.r.t. repeated RPC requests from the leader and has logic to create a tablet if it doesn't yet exist. Roughly:
```
StartTabletCopy(from, to, tablet, current_state, last_opid_in_log = NULL):
  if (to != self.uuid): return ERR_INVALID_NAME
  if (this.tablet.state == COPYING): return ERR_ALREADY_INPROGRESS
  if (this.tablet.state != current_state): return ERR_ILLEGAL_STATE
  if (this.tablet.state == RUNNING): DeleteTablet()  # Quarantine the tablet data.
  if (this.tablet.state == DELETED || this.tablet.state == DOES_NOT_EXIST):
    CreateTablet(COPYING)  # Create tablet in "COPYING" mode.
  if (caller.term < self.term): return ERR_BAD_TERM
  if (last_opid_in_log != NULL && last_opid_in_log != this.log.last_op.id):
    return ERR_ILLEGAL_STATE
  RunTabletCopy()  # Download the tablet data.
```
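The pseudocode above can be rendered as a small runnable model. The class shape and error strings are this sketch's inventions, and unlike the pseudocode it performs all validation checks before mutating any state, which avoids quarantining a tablet on behalf of a stale-term caller:

```python
# Runnable model of the StartTabletCopy handler; illustrative only.
DOES_NOT_EXIST, DELETED, COPYING, RUNNING = range(4)

class TabletReplica:
    def __init__(self, uuid, state, term, last_opid):
        self.uuid = uuid
        self.state = state          # One of the four data states above.
        self.term = term            # Current Raft term.
        self.last_opid = last_opid  # Last OpId in the local log.

    def start_tablet_copy(self, to, caller_term, current_state,
                          last_opid_in_log=None):
        # Validate everything first, then transition (a choice of this sketch).
        if to != self.uuid:
            return "ERR_INVALID_NAME"
        if self.state == COPYING:
            return "ERR_ALREADY_INPROGRESS"
        if self.state != current_state:
            return "ERR_ILLEGAL_STATE"   # Caller's view of our state is stale.
        if caller_term < self.term:
            return "ERR_BAD_TERM"
        if last_opid_in_log is not None and last_opid_in_log != self.last_opid:
            return "ERR_ILLEGAL_STATE"
        if self.state == RUNNING:
            self.state = DELETED         # Quarantine the existing tablet data.
        self.state = COPYING             # Create/convert tablet in COPYING mode.
        return "OK"                      # RunTabletCopy() would start here.
```

The repeated-RPC idempotence comes from the COPYING and current_state checks: a duplicate request either observes the copy in progress or fails the state comparison.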
The detailed process of downloading and replacing the data on the follower side is described below under "Follower Tablet Copy".
This section describes a tablet server's directory structure, which looks like:
```
instance
tablet-meta/<tablet-id>
consensus-meta/<tablet-id>
wals/<tablet-id>/
data/
```
The primary indicator of a tablet's existence is the presence of a superblock file for the tablet in the tablet-meta directory. If that file does not exist, we have no pointers to tablet data blocks, so we consider the tablet to be in the DOES_NOT_EXIST state. In addition to the superblock, the minimum files needed to start a tablet are the consensus metadata file (under consensus-meta), the write-ahead logs (under wals), and the data blocks (under data). Of course, automatically-generated tablet server-level files like the instance file must also be in place.
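The existence rule can be sketched as a pair of filesystem checks. The helper names and the flat file layout are hypothetical simplifications of the directory structure above:

```python
import os

# Hypothetical helpers that infer tablet status from on-disk files, per the
# rule that the superblock in tablet-meta/ is the primary existence indicator.
def tablet_exists(fs_root, tablet_id):
    """DOES_NOT_EXIST is simply the absence of a superblock file."""
    return os.path.exists(os.path.join(fs_root, "tablet-meta", tablet_id))

def can_start_tablet(fs_root, tablet_id):
    """Minimum files needed to start: superblock, consensus metadata, WALs."""
    needed = [
        os.path.join(fs_root, "tablet-meta", tablet_id),
        os.path.join(fs_root, "consensus-meta", tablet_id),
        os.path.join(fs_root, "wals", tablet_id),
    ]
    return all(os.path.exists(p) for p in needed)
```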
In order to delete a tablet we must permanently retain the Raft metadata to avoid consensus amnesia bugs. We also temporarily back up (quarantine) the data for later debugging purposes. Initially, we will provide some tool to manually remove the quarantined files and their associated data blocks when they are no longer needed.
We can safely implement tablet deletion using the following steps:
1. Set the tablet's state to DELETED in the SuperBlock PB (thus we always roll the delete forward after this step);
2. Store the last OpId in the log into a field in the SuperBlock PB (we need to know the last OpId in our log if we want to be able to vote in elections after being deleted and no longer having our WAL files);
3. Store the path to the QDIR in the SuperBlock as well;
4. Save and fsync the SuperBlock into the tablet metadata directory. At this point the tablet is considered deleted (at startup, we will ensure this process completed; see "Tablet startup" below);
5. Move the tablet's WAL files into the QDIR;
6. Move the tablet's data blocks into the QDIR.
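The key ordering property is that the deletion intent is fsynced before any data is moved, so a crash at any point can be rolled forward at startup. A sketch, where `fs` is a hypothetical filesystem interface and the SuperBlock is modeled as a dict:

```python
# Illustrative crash-safe deletion ordering: persist the intent (state,
# last OpId, QDIR path) durably before quarantining any files.
def delete_tablet(superblock, fs, qdir):
    superblock["state"] = "DELETED"
    # Retain the last OpId so a deleted replica can still vote later.
    superblock["last_opid"] = superblock.get("wal_last_opid")
    superblock["qdir"] = qdir
    fs.fsync_superblock(superblock)  # Point of no return: delete rolls forward.
    fs.move_wal_to(qdir)             # Quarantine WAL segments.
    fs.move_data_blocks_to(qdir)     # Quarantine data blocks.
```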
Tablet Copy copies the data from the remote; merges the new and old consensus metadata files (if a local one already existed; otherwise the remote metadata is adopted); and writes a replacement SuperBlock. On the follower side:
1. Write the SuperBlock in the COPYING state and fsync. If a SuperBlock already exists, clear the QDIR path field, but not the last_opid field (in case we crash at this stage and must delete ourselves at startup, we can (must) retain our knowledge of last_opid in order to be able to vote; see "Tablet startup" below).
2. Download the tablet data from the remote and merge the consensus metadata as described below.
3. Once the copy is complete, write the SuperBlock in the READY state and fsync it.
When downloading the remote consensus metadata file, in order to avoid consensus amnesia, we must ensure that our term remains monotonic and that our memory of our votes is not lost. We can't just adopt the remote consensus metadata, so we must merge it with our own. The following rules keep the consensus metadata valid while merging in the remote metadata:
- remoteTerm > localTerm: adopt remoteTerm (we have cast no vote in any term above localTerm, so no vote history is lost).
- remoteTerm <= localTerm: keep localTerm and retain the vote history for localTerm, if any.
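The merge rules above reduce to a small function. This is a sketch of the invariant (monotonic term, no forgotten vote), not Kudu's actual merge code:

```python
# Sketch of the consensus metadata merge: the persisted term must never
# go backwards, and a vote cast in the retained term must never be lost.
def merge_consensus_meta(local_term, local_voted_for, remote_term):
    """Return the (term, voted_for) pair to persist after Tablet Copy."""
    if remote_term > local_term:
        # Adopt the higher remote term; we have cast no vote in that term.
        return remote_term, None
    # remote_term <= local_term: keep our term and our vote history for it.
    return local_term, local_voted_for

assert merge_consensus_meta(5, "peer-a", 7) == (7, None)
assert merge_consensus_meta(5, "peer-a", 3) == (5, "peer-a")
assert merge_consensus_meta(5, "peer-a", 5) == (5, "peer-a")
```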
There is a simple state machine for maintaining a consistent state when a tablet starts up:
- DELETED && WAL dir exists: Redo steps #5 and #6 in the "Tablet deletion" section above (roll forward).
- COPYING: Delete self (go back to the DELETED state).
- READY: Normal startup.
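The startup state machine can be written directly. The action names here are invented labels for the recovery behaviors described above:

```python
# Illustrative startup handler mapping persisted tablet state to a
# recovery action; action names are this sketch's, not Kudu's.
def on_startup(state, wal_dir_exists):
    if state == "DELETED" and wal_dir_exists:
        return "ROLL_FORWARD_DELETE"  # Redo the quarantine steps.
    if state == "COPYING":
        return "DELETE_SELF"          # Incomplete copy: go back to DELETED.
    if state == "READY":
        return "NORMAL_STARTUP"
    return "NOTHING_TO_DO"            # e.g. DELETED with no leftover WAL.

assert on_startup("DELETED", True) == "ROLL_FORWARD_DELETE"
assert on_startup("COPYING", False) == "DELETE_SELF"
assert on_startup("READY", False) == "NORMAL_STARTUP"
```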
Since the master only needs to invoke a single idempotent RPC to add a new peer to the consensus config, it does not need to store metadata about in-flight operations or perform retries. Policy (for example the under-replication policy) makes those decisions and triggers idempotent actions; the master constantly evaluates its policies and makes adjustments.
When a master determines that it needs to change a tablet's configuration, it does the following:
Master sends an AddServer() RPC to the leader of the tablet to add a new replica as a PRE_VOTER (the initial implementation may simply add the peer as a VOTER). The AddServer() RPC will specify the prior committed Raft config for the tablet to ensure that the request is idempotent.
Idempotent config change operations
To allow the master to make decisions while holding potentially stale consensus config information, we need to add a CompareAndSet-style consistency parameter to "config change" operations such as AddServer() and ChangeRole(). We will add committed_config_id as an optional parameter that identifies the latest committed consensus config by OpId. If that OpId doesn't match the currently committed config on the peer (checked under the RaftConsensus ReplicaState lock), the RPC returns an error indicating that the caller is out of date. If the OpId matches but there is already a pending config change, the request is also rejected (that behavior is already implemented).
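The CompareAndSet check amounts to two comparisons under the lock. A sketch, with invented error strings and OpIds modeled as (term, index) tuples:

```python
# Sketch of the CAS-style validation for config-change RPCs; in Kudu this
# would run under the RaftConsensus ReplicaState lock.
def check_config_change(committed_opid, has_pending_change,
                        caller_committed_opid):
    """Validate a config-change request against the committed config."""
    if caller_committed_opid != committed_opid:
        return "ERR_CALLER_OUT_OF_DATE"  # Caller saw a stale committed config.
    if has_pending_change:
        return "ERR_CHANGE_IN_PROGRESS"  # Only one pending change at a time.
    return "OK"

assert check_config_change((3, 10), False, (2, 7)) == "ERR_CALLER_OUT_OF_DATE"
assert check_config_change((3, 10), True, (3, 10)) == "ERR_CHANGE_IN_PROGRESS"
assert check_config_change((3, 10), False, (3, 10)) == "OK"
```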
If we lose a disk containing consensus metadata or WALs, and need to copy a new replica to recover, it may be impossible to do so safely due to unavoidable consensus amnesia. In such a case, the tablet server must adopt a new UUID and fully clear all of its data and state.
This precaution ensures that an outdated tablet server cannot ever convince a server with consensus amnesia to auto-vivify a new tablet, elect the outdated server leader, and create a parallel-universe consensus group. Of course, reassigning a server's UUID is only effective at isolating amnesiac servers from requests intended for their former selves if all RPC calls are tagged with the UUID of the intended recipient and if that field is always validated. This ensures that a server refuses to participate with a peer that is calling it by the incorrect (old) name.
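The requirement that every RPC is tagged with and validated against the intended recipient's UUID is a one-line check on each server. The function and error names here are illustrative:

```python
# Sketch of destination-UUID validation on every RPC: a server that has
# adopted a new UUID refuses requests addressed to its former identity.
def validate_rpc_dest(self_uuid, dest_uuid):
    if dest_uuid != self_uuid:
        return "ERR_WRONG_SERVER_UUID"  # Caller knows us by an old name.
    return "OK"

assert validate_rpc_dest("uuid-new", "uuid-old") == "ERR_WRONG_SERVER_UUID"
assert validate_rpc_dest("uuid-new", "uuid-new") == "OK"
```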
It may be possible to optimize this and avoid having to nuke the whole box when only one disk fails. Maybe we could rely on RAID for the consensus metadata and WALs, or we could try to make tablets sticky to specific disks, or do something else clever. However those options aren't great and this issue is pretty tricky.
Ideally, tombstoned (DELETED) replicas should be allowed to vote. Scenarios can be constructed where a tablet config could not make progress (elect a leader) without this functionality. This is possible to support as long as we retain consensus metadata indefinitely, which is required for correctness anyway. However, this is not a top priority.
One scenario where a DELETED tablet may need to vote to make forward progress is if a VOTER replica falls behind, starts Tablet Copy, crashes in the middle of Tablet Copy, and deletes itself at startup. Once we implement PRE_VOTER, and always catch up as a PRE_VOTER before becoming a VOTER, the opportunity for potential problems with VOTERs is greatly reduced, especially around the initial step of adding a server to the cluster, but that still does not address the above scenario.
Ideally, we would support aborting an in-progress config change request if the new peer is not responding. This could have some availability benefits. However this is not a top priority.
There are hacks that can work around a failed config change, such as truncating the log and forcing a leader to be elected in a new term, which would let us set quorum members manually. But it would be better to have some kind of automatic rollback after a timeout.