Multi-master support for Kudu 1.0

Background

Kudu's design avoids a single point of failure via multiple Kudu masters. Just as with tablets, master metadata is persisted to disk and replicated via Raft consensus, and so a deployment of 2N+1 masters can tolerate up to N failures.

By the time Kudu's first beta launched, support for multiple masters had been implemented but was too fragile to be anything but experimental. The rest of this document describes the gaps that must be filled before multi-master support is ready for production, and lays out a plan for how to fill them.

Gaps in the master

Current design

At startup, a master Raft configuration will elect a leader master. The leader master is responsible for servicing both tserver heartbeats as well as client requests. The follower masters participate in Raft consensus and replicate metadata, but are otherwise idle. Any heartbeats or client requests they receive are rejected.

All persistent master metadata is stored in a single replicated tablet. Every row in this tablet represents either a table or a tablet. Table records include unique table identifiers, the table's schema, and other bits of information. Tablet records include a unique identifier, the tablet's Raft configuration, and other information.

What master metadata is replicated?

  1. Table and tablet existence, via CreateTable() and DeleteTable(). Every new tablet record also includes an initial Raft configuration.
  2. Schema changes, via AlterTable() and tserver heartbeats.
  3. Tablet server Raft configuration changes, via tserver heartbeats. These include both the list of peers (may have changed due to under-replication) as well as the current leader (may have changed due to an election).

Scanning the master tablet for every heartbeat or request would be slow, so the leader master caches all master metadata in memory. The caches are only updated after a metadata change is successfully replicated; in this way they are always consistent with the on-disk tablet. When a new leader master is elected, it scans the entire master tablet and uses the metadata to rebuild its in-memory caches.

To understand how the in-memory caches work, let's start with the different kinds of information that are cached:

  1. Tablet server instance. Uniquely identified by the tserver's UUID; includes general metadata about the tserver, how recently this master received a heartbeat from the tserver, and proxy objects that the master uses to communicate with the tserver.
  2. Table instance. Uniquely identified by the table's ID; includes general metadata about the table.
  3. Tablet instance. Uniquely identified by the tablet's ID; includes general metadata about the tablet along with its current replicas.

Now, let's describe the various data structures that store this information (a rough code sketch follows the list):

  1. Global: map of tserver UUID to tserver instance.
  2. Global: map of table ID to table instance. All tables are present, even deleted tables.
  3. Global: map of table name to table instance. Deleted tables are omitted, otherwise a new table could collide with a deleted one.
  4. Global: map of tablet ID to tablet instance. All tablets are present, even deleted ones.
  5. Per-table instance: map of tablet ID to tablet instance. Deleted tablets are retained here but are not inserted the next time master metadata is reloaded from disk.
  6. Per-tablet instance: map of tserver UUID to tablet replica.
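
To make the cache layout concrete, here is a rough sketch of how these structures might be declared; the type and member names are illustrative stand-ins, not the actual Kudu catalog code:

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <string>
    #include <unordered_map>

    // Illustrative stand-ins for the cached instances described above.
    struct TSDescriptor {           // one known tablet server
      std::string uuid;
      int64_t last_heartbeat_ms;    // used to judge "liveness"
    };

    struct TabletInfo;              // forward declaration

    struct TableInfo {              // one table, deleted or not
      std::string table_id;
      std::string name;
      // Per-table map: tablet ID -> tablet instance. Deleted tablets linger
      // here until the next metadata reload.
      std::map<std::string, std::shared_ptr<TabletInfo>> tablet_map;
    };

    struct TabletReplica {
      std::string ts_uuid;          // which tserver hosts this replica
      bool is_leader;
    };

    struct TabletInfo {             // one tablet
      std::string tablet_id;
      std::shared_ptr<TableInfo> table;
      // Per-tablet map: tserver UUID -> replica. NOT rebuilt from disk.
      std::unordered_map<std::string, TabletReplica> replica_locations;
    };

    // The global caches held by a master.
    struct CatalogCaches {
      std::unordered_map<std::string, std::shared_ptr<TSDescriptor>> ts_by_uuid;
      std::unordered_map<std::string, std::shared_ptr<TableInfo>> table_by_id;    // includes deleted tables
      std::unordered_map<std::string, std::shared_ptr<TableInfo>> table_by_name;  // deleted tables omitted
      std::unordered_map<std::string, std::shared_ptr<TabletInfo>> tablet_by_id;  // includes deleted tablets
    };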

With few exceptions (detailed below), these caches behave more or less as one would expect. For example, a successful CreateTable() yields new table, tablet, and replica instances, all of which are mapped accordingly. A heartbeat from a tserver may update a tablet's replica map if that tablet elected a new leader replica. And so on.

All tservers start up with location information for the entire master Raft configuration, but are only responsible for heartbeating to the leader master. Prior to the first heartbeat, a tserver must determine which of the masters is the leader. After that, the tserver sends heartbeats to that master until it fails or steps down, at which point the tserver must determine the new leader master and the cycle repeats. Follower masters ignore all heartbeats.

The information communicated in a tserver's heartbeat varies. Normally the enclosed tablet report is “incremental” in that it only includes tablets that have undergone Raft configuration or role changes. In rarer circumstances the tablet report will be “full” and include every tablet. Importantly, incremental reports are “edge” triggered; that is, after a tablet is included in an incremental report N, it is omitted from incremental report N+1. Full tablet reports are sent when one of the following conditions is met:

  1. The master requested it in its previous heartbeat response, because the master is seeing this tserver for the first time.
  2. The tserver has just started.
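
To illustrate the edge-triggered behavior and the full-report conditions, here is a sketch of how a tserver-side report builder might work; the class and member names are invented for this example and are not the actual Kudu identifiers:

    #include <set>
    #include <string>
    #include <vector>

    // Illustrative sketch of how a tserver might decide what to report.
    class HeartbeatReporter {
     public:
      // Called whenever a tablet undergoes a Raft configuration or role change.
      void MarkDirty(const std::string& tablet_id) {
        dirty_tablets_.insert(tablet_id);
      }

      // Called when a heartbeat response asks for a full report.
      void RequestFullReport() { needs_full_report_ = true; }

      // Builds the tablet report for the next heartbeat.
      std::vector<std::string> BuildReport(const std::vector<std::string>& all_tablets) {
        std::vector<std::string> report;
        if (needs_full_report_) {
          report = all_tablets;  // "full": every tablet is included
          needs_full_report_ = false;
        } else {
          report.assign(dirty_tablets_.begin(), dirty_tablets_.end());  // "incremental"
        }
        // Edge-triggered: once reported, a tablet leaves the dirty set, so it
        // will be omitted from the next incremental report.
        dirty_tablets_.clear();
        return report;
      }

     private:
      std::set<std::string> dirty_tablets_;
      bool needs_full_report_ = true;  // the first heartbeat after startup is full
    };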

Clients behave much like tservers: they are configured a priori with the entire master Raft configuration, must always communicate with the leader master, and will retry their requests until the new leader master is found.

Known issues

KUDU-1358: CreateTable() may fail following a master election

One of the aforementioned in-memory caches keeps track of all known tservers and their “liveness” (i.e. how likely they are to be alive). This cache is NOT rebuilt using persistent master metadata; instead, it is updated whenever an unknown tserver heartbeats to the leader master. CreateTable() requests use this cache to determine whether a new table can be satisfied using the current cluster size; if not, the request is rejected.

Right after a master election, the new leader master may have a cold tserver cache if it's never seen any heartbeats before. Until an entire heartbeat interval has elapsed, this cold cache can adversely affect CreateTable() requests in two ways (sketched in code below):

  1. If insufficient tservers are known to the master, the request will fail.
  2. If not all tservers are known to the master, the cluster may become unbalanced as tablets pile up on a particular subset of tservers.
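
A minimal sketch of the failure mode, assuming (for illustration only) that the placement check simply counts the tservers currently in the cache:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct TSDescriptor { int64_t last_heartbeat_ms; };  // stand-in for a cached tserver entry

    // Illustrative placement check a leader master might run during CreateTable().
    // Right after an election, ts_by_uuid may be empty or only partially filled,
    // so this check can spuriously fail (case 1), or replicas may be placed on a
    // skewed subset of tservers (case 2), until a heartbeat interval has elapsed.
    bool EnoughLiveTabletServers(
        const std::unordered_map<std::string, TSDescriptor>& ts_by_uuid,
        int num_replicas) {
      return static_cast<int>(ts_by_uuid.size()) >= num_replicas;
    }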

KUDU-1353: AlterTable() may get stuck

Another in-memory cache that is NOT rebuilt using persistent master metadata is the per-tablet replica map. These maps are used to satisfy client GetTableLocations() and GetTabletLocations() requests. They're also used to determine the leader replica for various master-to-tserver RPC requests, such as in response to a client's AlterTable(). Each of these maps is updated during a tserver heartbeat using a tablet's latest Raft configuration.

If a new leader master is elected and a tserver is already known to it (perhaps because the master had been leader before), every heartbeat the master receives from that tserver will include an empty tablet report, since the tserver has nothing new to report. This is problematic for the per-tablet replica maps, which will remain empty until tablets show up in a tablet report.

Empty per-tablet replica maps in an otherwise healthy leader master can cause a variety of failures (see the sketch after this list):

  1. AlterTable() requests may time out.
  2. GetTableLocations() and GetTabletLocations() requests may yield no replica location information.
  3. DeleteTable() requests will succeed but won't delete any tablets from tservers until those tablets find their way into tablet reports.
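
The common cause of these failures can be sketched as follows; the types and lookup path are illustrative, not the actual master code:

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct TabletReplica { std::string ts_uuid; bool is_leader; };

    struct TabletInfo {
      // Rebuilt only from heartbeats, never from the on-disk catalog tablet.
      std::unordered_map<std::string, TabletReplica> replica_locations;
    };

    // After a master election this map is empty until the tablet appears in a
    // tablet report, so location lookups (and the leader lookups needed to send
    // AlterTablet() or DeleteTablet() RPCs) come back empty even though the
    // tablet itself is healthy.
    std::vector<TabletReplica> GetTabletLocations(const TabletInfo& tablet) {
      std::vector<TabletReplica> locations;
      for (const auto& entry : tablet.replica_locations) {
        locations.push_back(entry.second);
      }
      return locations;  // empty on a freshly elected leader master
    }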

KUDU-1374: Operations triggered by heartbeats may go unperformed

As described earlier, the inclusion or exclusion of a tablet in an incremental tablet report is edge-triggered, and may result in a state-changing operation on the tserver, communicated via an out-of-band RPC. This RPC is retried until it succeeds. However, if the leader master dies after responding to the tserver's heartbeat but before the out-of-band RPC is sent, the edge-triggered tablet report is effectively missed, and the state-changing operation will not be performed until the next time the tablet is included in a tablet report. Because the criteria for including a tablet in a report are narrow, operations may be “missed” for quite some time.

These operations include:

  1. Some tablet deletions, such as tablets belonging to orphaned tables, or tablets whose deletion RPCs were sent and failed during an earlier DeleteTable() request.
  2. Some tablet alters, such as tablets whose alter RPCs were sent and failed during an earlier AlterTable() request.
  3. Config changes sent due to under-replicated tablets.

KUDU-495: Masters may abort if replication fails

Some master operations will crash the master when replication fails. That's because they were implemented with local consensus in mind, wherein a replication failure is indicative of a disk failure and recovery is unlikely. With multiple masters and Raft consensus, replication may fail if the current leader master is no longer the leader master (e.g. it was partitioned from the rest of its Raft configuration, which promptly elected a new leader), raising the likelihood of a master crash.

Unimplemented features

Master Raft configuration changes

It's not currently possible to add or remove a master from an active Raft configuration.

It would be nice to implement this for Kudu 1.0, but it's not a strict requirement.

KUDU-500: allow followers to handle read-only client operations

Currently, follower masters reject all client operations, as the expectation is that clients communicate with the leader master. As a performance optimization, followers could handle certain “safe” operations (i.e. read-only requests), but follower masters must do a better job of keeping their in-memory caches up-to-date before this change is made. Moreover, such operations should include an indication of how stale the follower master's information may be while still being acceptable to the client.
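
For illustration, the staleness bound might be expressed roughly as follows; the field and function names are invented for this sketch and are not part of the current Kudu RPC interface:

    #include <cstdint>

    // Hypothetical read-only request carrying a client-specified staleness bound.
    struct ReadOnlyRequest {
      int64_t max_staleness_ms;  // how stale a follower's answer may be
    };

    // A follower master would serve the request locally only if its replicated
    // state is fresh enough; otherwise the client retries against the leader.
    bool CanServeFromFollower(const ReadOnlyRequest& req,
                              int64_t ms_since_last_replicated_entry) {
      return ms_since_last_replicated_entry <= req.max_staleness_ms;
    }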

It would be nice to implement this for Kudu 1.0, but it's not a strict requirement.

Gaps in the clients

(TBD)

TODO: JD says the code that detects the current master was partially removed from the Java client because it was buggy. Needs further investigation.

Goals

  1. The plan must address all of the aforementioned known issues.
  2. If possible, the plan should implement the missing features too.
  3. The plan's scope should be minimized, provided that doesn't complicate future implementations. In other words, the plan should not force backwards incompatibilities down the line.

Plan

Heartbeat to all masters

This is probably the most effective way to address KUDU-1358, and, as a side benefit, helps implement the “verify cluster connectivity” feature. This old design document describes KUDU-1358 and its solutions in more detail.

With this change, tservers no longer need to “follow the leader” as they will heartbeat to every master. However, a couple of things need to change for this to work correctly:

Follower masters must process heartbeats, at least in part

These heartbeats are processed only to refresh the follower master's notion of the tserver's liveness; table and tablet information is still replicated from the leader master and should be ignored if it appears in the heartbeat.

All state changes taken by a tserver must be fenced

That is, the master and/or tserver must enforce that all actions take effect iff they were sent by the master that is currently the leader.

After an exhaustive audit of all master state changes (see appendix A), it was determined that the current protection mechanisms built into each RPC are sufficient to provide fencing. The one exception is orphaned replica deletion done in response to a heartbeat. To protect against that, true orphans (i.e. tablets for which no persistent record exists) will not be deleted at all. As the master retains deleted table/tablet metadata in perpetuity, this should ensure that true orphans appear only under drastic circumstances, such as a tserver that heartbeats to the wrong cluster.

The following protection mechanisms are here for historical record; they will not be implemented.

Alternative fencing mechanisms

One way to do this is to include the current term in every requested state change and heartbeat response. Each tserver maintains the current term in memory, updating it whenever a heartbeat response or RPC includes a later term (thus also serving as a “there is a new leader master” notification). If a state change request includes an older term, it is rejected. When a tserver first starts up, it initializes the current term with whatever term is indicated in the majority of the heartbeat responses. In this way it can protect itself from a “rogue master” at startup without having to persist the current term to disk.
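
A minimal sketch of this term-based fencing on the tserver side, with invented names, might look like:

    #include <cstdint>
    #include <mutex>

    // Illustrative sketch only; not code from the Kudu tree. The tserver tracks
    // the highest leader term it has seen (in memory only) and rejects state
    // changes that carry an older term.
    class MasterTermFence {
     public:
      // Called for every heartbeat response or master RPC that carries a term.
      void ObserveTerm(int64_t term) {
        std::lock_guard<std::mutex> l(lock_);
        if (term > current_term_) {
          current_term_ = term;  // doubles as a "there is a new leader master" signal
        }
      }

      // Called before applying a state change requested by a master.
      bool ShouldAccept(int64_t request_term) {
        std::lock_guard<std::mutex> l(lock_);
        return request_term >= current_term_;  // reject requests from stale leaders
      }

     private:
      std::mutex lock_;
      int64_t current_term_ = 0;  // seeded from a majority of heartbeat responses at startup
    };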

An alternative to the above fencing protocol is to ensure that the leader master replicates via Raft before triggering a state change. It doesn't matter what is replicated; a successful replication asserts that this master is still the leader. However, our Raft implementation doesn't currently allow for replicating no-ops (i.e. log entries that needn't be persisted). Moreover, this is effectively an implementation of “leader leases” (in that a successful replication grants the leader a “lease” to remain leader for at least one Raft replication interval), but one that the rest of Kudu must be made aware of in order to be fully robust.

Send full heartbeats to newly elected leader masters

To address KUDU-1374, when a tserver passively detects that there's a new leader master (i.e. step #2 above), it should send it a full heartbeat. This will ensure that any heartbeat-triggered actions intended but not taken by the old leader master are reinitiated by the new one.

Ensure each logical operation is replicated as one batch

Kudu doesn't yet support atomic multi-row transactions, but all row operations bound for one tablet and batched into one Write RPC are combined into one logical transaction. This property is useful for multi-master support as all master metadata is encapsulated into a single tablet. With some refactoring, it is possible to ensure that any logical operation (e.g. creating a table) is encapsulated into a single RPC. Doing so would obviate the need for “roll forward” repair of partially replicated operations during metadata load and is necessary to address KUDU-495.
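
A sketch of the batching idea, using stand-in types rather than Kudu's actual WriteRequestPB or catalog entities:

    #include <string>
    #include <vector>

    // Illustrative catalog entry: one row destined for the master tablet.
    struct CatalogEntry {
      std::string row_key;
      std::string serialized_metadata;
    };

    struct BatchedWrite {
      std::vector<CatalogEntry> ops;
    };

    // Combine one logical operation (here, creating a table) into a single
    // batch. The whole batch is replicated via Raft as one unit: either every
    // entry is applied or none is, so there is no partially created table to
    // roll forward during metadata load.
    BatchedWrite BuildCreateTableBatch(const CatalogEntry& table_entry,
                                       const std::vector<CatalogEntry>& tablet_entries) {
      BatchedWrite batch;
      batch.ops.push_back(table_entry);  // one table record
      batch.ops.insert(batch.ops.end(), tablet_entries.begin(), tablet_entries.end());  // N tablet records
      return batch;
    }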

Some repair is still necessary for table-wide operations. These complete on a tablet-by-tablet basis, so it is possible for partially created, altered, or deleted tables to exist at any point in time. However, the repair is a natural course of action taken by the leader master:

  1. During a tablet report: for example, if the master sees a tablet without the latest schema, it'll send that tablet an idempotent alter RPC. Further, thanks to the above full heartbeat change, the new leader master will have an opportunity to roll forward such tables on master failover.
  2. In the background: the master will periodically scan its in-memory state looking for tablets that have yet to be reported. If it finds one, it will be given a “nudge”; the master will send a create tablet RPC to the appropriate tserver. This scanning continues in a new leader master following master failover.

All batched logical operations include one table entry and N tablet entries, where N is the number of tablets in the table. These entries are encapsulated in a WriteRequestPB that is replicated by the leader master to follower masters. When N is very large, it is conceivable for the WriteRequestPB to exceed the maximum size of a Kudu RPC. To determine just how likely this is, replication RPC sizes were measured in the creation of a table with 1000 tablets and a simple three-column schema. The results: the replication RPC clocked in at ~117 KB, a far cry from the 8 MB maximum RPC size. Thus, a batch-based approach should be reasonable for today's scale targets.
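
As a rough extrapolation, 117 KB for 1000 tablets works out to roughly 120 bytes per tablet entry, so a single batch would only approach the 8 MB limit at somewhere around 70,000 tablets in one table, far beyond today's scale targets.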

Remove some unnecessary in-memory caches

To fix KUDU-1353, the per-tablet replica locations could be removed entirely. The same information is already present in each tablet instance, just not in an easy-to-use map form. The only downside is that operations that previously used the cached locations would need to perform more lookups into the tserver map, to resolve tserver UUIDs into instances. We think this is a reasonable trade-off, however, as the tserver map should be hot.
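
To illustrate the extra lookups, here is a sketch (with stand-in types) of resolving a tablet's replicas through the tserver map instead of the removed cache:

    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct TSDescriptor { std::string uuid; std::string rpc_address; };
    struct TabletInfo { std::vector<std::string> raft_peer_uuids; };  // from the committed Raft config

    // Walk the tablet's Raft config and resolve each peer UUID through the
    // (expected to be hot) tserver map.
    std::vector<std::shared_ptr<TSDescriptor>> ResolveReplicas(
        const TabletInfo& tablet,
        const std::unordered_map<std::string, std::shared_ptr<TSDescriptor>>& ts_by_uuid) {
      std::vector<std::shared_ptr<TSDescriptor>> result;
      for (const std::string& uuid : tablet.raft_peer_uuids) {
        auto it = ts_by_uuid.find(uuid);  // one extra lookup per replica
        if (it != ts_by_uuid.end()) {
          result.push_back(it->second);
        }
      }
      return result;
    }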

An alternative is to rebuild the per-tablet replica locations on metadata load, but the outright removal of that cached data is a simpler solution.

The tserver cache could also be rebuilt, but:

  1. It will be incomplete, as only the last known RPC address (and none of the HTTP addresses) is persisted. Additionally, addressing KUDU-418 may require the removal of the last known RPC address from the persisted state, at which point there's nothing worth rebuilding.
  2. The tserver cache is expected to be warm from the moment any master (including followers) starts up due to “Heartbeat to all masters” above.

Note: in-memory caches belonging to a former leader master will, by definition, contain stale information. These caches could be cleared following an election, but it shouldn't matter either way as this master is no longer servicing client requests or tserver heartbeats.

Ensure strict ordering for all state changes

Master state change operations should adhere to the following contract (a skeleton sketch follows the list):

  1. Acquire locks on the relevant table and/or tablets. Which locks and whether the locks are held for reading or writing depends on the operation. For example, DeleteTable() acquires locks for writing on the table and all of its tablets. Table locks must be acquired before tablet locks.
  2. Mutate in-memory state belonging to the table and/or tablets. These mutations are made via COW so that concurrent readers continue to see only “committed” data.
  3. Replicate all of the mutations via Raft consensus in a single batch. If the replication fails, the overall operation fails, the mutations are discarded, and the locks are released.
  4. Replication has succeeded; the operation may not fail beyond this point.
  5. Commit the mutations. If both table and tablet mutations are committed, tablet mutations must come first. Any other in-memory changes must be performed now.
  6. The success of the operation is now consistent on this master in both on-disk state as well as in-memory state.
  7. Send RPCs to the tservers (e.g. DeleteTable() will now send delete tablet RPCs for each tablet in the table). The work done by an RPC must eventually take place, either by retrying the RPC until it succeeds, or through some other mechanism, such as sending a new RPC in response to a heartbeat.
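
A skeleton of this contract, with invented names rather than the actual catalog manager interface, might look like:

    #include <stdexcept>

    // Illustrative skeleton only; every name is a stand-in.
    class CatalogOperation {
     public:
      virtual ~CatalogOperation() = default;

      void Run() {
        AcquireLocks();              // 1. table lock first, then tablet locks
        MutateInMemoryState();       // 2. copy-on-write; readers still see committed data
        if (!ReplicateViaRaft()) {   // 3. all mutations in a single Raft batch
          DiscardMutations();        //    on failure: drop mutations, release locks, fail
          ReleaseLocks();
          throw std::runtime_error("replication failed");
        }
        // 4. Replication succeeded; the operation may not fail past this point.
        CommitMutations();           // 5. tablet mutations first, then the table mutation
        // 6. On-disk and in-memory state now agree on this master.
        ReleaseLocks();
        SendTabletServerRpcs();      // 7. retried (or re-triggered by heartbeats) until done
      }

     protected:
      virtual void AcquireLocks() = 0;
      virtual void MutateInMemoryState() = 0;
      virtual bool ReplicateViaRaft() = 0;
      virtual void DiscardMutations() = 0;
      virtual void CommitMutations() = 0;
      virtual void ReleaseLocks() = 0;
      virtual void SendTabletServerRpcs() = 0;
    };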

Generally speaking, this contract is upheld universally. However, a detailed audit (see appendix B) of the master has revealed a few exceptions:

  1. During CreateTable(), intermediate table and tablet state is made visible prior to replication.
  2. During DeleteTable(), RPCs are sent prior to the committing of mutations.
  3. When background scanning for newly created tablets, the logic that “replaces” timed out tablets makes the replacements visible prior to replication.

To prevent clients from seeing intermediate state and other potential issues, these operations must be made to adhere to the above contract.

Appendix A: Fencing audit and discussion

To understand which master operations need to be fenced, we ask the following key questions:

  1. What decisions can a master make unilaterally (i.e. without first replicating so as to establish consensus)?
  2. When making such decisions, does the master at least consult past replicated state first? If it does, would “stale” state yield incorrect decisions?
  3. If it doesn't (or can't) consult replicated state, are the external actions performed as a result of the decision safe?

We identified the set of potentially problematic external actions as those taken by the master during tablet reports.

We ruled out ChangeConfig; it is safe due to the use of CAS on the last change config opid (protects against two leader masters both trying to add a server), and because if the master somehow added a redundant server, in the worst case the new replica will be deleted the next time it heartbeats.

That left DeleteReplica, which is called under the following circumstances:

  1. When the master can't find any record of the replica's tablet or its table, it is deleted.
  2. When the persistent state says that the replica (or its table) has been deleted, it is deleted.
  3. When the persistent state says that the replica is no longer part of the Raft config, it is deleted.
  4. When the persistent state includes replicas that aren't in the latest Raft config, they are deleted.

Like ChangeConfig, cases 3 and 4 are protected with a CAS. Cases 1 and 2 are not, but 2 falls into category #2 from earlier: if persistent state is consulted and the decision is made to delete a replica, that decision is correct and cannot become incorrect (i.e. under no circumstance would a tablet become “undeleted”).

That leaves case 1 as the only instance that needs additional fencing. We could implement leader leases as described earlier or “current term” checking to protect against it.

Or, we could 1) continue our current policy of retaining persistent state of deleted tables/tablets forever, and 2) change the master not to delete tablets for which it has no records. If we always have the persistent state for deleted tables, all instances of case 1 become case 2 unless there's some drastic problem (e.g. tservers are heartbeating to the wrong master), in which case not deleting the tablets is probably the right thing to do.

Appendix B: Master operation audit and analysis

The following are detailed audits of how each master operation works today. The description of each operation is followed by a list of potential improvements, all of which have already been incorporated into the above plan. These audits may be useful for understanding the plan.

Creating a new table

CreateTable() RPC

  1. create table in state UNKNOWN, begin mutation to state PREPARING
  2. create tablets in state UNKNOWN, begin mutation to state PREPARING
  3. update in-memory maps (table by id/name, tablet by id)
    • new table and tablets are now visible with UNKNOWN state and no useful metadata (table schema, name, tablet partitions, etc.)
  4. replicate new tablets
  5. change table from state PREPARING to RUNNING
  6. replicate new table
  7. commit mutation for table
    • new table and tablets are visible, table with full metadata in RUNNING state but tablets still in UNKNOWN state without metadata
  8. commit mutations for tablets
    • new table and tablets are visible with full metadata, table in RUNNING state, tablets in PREPARING state (without consensus state)
  9. wake up bg_tasks thread

Potential improvements:

  1. The two replications can be safely combined into one replication
  2. Consumers of table and tablet in-memory state must be prepared for UNKNOWN intermediate state without metadata, as well as lack of consensus state. This can be addressed by:
    1. Adding a new global in-memory unordered_set to “lock” a table's name while creating it. When the create is finished (regardless of success), the name is unlocked. We could even reuse the tablet LockManager for this, as it has no real tablet dependencies
    2. Moving step #3 to after step #7

Background task scanning

  1. for each tablet t in state PREPARING:
    • change t from state PREPARING to CREATING
  2. for each tablet t_old in state CREATING:
    • if t_old timed out:
      1. create tablet t_new in state UNKNOWN, begin mutation to state PREPARING
      2. add t_new to table's tablet_map
        • t_new is now visible when operating on all of the table's tablets, but is in UNKNOWN state and has no useful metadata
      3. update in-memory tablet_by_id map with t_new
        • t_new is now visible via tablet_id lookups, still UNKNOWN and with no useful metadata
      4. change t_old from state CREATING to REPLACED
      5. change t_new from state PREPARING to CREATING
  3. for each tablet t in state CREATING:
    • reset t committed_consensus_state:
      1. reset term to min term
      2. reset consensus type to local or distributed
      3. reset index to invalid op index
      4. select peers
  4. replicate new and existing tablets
  5. if error in replica selection or replication:
    1. update each table's in-memory tablet_map to remove t_new tablets (step 2)
      • t_new tablets still visible via tablet_id lookups
    2. update tablet_by_id map to remove t_new tablets (step 2)
      • t_new tablets no longer visible
  6. send DeleteTablet() RPCs for all t_old tablets (step 2)
  7. send CreateTablet() RPCs for all created tablets
  8. commit mutations for new and existing tablets
    • replacement tablets from step #2 now have full metadata, all tablets now have visible consensus state

Potential improvements:

  1. All replication is already atomic; nothing to change here
  2. Steps 6 and 7 should probably be reordered after Step 8
  3. t_new can expose intermediate state to consumers, much like in CreateTable(). Is this safe?
    1. Remove step #5
    2. After step #8 (but before RPCs), insert steps #2b and #2c

Deleting a table

DeleteTable() RPC

  1. change table from state RUNNING (or ALTERING) to REMOVED
  2. replicate table
  3. commit mutation for table
    • new state (REMOVED) is now visible
  4. for each tablet t:
    1. Send DeleteTablet() RPC for t
    2. change t from state RUNNING to DELETED
    3. replicate t
    4. commit mutation for t
      • new state (DELETED) is now visible

Potential improvements:

  1. Table and tablet replications can be safely combined, provided we invert the commit order and make sure the rest of the master is OK with that
  2. DeleteTablet() RPC should come after tablet mutation is committed

Altering a table

AlterTable() RPC

  1. if requested change to table name:
    • change name
  2. if requested change to table schema:
    1. reset fully_applied_schema with current schema
    2. reset current schema
  3. increment schema version
  4. increment next column id
  5. change table from state RUNNING to ALTERING
  6. replicate table
  7. commit mutation to table
    • new state (ALTERING), schema, etc. are now visible
  8. Send AlterTablet() RPCs for each tablet

No potential improvements

Heartbeating

Heartbeat for tablet t

  1. if t fails by_id lookup:
    • send DeleteTablet() RPC for t
  2. if t's table does not exist:
    • send DeleteTablet() RPC for t
  3. if t is REMOVED or REPLACED, or t's table is DELETED:
    • send DeleteTablet() RPC for t
  4. if t is no longer in the list of peers:
    • send DeleteTablet() RPC for t
  5. if t is CREATING and has a leader:
    • change t from CREATING to RUNNING
  6. change consensus state
    1. reset committed_consensus_state (if it exists in report)
    2. update tablet replica locations
      • new replica locations now immediately visible
    3. if t is no longer in the list of peers:
      • send DeleteTablet() RPC for t
    4. if tablet is under-replicated
      • send ConfigChange() RPC for t
  7. update t
  8. commit mutation for t
    • new state and committed consensus state are now visible
  9. if t's reported schema version isn't the latest:
    • send AlterTablet() RPC for t
  10. else:
    1. update tablet schema version
      • new schema version is immediately visible
    2. if table is in state ALTERING and all tablets have latest schema version:
      1. clear fully_applied_schema
      2. change table from state ALTERING to RUNNING
      3. replicate table
      4. commit mutation for table
        • new state (RUNNING) and empty fully_applied_schema now visible

No potential improvements. One replication per tablet (as written) is OK because:

  1. If replication fails but master still alive: tserver will retry same kind of heartbeat, giving master a chance to replicate again
  2. If replication fails and master dies: with proposed “send full heartbeat on new leader master” change, failed replications will be retried by new leader master