Kudu's design avoids a single point of failure via multiple Kudu masters. Just as with tablets, master metadata is persisted to disk and replicated via Raft consensus, and so a deployment of 2N+1 masters can tolerate up to N failures.
By the time Kudu's first beta launched, support for multiple masters had been implemented but was too fragile to be anything but experimental. The rest of this document describes the gaps that must be filled before multi-master support is ready for production, and lays out a plan for how to fill them.
At startup, a master Raft configuration will elect a leader master. The leader master is responsible for servicing both tserver heartbeats as well as client requests. The follower masters participate in Raft consensus and replicate metadata, but are otherwise idle. Any heartbeats or client requests they receive are rejected.
What master metadata is replicated?

All persistent master metadata is stored in a single replicated tablet. Every row in this tablet represents either a table or a tablet. Table records include unique table identifiers, the table's schema, and other bits of information. Tablet records include a unique identifier, the tablet's Raft configuration, and other information.
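For illustration, the two kinds of records might be modeled as follows; the field names here are hypothetical simplifications of the real protobuf entries:

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified shapes of the two kinds of rows in the
// master tablet. The real entries are protobuf messages with more fields.
struct TableRecord {
  std::string table_id;  // unique table identifier
  std::string name;
  std::string schema;    // serialized schema
  // ... other bits of information (state, version, etc.)
};

struct TabletRecord {
  std::string tablet_id;                 // unique tablet identifier
  std::string table_id;                  // owning table
  std::vector<std::string> raft_config;  // peer UUIDs in the tablet's Raft config
  // ... other information (state, partition bounds, etc.)
};
```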
Scanning the master tablet for every heartbeat or request would be slow, so the leader master caches all master metadata in memory. The caches are only updated after a metadata change is successfully replicated; in this way they are always consistent with the on-disk tablet. When a new leader master is elected, it scans the entire master tablet and uses the metadata to rebuild its in-memory caches.
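A sketch of that rebuild flow, using hypothetical stand-ins for the master tablet scan:

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative sketch of the cache rebuild performed by a newly elected
// leader master. MasterTabletRow and ScanMasterTablet() are hypothetical
// stand-ins for the real tablet scan.
struct MasterTabletRow {
  bool is_table;        // row describes a table (otherwise a tablet)
  std::string id;       // table or tablet identifier
  std::string payload;  // serialized metadata
};

// Stub: in reality this is a full scan of the replicated master tablet.
std::vector<MasterTabletRow> ScanMasterTablet() { return {}; }

struct MetadataCaches {
  std::map<std::string, std::string> tables;   // table id -> metadata
  std::map<std::string, std::string> tablets;  // tablet id -> metadata
};

// The caches are rebuilt from scratch so that they are consistent with the
// on-disk tablet; they are never populated from unreplicated state.
MetadataCaches RebuildCachesOnElection() {
  MetadataCaches caches;
  for (const MasterTabletRow& row : ScanMasterTablet()) {
    if (row.is_table) {
      caches.tables[row.id] = row.payload;
    } else {
      caches.tablets[row.id] = row.payload;
    }
  }
  return caches;
}
```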
To understand how the in-memory caches work, let's start with the different kinds of information that are cached:
Now, let's describe the various data structures that store this information:
With few exceptions (detailed below), these caches behave more or less as one would expect. For example, a successful CreateTable() yields new table, tablet, and replica instances, all of which are mapped accordingly. A heartbeat from a tserver may update a tablet's replica map if that tablet elected a new leader replica. And so on.
All tservers start up with location information for the entire master Raft configuration, but are only responsible for heartbeating to the leader master. Prior to the first heartbeat, a tserver must determine which of the masters is the leader. After that, the tserver sends heartbeats to that master until it fails or steps down, at which point the tserver must determine the new leader master and the cycle repeats. Follower masters ignore all heartbeats.
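The discovery-and-heartbeat cycle might look roughly like this (a sketch with hypothetical RPC wrappers, not Kudu's actual tserver code):

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Hypothetical result of sending one heartbeat RPC to a master.
enum class HeartbeatResult { kOk, kNotTheLeader, kUnreachable };

// Stub: in reality this sends a heartbeat RPC; follower masters reject it.
HeartbeatResult TryHeartbeat(const std::string& master_addr) {
  return HeartbeatResult::kOk;
}

void HeartbeatLoop(const std::vector<std::string>& all_masters) {
  size_t leader_idx = 0;
  while (true) {
    // Probe masters until one accepts the heartbeat; only the leader does.
    while (TryHeartbeat(all_masters[leader_idx]) != HeartbeatResult::kOk) {
      leader_idx = (leader_idx + 1) % all_masters.size();
    }
    // Keep heartbeating to the same master until it fails or steps down,
    // at which point the outer loop rediscovers the leader.
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```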
The information communicated in a tserver's heartbeat varies. Normally the enclosed tablet report is “incremental” in that it only includes tablets that have undergone Raft configuration or role changes. In rarer circumstances the tablet report will be “full” and include every tablet. Importantly, incremental reports are “edge” triggered; that is, after a tablet is included in an incremental report N, it is omitted from incremental report N+1. Full tablet reports are sent when one of the following conditions is met:
Clients behave much like tservers: they are configured a priori with the entire master Raft configuration, must always communicate with the leader master, and will retry their requests until the new leader master is found.
One of the aforementioned in-memory caches keeps track of all known tservers and their "liveness" (i.e. how likely they are to be alive). This cache is NOT rebuilt using persistent master metadata; instead, it is updated whenever an unknown tserver heartbeats to the leader master. CreateTable() requests use this cache to determine whether a new table's replication requirements can be satisfied by the current cluster size; if not, the request is rejected.
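A minimal sketch of such an admission check, assuming hypothetical names and that the requirement is simply enough live tservers to host each replica on a distinct server:

```cpp
#include <cstddef>

// Sketch of the check CreateTable() performs against the in-memory tserver
// cache. Names are illustrative, not Kudu's actual API.
bool CanCreateTable(size_t live_tservers, int replication_factor) {
  // A tablet's replicas must land on distinct tservers, so the request is
  // rejected if fewer live tservers are known than replicas required.
  return live_tservers >= static_cast<size_t>(replication_factor);
}
```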
Right after a master election, the new leader master may have a cold tserver cache if it has never seen any heartbeats before. Until an entire heartbeat interval has elapsed, this cold cache may cause CreateTable() requests to fail:
Another in-memory cache that is NOT rebuilt using persistent master metadata is the per-tablet replica map. These maps are used to satisfy client GetTableLocations() and GetTabletLocations() requests. They're also used to determine the leader replica for various master-to-tserver RPC requests, such as in response to a client's AlterTable(). Each of these maps is updated during a tserver heartbeat using a tablet's latest Raft configuration.
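A sketch of how these maps might be refreshed from a heartbeat's tablet report, using hypothetical stand-in types:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real tablet report structures.
struct ReportedTablet {
  std::string tablet_id;
  std::vector<std::string> peer_uuids;  // latest Raft configuration
  std::string leader_uuid;              // current leader replica, if known
};

struct ReplicaInfo {
  std::string leader_uuid;
  std::vector<std::string> peer_uuids;
};

// tablet id -> replica locations; consulted by GetTableLocations() et al.
using ReplicaMap = std::map<std::string, ReplicaInfo>;

void ApplyTabletReport(const std::vector<ReportedTablet>& report,
                       ReplicaMap* replicas) {
  for (const ReportedTablet& t : report) {
    // Overwrite the cached locations with the tablet's latest Raft config.
    (*replicas)[t.tablet_id] = ReplicaInfo{t.leader_uuid, t.peer_uuids};
  }
}
```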
If a new leader master is elected and a tserver is already known to it (perhaps because the master had been leader before), every heartbeat it receives from that tserver will include an empty tablet report. This is problematic for the per-tablet replica maps, which will remain empty until tablets show up in a tablet report.
Empty per-tablet replica maps in an otherwise healthy leader master can cause a variety of failures:
As described earlier, the inclusion or exclusion of a tablet in an incremental tablet report is edge-triggered, and may result in a state-changing operation on the tserver, communicated via out-of-band RPC. This RPC is retried until it succeeds. However, if the leader master dies after responding to the tserver's heartbeat but before sending the out-of-band RPC, the edge-triggered tablet report is effectively missed, and the state-changing operation will not be performed until the next time the tablet is included in a tablet report. As the criteria for inclusion in a tablet report are narrow, operations may be "missed" for quite some time.
These operations include:
Some master operations will crash the master when replication fails. That's because they were implemented with local consensus in mind, wherein a replication failure is indicative of a disk failure and recovery is unlikely. With multiple masters and Raft consensus, replication may fail if the current leader master is no longer the leader master (e.g. it was partitioned from the rest of its Raft configuration, which promptly elected a new leader), raising the likelihood of a master crash.
It's not currently possible to add or remove a master from an active Raft configuration.
It would be nice to implement this for Kudu 1.0, but it's not a strict requirement.
Currently, follower masters reject all client operations, as clients are expected to communicate only with the leader master. As a performance optimization, followers could handle certain "safe" operations (i.e. read-only requests), but follower masters must do a better job of keeping their in-memory caches up-to-date before this change is made. Moreover, each operation should include an indication of how stale the follower master's information may be while still being acceptable to the client.
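For example, a read-only request might carry a client-specified staleness bound. The sketch below assumes hypothetical names (ReadOptions, FollowerCanServe) rather than Kudu's actual API:

```cpp
#include <chrono>

// Hypothetical shape of a follower-served read: the client states how stale
// the follower's view may be, and the follower rejects the request if its
// cache lags further behind the leader than that.
struct ReadOptions {
  std::chrono::milliseconds max_staleness{0};  // 0 = leader-only semantics
};

bool FollowerCanServe(std::chrono::milliseconds cache_lag,
                      const ReadOptions& opts) {
  return opts.max_staleness.count() > 0 && cache_lag <= opts.max_staleness;
}
```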
It would be nice to implement this for Kudu 1.0, but it's not a strict requirement.
(TBD)
TODO: JD says the code that detects the current master was partially removed from the Java client because it was buggy. Needs further investigation.
This is probably the most effective way to address KUDU-1358, and, as a side benefit, helps implement the “verify cluster connectivity” feature. This old design document describes KUDU-1358 and its solutions in more detail.
With this change, tservers no longer need to "follow the leader" as they will heartbeat to every master. However, a couple of things need to change for this to work correctly:
Basically, they're only intended to refresh the master's notion of the tserver's liveness; table and tablet information is still replicated from the leader master and should be ignored if found in the heartbeat.
That is, the master and/or tserver must enforce that all actions take effect iff they were sent by the master that is currently the leader.
After an exhaustive audit of all master state changes (see appendix A), it was determined that the current protection mechanisms built into each RPC are sufficient to provide fencing. The one exception is orphaned replica deletion done in response to a heartbeat. To protect against that, true orphans (i.e. tablets for which no persistent record exists) will not be deleted at all. As the master retains deleted table/tablet metadata in perpetuity, this should ensure that true orphans appear only under drastic circumstances, such as a tserver that heartbeats to the wrong cluster.
The following protection mechanisms are here for historical record; they will not be implemented.
One way to do this is by including the current term in every requested state change and heartbeat response. Each tserver maintains the current term in memory, updating it whenever a heartbeat response or RPC includes a later term (thus also serving as a "there is a new leader master" notification). If a state change request includes an older term, it is rejected. When a tserver first starts up, it initializes the current term with whatever term is indicated in the majority of the heartbeat responses. In this way it can protect itself from a "rogue master" at startup without having to persist the current term to disk.
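A minimal sketch of such a tracker, assuming hypothetical names:

```cpp
#include <cstdint>

// Sketch of the "current term" fencing described above, kept in tserver
// memory only (never persisted). Names are illustrative.
class MasterTermTracker {
 public:
  // Called with the term carried in every heartbeat response or master RPC.
  // Observing a later term doubles as a "there is a new leader master"
  // notification.
  void ObserveTerm(int64_t term) {
    if (term > current_term_) current_term_ = term;
  }

  // A state change request carrying an older term is rejected: it must have
  // come from a master that is no longer the leader.
  bool AcceptStateChange(int64_t request_term) const {
    return request_term >= current_term_;
  }

 private:
  int64_t current_term_ = 0;  // seeded at startup from a majority of masters
};
```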
An alternative to the above fencing protocol is to ensure that the leader master replicates via Raft before triggering a state change. It doesn't matter what is replicated; a successful replication asserts that this master is still the leader. However, our Raft implementation doesn't currently allow for replicating no-ops (i.e. log entries that needn't be persisted). Moreover, this is effectively an implementation of "leader leases" (in that a successful replication grants the leader a "lease" to remain leader for at least one Raft replication interval), but one that the rest of Kudu must be made aware of in order to be fully robust.
To address KUDU-1374, when a tserver passively detects that there's a new leader master (i.e. step #2 above), it should send it a full heartbeat. This will ensure that any heartbeat-triggered actions intended but not taken by the old leader master are reinitiated by the new one.
Kudu doesn't yet support atomic multi-row transactions, but all row operations bound for one tablet and batched into one Write RPC are combined into one logical transaction. This property is useful for multi-master support as all master metadata is encapsulated into a single tablet. With some refactoring, it is possible to ensure that any logical operation (e.g. creating a table) is encapsulated into a single RPC. Doing so would obviate the need for “roll forward” repair of partially replicated operations during metadata load and is necessary to address KUDU-495.
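Conceptually, a logical operation such as CreateTable would then be expressed as one batched write against the master tablet (a sketch with hypothetical types; the real implementation builds a WriteRequestPB):

```cpp
#include <string>
#include <vector>

// Row and BuildCreateTableWrite are hypothetical stand-ins for the real
// protobuf-based write path.
struct Row {
  std::string key;
  std::string payload;
};

std::vector<Row> BuildCreateTableWrite(
    const std::string& table_id,
    const std::vector<std::string>& tablet_ids) {
  std::vector<Row> batch;
  batch.push_back({"table/" + table_id, "serialized table record"});
  for (const std::string& tid : tablet_ids) {
    batch.push_back({"tablet/" + tid, "serialized tablet record"});
  }
  // All rows target the single master tablet, so replicating this batch in
  // one Write RPC makes the whole logical operation atomic; no roll-forward
  // repair is needed at metadata load time.
  return batch;
}
```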
Some repair is still necessary for table-wide operations. These complete on a tablet-by-tablet basis and thus it is possible for partially created, altered, or deleted tables to exist at any point in time. However, the repair is a natural course of action taken by the leader master:
All batched logical operations include one table entry and N tablet entries, where N is the number of tablets in the table. These entries are encapsulated in a WriteRequestPB that is replicated by the leader master to follower masters. When N is very large, it is conceivable for the WriteRequestPB to exceed the maximum size of a Kudu RPC. To determine just how likely this is, replication RPC sizes were measured during the creation of a table with 1000 tablets and a simple three-column schema. The results: the replication RPC clocked in at ~117 KB, a far cry from the 8 MB maximum RPC size. Thus, a batch-based approach should be workable for today's scale targets.
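Put differently, that measurement works out to roughly 120 bytes of replication payload per tablet entry (~117 KB / 1000 tablets), so the 8 MB ceiling would not be approached until a single table had on the order of 70,000 tablets.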
To fix KUDU-1353, the per-tablet replica locations could be removed entirely. The same information is already present in each tablet instance, just not in an easy-to-use map form. The only downside is that operations that previously used the cached locations would need to perform more lookups into the tserver map, to resolve tserver UUIDs into instances. We think this is a reasonable trade-off, however, as the tserver map should be hot.
An alternative is to rebuild the per-tablet replica locations on metadata load, but the outright removal of that cached data is a simpler solution.
The tserver cache could also be rebuilt, but:
Note: in-memory caches belonging to a former leader master will, by definition, contain stale information. These caches could be cleared following an election, but it shouldn't matter either way as this master is no longer servicing client requests or tserver heartbeats.
Master state change operations should adhere to the following contract:
Generally speaking, this contract is upheld universally. However, a detailed audit (see appendix B) of the master has revealed a few exceptions:
To prevent clients from seeing intermediate state and other potential issues, these operations must be made to adhere to the above contract.
To understand which master operations need to be fenced, we ask the following key questions:
We identified the set of potentially problematic external actions as those taken by the master during tablet reports.
We ruled out ChangeConfig; it is safe due to the use of CAS on the last change config opid (protects against two leader masters both trying to add a server), and because if the master somehow added a redundant server, in the worst case the new replica will be deleted the next time it heartbeats.
That left DeleteReplica, which is called under the following circumstances:
Like ChangeConfig, cases 3 and 4 are protected with a CAS. Cases 1 and 2 are not, but 2 falls into category #2 from earlier: if persistent state is consulted and the decision is made to delete a replica, that decision is correct and cannot become incorrect (i.e. under no circumstance would a tablet become “undeleted”).
That leaves case 1 as the only instance that needs additional fencing. We could implement leader leases as described earlier or “current term” checking to protect against it.
Or, we could 1) continue our current policy of retaining persistent state of deleted tables/tablets forever, and 2) change the master not to delete tablets for which it has no records. If we always have the persistent state for deleted tables, all instances of case 1 become case 2 unless there's some drastic problem (e.g. tservers are heartbeating to the wrong master), in which case not deleting the tablets is probably the right thing to do.
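In sketch form (with hypothetical helper names), the proposed policy amounts to:

```cpp
#include <string>

// Sketch of the proposed deletion policy: because deleted table/tablet
// metadata is retained forever, any reported tablet the master truly has
// no record of is left alone rather than deleted.
enum class TabletRecordState { kNone, kLive, kDeleted };

// Stub: in reality this consults the persistent master metadata.
TabletRecordState LookupTabletRecord(const std::string& tablet_id) {
  return TabletRecordState::kLive;
}

bool ShouldDeleteReplica(const std::string& tablet_id) {
  switch (LookupTabletRecord(tablet_id)) {
    case TabletRecordState::kDeleted:
      return true;   // case 2: persistent state says delete; always correct
    case TabletRecordState::kNone:
      return false;  // true orphan: likely a drastic problem (e.g. tserver
                     // heartbeating to the wrong cluster), so don't delete
    case TabletRecordState::kLive:
      return false;
  }
  return false;
}
```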
The following are detailed audits of how each master operation works today. The description of each operation is followed by a list of potential improvements, all of which were already incorporated into the above plan. These audits may be useful for understanding the plan.
Potential improvements:
Potential improvements:
Potential improvements:
No potential improvements
No potential improvements. One replication per tablet (as written) is OK because: