| <!--- |
| Licensed under the Apache License, Version 2.0 (the "License"); |
| you may not use this file except in compliance with the License. |
| You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| # Multi-master support for Kudu 1.0 |
| |
| ## Background |
| |
| Kudu's design avoids a single point of failure via multiple Kudu masters. |
| Just as with tablets, master metadata is persisted to disk and replicated |
| via Raft consensus, and so a deployment of **2N+1** masters can tolerate up |
| to **N** failures. |
| |
| By the time Kudu's first beta launched, support for multiple masters had |
| been implemented but was too fragile to be anything but experimental. The |
| rest of this document describes the gaps that must be filled before |
| multi-master support is ready for production, and lays out a plan for how |
| to fill them. |
| |
| ## Gaps in the master |
| |
| ### Current design |
| |
| At startup, a master Raft configuration will elect a leader master. The |
| leader master is responsible for servicing both tserver heartbeats as well |
| as client requests. The follower masters participate in Raft consensus and |
| replicate metadata, but are otherwise idle. Any heartbeats or client |
| requests they receive are rejected. |
| |
| All persistent master metadata is stored in a single replicated tablet. |
| Every row in this tablet represents either a table or a tablet. Table |
| records include unique table identifiers, the table's schema, and other bits |
| of information. Tablet records include a unique identifier, the tablet's |
| Raft configuration, and other information. |
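|
| To make the record layout concrete, here is a minimal sketch of what such
| records might contain. The field names are illustrative only; they are not
| the master's actual protobuf definitions.
|
| ```cpp
| #include <cstdint>
| #include <string>
| #include <vector>
|
| // Illustrative stand-ins for the master's persistent records.
| struct TableRecord {
|   std::string table_id;     // unique table identifier
|   std::string name;         // user-visible table name
|   std::string schema;       // serialized schema
|   uint32_t schema_version;  // bumped by AlterTable()
| };
|
| struct TabletRecord {
|   std::string tablet_id;           // unique tablet identifier
|   std::string table_id;            // owning table
|   std::vector<std::string> peers;  // UUIDs in the current Raft configuration
|   std::string leader_uuid;         // current leader replica, if known
| };
| ```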
| |
| What master metadata is replicated? |
| |
| 1. Table and tablet existence, via **CreateTable()** and **DeleteTable()**. |
| Every new tablet record also includes an initial Raft configuration. |
| 2. Schema changes, via **AlterTable()** and tserver heartbeats. |
| 3. Tablet server Raft configuration changes, via tserver heartbeats. These |
| include both the list of peers (may have changed due to |
| under-replication) as well as the current leader (may have changed due to |
| an election). |
| |
| Scanning the master tablet for every heartbeat or request would be slow, |
| so the leader master caches all master metadata in memory. The caches are |
| only updated after a metadata change is successfully replicated; in this |
| way they are always consistent with the on-disk tablet. When a new leader |
| master is elected, it scans the entire master tablet and uses the metadata |
| to rebuild its in-memory caches. |
| |
| To understand how the in-memory caches work, let's start with the different |
| kinds of information that are cached: |
| |
| 1. Tablet server instance. Is uniquely identified by the tserver's UUID and |
| includes general metadata about the tserver, how recently this master |
| received a heartbeat from the tserver, and proxy objects that the master |
| uses to communicate with the tserver. |
| 2. Table instance. Is uniquely identified by the table's ID and includes |
| general metadata about the table. |
| 3. Tablet instance. Is uniquely identified by the tablet's ID and includes |
| general metadata about the tablet along with its current replicas. |
| |
| Now, let's describe the various data structures that store this information: |
| |
| 1. Global: map of tserver UUID to tserver instance. |
| 2. Global: map of table ID to table instance. All tables are present, even |
| deleted tables. |
| 3. Global: map of table name to table instance. Deleted tables are omitted, |
| otherwise a new table could collide with a deleted one. |
| 4. Global: map of tablet ID to tablet instance. All tablets are present, |
| even deleted ones. |
| 5. Per-table instance: map of tablet ID to tablet instance. Deleted tablets |
| are retained here but are not inserted the next time master metadata is |
| reloaded from disk. |
| 6. Per-tablet instance: map of tserver UUID to tablet replica. |
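|
| A rough sketch of how these structures fit together follows; the names and
| types are illustrative, and the reference counting and locking used by the
| real code are omitted.
|
| ```cpp
| #include <map>
| #include <memory>
| #include <string>
| #include <unordered_map>
|
| // Illustrative stand-ins; the real classes carry much more state.
| struct TSDescriptor { std::string uuid; };   // tablet server instance
| struct TabletReplica { std::string role; };  // one replica of a tablet
|
| struct TabletInfo {
|   // 6. Per-tablet instance: tserver UUID -> tablet replica.
|   std::unordered_map<std::string, TabletReplica> replica_locations;
| };
|
| struct TableInfo {
|   // 5. Per-table instance: tablet ID -> tablet instance.
|   std::map<std::string, std::shared_ptr<TabletInfo>> tablet_map;
| };
|
| struct MasterCaches {
|   // 1. tserver UUID -> tserver instance.
|   std::unordered_map<std::string, std::shared_ptr<TSDescriptor>> ts_descs;
|   // 2. table ID -> table instance (deleted tables included).
|   std::unordered_map<std::string, std::shared_ptr<TableInfo>> tables_by_id;
|   // 3. table name -> table instance (deleted tables omitted).
|   std::unordered_map<std::string, std::shared_ptr<TableInfo>> tables_by_name;
|   // 4. tablet ID -> tablet instance (deleted tablets included).
|   std::unordered_map<std::string, std::shared_ptr<TabletInfo>> tablets_by_id;
| };
| ```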
| |
| With few exceptions (detailed below), these caches behave more or less as |
| one would expect. For example, a successful **CreateTable()** yields new
| table, tablet, and replica instances, all of which are mapped accordingly. A
| heartbeat from a tserver may update a tablet's replica map if that tablet |
| elected a new leader replica. And so on. |
| |
| All tservers start up with location information for the entire master Raft |
| configuration, but are only responsible for heartbeating to the leader |
| master. Prior to the first heartbeat, a tserver must determine which of the |
| masters is the leader master. After that, the tserver will send heartbeats |
| to that master until such a time that it fails or steps down, at which point |
| the tserver must determine the new leader master and the cycle |
| repeats. Follower masters ignore all heartbeats. |
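|
| A highly simplified sketch of that loop follows; the function and type names
| are hypothetical, not the tserver's actual heartbeater interface.
|
| ```cpp
| #include <string>
| #include <vector>
|
| // Hypothetical stand-ins for the real heartbeater plumbing.
| struct HeartbeatResponse { bool rejected_not_leader = false; };
| int FindLeaderMaster(const std::vector<std::string>& masters) {
|   return 0;  // stub: would probe each master for its current role
| }
| HeartbeatResponse SendHeartbeat(const std::string& master_addr) { return {}; }
| void SleepForHeartbeatInterval() {}
|
| void HeartbeatLoop(const std::vector<std::string>& master_addrs) {
|   // Prior to the first heartbeat, determine which master is the leader.
|   int leader_idx = FindLeaderMaster(master_addrs);
|   while (true) {
|     HeartbeatResponse resp = SendHeartbeat(master_addrs[leader_idx]);
|     if (resp.rejected_not_leader) {
|       // The leader failed or stepped down; find the new leader and repeat.
|       leader_idx = FindLeaderMaster(master_addrs);
|     }
|     SleepForHeartbeatInterval();
|   }
| }
| ```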
| |
| The information communicated in a tserver's heartbeat varies. Normally the |
| enclosed tablet report is "incremental" in that it only includes tablets |
| that have undergone Raft configuration or role changes. In rarer |
| circumstances the tablet report will be "full" and include every |
| tablet. Importantly, incremental reports are "edge" triggered; that is, |
| after a tablet is included in an incremental report **N**, it is omitted |
| from incremental report **N+1**. Full tablet reports are sent when one of |
| the following conditions is met: |
| |
| 1. The master requests it in the previous heartbeat response, because the |
| master is seeing this tserver for the first time. |
| 2. The tserver has just started. |
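|
| One way to picture the edge-triggered behavior described above is a "dirty
| tablets" set that is cleared once a report has been acknowledged. The sketch
| below is illustrative and does not reflect the tserver's actual bookkeeping.
|
| ```cpp
| #include <set>
| #include <string>
| #include <vector>
|
| struct TabletReport {
|   bool is_incremental = true;
|   std::vector<std::string> tablet_ids;
| };
|
| class ReportBuilder {
|  public:
|   // Called whenever a tablet undergoes a Raft configuration or role change.
|   void MarkDirty(const std::string& tablet_id) { dirty_.insert(tablet_id); }
|
|   // Builds incremental report N; once that report is acknowledged, the same
|   // tablets are omitted from incremental report N+1.
|   TabletReport BuildIncremental() {
|     TabletReport r;
|     r.tablet_ids.assign(dirty_.begin(), dirty_.end());
|     return r;
|   }
|
|   void ReportAcknowledged() { dirty_.clear(); }
|
|  private:
|   std::set<std::string> dirty_;
| };
| ```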
| |
| Clients behave much like tservers: they are configured a priori with the |
| entire master Raft configuration, must always communicate with the leader |
| master, and will retry their requests until the new leader master is found. |
| |
| ### Known issues |
| |
| #### KUDU-1358: **CreateTable()** may fail following a master election |
| |
| One of the aforementioned in-memory caches keeps track of all known tservers |
| and their "liveness" (i.e. how likely they are to be alive). This cache is |
| *NOT* rebuilt using persistent master metadata; instead, it is updated |
| whenever an unknown tserver heartbeats to the leader master. |
| **CreateTable()** requests use this cache to determine whether a new table |
| can be satisfied using the current cluster size; if not, the request is |
| rejected. |
| |
| Right after a master election, the new leader master may have a cold tserver
| cache if it's never seen any heartbeats before. Until an entire heartbeat
| interval has elapsed, this cold cache can cause **CreateTable()** requests to
| misbehave (a sketch of the check at fault follows the list):
| |
| 1. If insufficient tservers are known to the master, the request will fail. |
| 2. If not all tservers are known to the master, the cluster may become |
| unbalanced as tablets pile up on a particular subset of tservers. |
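|
| The check at fault is roughly of the following shape; this is a paraphrase,
| not the actual CreateTable() code.
|
| ```cpp
| #include <string>
| #include <vector>
|
| struct TSDescriptor { std::string uuid; };
|
| // If the (possibly cold) tserver cache knows about fewer live tservers than
| // the requested replication factor, the CreateTable() request is rejected.
| bool CanCreateTable(const std::vector<TSDescriptor>& live_tservers,
|                     int num_replicas, std::string* error) {
|   if (static_cast<int>(live_tservers.size()) < num_replicas) {
|     *error = "not enough live tablet servers for the requested replication factor";
|     return false;
|   }
|   return true;
| }
| ```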
| |
| #### KUDU-1353: **AlterTable()** may get stuck |
| |
| Another in-memory cache that is *NOT* rebuilt using persistent master |
| metadata is the per-tablet replica map. These maps are used to satisfy |
| client **GetTableLocations()** and **GetTabletLocations()** |
| requests. They're also used to determine the leader replica for various |
| master to tserver RPC requests, such as in response to a client's |
| **AlterTable()**. Each of these maps is updated during a tserver heartbeat |
| using a tablet's latest Raft configuration. |
| |
| If a new leader master is elected and a tserver is already known to it |
| (perhaps because the master had been leader before), every heartbeat it |
| receives from that tserver will include an empty tablet report. This is |
| problematic for the per-tablet replica maps, which will remain empty until |
| tablets show up in a tablet report. |
| |
| Empty per-tablet replica maps in an otherwise healthy leader master can |
| cause a variety of failures: |
| |
| 1. **AlterTable()** requests may time out. |
| 2. **GetTableLocations()** and **GetTabletLocations()** requests may yield |
| no replica location information. |
| 3. **DeleteTable()** requests will succeed but won't delete any tablets |
| from tservers until those tablets find their way into tablet reports. |
| |
| #### KUDU-1374: Operations triggered by heartbeats may go unperformed |
| |
| As described earlier, the inclusion or exclusion of a tablet in an |
| incremental tablet report is edge-triggered, and may result in a state |
| changing operation on the tserver, communicated via out-of-band RPC. This |
| RPC is retried until it is successful. However, if the leader master dies |
| *after* it is able to respond to the tserver's heartbeat but *before* the |
| out-of-band RPC is sent, the edge-triggered tablet report may be missed, and |
| the state changing operation will not be performed until the next time the |
| tablet is included in a tablet report. Because the criteria for including a
| tablet in a report are narrow, operations may be "missed" for quite some time.
| |
| These operations include:
|
| 1. Some tablet deletions, such as tablets belonging to orphaned tables, or |
| tablets whose deletion RPCs were sent and failed during an earlier |
| **DeleteTable()** request. |
| 2. Some tablet alters, such as tablets whose alter RPCs were sent and failed |
| during an earlier **AlterTable()** request. |
| 3. Config changes sent due to under-replicated tablets. |
| |
| #### KUDU-495: Masters may abort if replication fails |
| |
| Some master operations will crash the master when replication fails. That's |
| because they were implemented with local consensus in mind, wherein a |
| replication failure is indicative of a disk failure and recovery is |
| unlikely. With multiple masters and Raft consensus, replication may fail if |
| the current leader master is no longer the leader master (e.g. it was |
| partitioned from the rest of its Raft configuration, which promptly elected |
| a new leader), raising the likelihood of a master crash. |
| |
| ### Unimplemented features |
| |
| #### Master Raft configuration changes |
| |
| It's not currently possible to add or remove a master from an active Raft |
| configuration. |
| |
| It would be nice to implement this for Kudu 1.0, but it's not a strict |
| requirement. |
| |
| #### KUDU-500: allow followers to handle read-only client operations |
| |
| Currently followers reject any client operations as the expectation is that |
| clients communicate with the leader master. As a performance optimization, |
| followers could handle certain "safe" operations (i.e. read-only requests), |
| but follower masters must do a better job of keeping their in-memory caches |
| up-to-date before this change is made. Moreover, operations should include |
| an indication of how stale the follower master's information is allowed to |
| be for it to be considered acceptable by the client. |
| |
| It would be nice to implement this for Kudu 1.0, but it's not a strict |
| requirement. |
| |
| ## Gaps in the clients |
| |
| (TBD) |
| |
| TODO: JD says the code that detects the current master was partially removed |
| from the Java client because it was buggy. Needs further investigation. |
| |
| ## Goals |
| |
| 1. The plan must address all of the aforementioned known issues. |
| 2. If possible, the plan should implement the missing features too. |
| 3. The plan's scope should be minimized, provided that doesn't complicate |
| future implementations. In other words, the plan should not force |
| backwards incompatibilities down the line. |
| |
| ## Plan |
| |
| ### Heartbeat to all masters |
| |
| This is probably the most effective way to address KUDU-1358, and, as a side |
| benefit, helps implement the "verify cluster connectivity" feature. [This |
| old design document](old-multi-master-heartbeating.md) describes KUDU-1358 |
| and its solutions in more detail. |
| |
| With this change, tservers no longer need to "follow the leader" as they will |
| heartbeat to every master. However, a couple of things need to change for this
| to work correctly: |
| |
| #### Follower masters must process heartbeats, at least in part |
| |
| On follower masters, heartbeats are only used to refresh the master's notion
| of the tserver's liveness; table and tablet information is still replicated
| from the leader master and should be ignored if found in the heartbeat.
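|
| In other words, a follower's heartbeat handler might look roughly like the
| sketch below, assuming liveness tracking is the only follower-side effect;
| the names are hypothetical.
|
| ```cpp
| #include <string>
|
| struct TSHeartbeatRequest {
|   std::string ts_uuid;
|   bool has_tablet_report = false;
| };
|
| class Master {
|  public:
|   void HandleHeartbeat(const TSHeartbeatRequest& req) {
|     // Every master, leader or follower, refreshes its notion of this
|     // tserver's liveness.
|     UpdateTServerLiveness(req.ts_uuid);
|
|     // Table and tablet state reaches followers via Raft replication, so
|     // only the leader acts on the enclosed tablet report.
|     if (is_leader_ && req.has_tablet_report) {
|       ProcessTabletReport(req);
|     }
|   }
|
|  private:
|   void UpdateTServerLiveness(const std::string& uuid) { /* record heartbeat time */ }
|   void ProcessTabletReport(const TSHeartbeatRequest& req) { /* leader-only work */ }
|   bool is_leader_ = false;
| };
| ```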
| |
| #### All state changes taken by a tserver must be fenced |
| |
| That is, the master and/or tserver must enforce that all actions take effect |
| iff they were sent by the master that is currently the leader. |
| |
| After an exhaustive audit of all master state changes (see appendix A), it |
| was determined that the current protection mechanisms built into each RPC |
| are sufficient to provide fencing. The one exception is orphaned replica |
| deletion done in response to a heartbeat. To protect against that, true |
| orphans (i.e. tablets for which no persistent record exists) will not be |
| deleted at all. As the master retains deleted table/tablet metadata in |
| perpetuity, this should ensure that true orphans appear only under drastic |
| circumstances, such as a tserver that heartbeats to the wrong cluster. |
| |
| The following protection mechanisms are described here for the historical
| record; they will not be implemented.
| |
| ##### Alternative fencing mechanisms |
| |
| One way to do this is by including the current term in every requested state |
| change and heartbeat response. Each tserver maintains the current term in
| memory, reset whenever a heartbeat response or RPC includes a later term |
| (thus also serving as a "there is a new leader master" notification). If a |
| state change request includes an older term, it is rejected. When a tserver |
| first starts up, it initializes the current term with whatever term is |
| indicated in the majority of the heartbeat responses. In this way it |
| can protect itself from a "rogue master" at startup without having to |
| persist the current term to disk. |
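|
| A sketch of that term check on the tserver side follows; the class and
| member names are hypothetical.
|
| ```cpp
| #include <cstdint>
|
| class MasterTermTracker {
|  public:
|   // Called for every heartbeat response or master RPC that carries a term.
|   // A later term also serves as the "there is a new leader master" signal.
|   void ObserveTerm(int64_t term) {
|     if (term > current_term_) current_term_ = term;
|   }
|
|   // A state change request carrying an older term comes from a master that
|   // is no longer the leader, so it is rejected.
|   bool ShouldAccept(int64_t request_term) const {
|     return request_term >= current_term_;
|   }
|
|  private:
|   int64_t current_term_ = 0;  // held in memory only, never persisted
| };
| ```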
| |
| An alternative to the above fencing protocol is to ensure that the leader |
| master replicates via Raft before triggering a state change. It doesn't |
| matter what is replicated; a successful replication asserts that this master |
| is still the leader. However, our Raft implementation doesn't currently |
| allow for replicating no-ops (i.e. log entries that needn't be persisted). |
| Moreover, this is effectively an implementation of "leader leases" (in that |
| a successful replication grants the leader a "lease" to remain leader for at |
| least one Raft replication interval), but one that the rest of Kudu must be |
| made aware of in order to be fully robust. |
| |
| ### Send full heartbeats to newly elected leader masters |
| |
| To address KUDU-1374, when a tserver passively detects that there's a new |
| leader master (i.e. step #2 above), it should send it a full heartbeat. This |
| will ensure that any heartbeat-triggered actions intended but not taken by |
| the old leader master are reinitiated by the new one. |
| |
| ### Ensure each logical operation is replicated as one batch |
| |
| Kudu doesn't yet support atomic multi-row transactions, but all row |
| operations bound for one tablet and batched into one Write RPC are combined |
| into one logical transaction. This property is useful for multi-master |
| support as all master metadata is encapsulated into a single tablet. With |
| some refactoring, it is possible to ensure that any logical operation |
| (e.g. creating a table) is encapsulated into a single RPC. Doing so would |
| obviate the need for "roll forward" repair of partially replicated |
| operations during metadata load and is necessary to address KUDU-495. |
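|
| Conceptually, the refactoring amounts to building one write batch that
| covers every row touched by the logical operation and replicating that
| batch as a unit. The sketch below uses made-up types rather than the
| master's actual write path.
|
| ```cpp
| #include <string>
| #include <vector>
|
| // Made-up stand-ins for rows in the master tablet.
| struct MasterRow { std::string key; std::string value; };
| struct WriteBatch { std::vector<MasterRow> rows; };
|
| bool ReplicateViaRaft(const WriteBatch& batch) {
|   return true;  // stub: one Raft consensus round in the real master
| }
|
| // Creating a table writes the table record and all of its tablet records in
| // one batch, so the logical operation is either fully replicated or not at all.
| bool ReplicateCreateTable(const MasterRow& table_row,
|                           const std::vector<MasterRow>& tablet_rows) {
|   WriteBatch batch;
|   batch.rows.push_back(table_row);
|   batch.rows.insert(batch.rows.end(), tablet_rows.begin(), tablet_rows.end());
|   return ReplicateViaRaft(batch);
| }
| ```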
| |
| Some repair is still necessary for table-wide operations. These complete on
| a tablet-by-tablet basis, so it is possible for partially created,
| altered, or deleted tables to exist at any point in time. However, the
| repair is a natural course of action taken by the leader master: |
| |
| 1. During a tablet report: for example, if the master sees a tablet without |
| the latest schema, it'll send that tablet an idempotent alter RPC. |
| Further, thanks to the above full heartbeat change, the new leader master |
| will have an opportunity to roll forward such tables on master failover. |
| 2. In the background: the master will periodically scan its in-memory state |
| looking for tablets that have yet to be reported. If it finds one, it |
| will be given a "nudge"; the master will send a create tablet RPC to the |
| appropriate tserver. This scanning continues in a new leader master |
| following master failover. |
| |
| All batched logical operations include one table entry and *N* tablet |
| entries, where *N* is the number of tablets in the table. These entries are |
| encapsulated in a WriteRequestPB that is replicated by the leader master to |
| follower masters. When *N* is very large, it is conceivable for the |
| WriteRequestPB to exceed the maximum size of a Kudu RPC. To determine just |
| how likely this is, replication RPC sizes were measured in the creation of a |
| table with 1000 tablets and a simple three-column schema. The results: the |
| replication RPC clocked in at ~117 KB, a far cry from the 8 MB maximum RPC |
| size. Thus, a batch-based approach should not be unreasonable for today's |
| scale targets. |
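| (At roughly 120 bytes of replicated state per tablet, a table would need on
| the order of 70,000 tablets before its creation batch approached the 8 MB
| limit.)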
| |
| ### Remove some unnecessary in-memory caches |
| |
| To fix KUDU-1353, the per-tablet replica locations could be removed entirely. |
| The same information is already present in each tablet instance, just not in
| an easy-to-use map form. The only downside is that operations that previously
| used the cached locations would need to perform more lookups into the tserver |
| map, to resolve tserver UUIDs into instances. We think this is a reasonable |
| trade-off, however, as the tserver map should be hot. |
| |
| An alternative is to rebuild the per-tablet replica locations on metadata |
| load, but the outright removal of that cached data is a simpler solution. |
| |
| The tserver cache could also be rebuilt, but: |
| |
| 1. It will be incomplete, as only the last known RPC address (and none of |
| the HTTP addresses) is persisted. Additionally, addressing KUDU-418 may |
| require the removal of the last known RPC address from the persisted |
| state, at which point there's nothing worth rebuilding. |
| 2. The tserver cache is expected to be warm from the moment any master |
| (including followers) starts up due to "Heartbeat to all masters" above. |
| |
| Note: in-memory caches belonging to a former leader master will, by |
| definition, contain stale information. These caches could be |
| cleared following an election, but it shouldn't matter either way as this |
| master is no longer servicing client requests or tserver heartbeats. |
| |
| ### Ensure strict ordering for all state changes |
| |
| Master state change operations should adhere to the following contract: |
| |
| 1. Acquire locks on the relevant table and/or tablets. Which locks and |
| whether the locks are held for reading or writing depends on the |
| operation. For example, **DeleteTable()** acquires locks for writing on |
| the table and all of its tablets. Table locks must be acquired before |
| tablet locks. |
| 2. Mutate in-memory state belonging to the table and/or tablets. These |
| mutations are made via COW so that concurrent readers continue to see |
| only "committed" data. |
| 3. Replicate all of the mutations via Raft consensus in a single batch. If |
| the replication fails, the overall operation fails, the mutations are |
| discarded, and the locks are released. |
| 4. Replication has succeeded; the operation may not fail beyond this point. |
| 5. Commit the mutations. If both table and tablet mutations are committed, |
| tablet mutations must come first. Any other in-memory changes must be |
| performed now. |
| 6. The success of the operation is now consistent on this master in both |
| on-disk state as well as in-memory state. |
| 7. Send RPCs to the tservers (e.g. **DeleteTable()** will now send delete |
| tablet RPCs to each tablet in the table). The work done by an RPC must |
| take place, either by retrying the RPC until it succeeds, or through some |
| other mechanism, such as sending a new RPC in response to a heartbeat. |
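|
| A skeleton of an operation that follows this contract is sketched below; the
| lock, COW, and consensus primitives are hypothetical stand-ins for the
| master's real ones.
|
| ```cpp
| #include <string>
|
| // Hypothetical stand-ins for the master's primitives.
| struct TableLock { /* write lock on the table, then on its tablets */ };
| struct PendingMutations { /* COW edits not yet visible to readers */ };
| bool ReplicateViaRaft(const PendingMutations& m) { return true; }  // step 3 stand-in
| void CommitTabletsThenTable(PendingMutations* m) {}                // step 5 stand-in
| void SendTabletServerRpcs() {}                                     // step 7 stand-in
|
| bool DeleteTableOp(const std::string& table_id) {
|   TableLock lock;              // 1. acquire locks: table first, then tablets
|   PendingMutations mutations;  // 2. mutate in-memory state via COW
|
|   if (!ReplicateViaRaft(mutations)) {
|     return false;  // 3. replication failed: discard mutations, release locks
|   }
|   // 4. replication succeeded; the operation may not fail past this point.
|   CommitTabletsThenTable(&mutations);  // 5./6. commit, tablets before table
|   SendTabletServerRpcs();              // 7. e.g. DeleteTablet() RPCs, retried until done
|   return true;
| }
| ```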
| |
| Generally speaking, this contract is upheld throughout the master. However, a
| detailed audit (see appendix B) revealed a few exceptions:
| |
| 1. During **CreateTable()**, intermediate table and tablet state is made |
| visible prior to replication. |
| 2. During **DeleteTable()**, RPCs are sent prior to the committing of |
| mutations. |
| 3. When background scanning for newly created tablets, the logic that |
| "replaces" timed out tablets makes the replacements visible prior to |
| replication. |
| |
| To prevent clients from seeing intermediate state, and to avoid other
| potential issues, these operations must be made to adhere to the above contract.
| |
| ## Appendix A: Fencing audit and discussion |
| |
| To understand which master operations need to be fenced, we ask the following |
| key questions: |
| |
| 1. What decisions can a master make unilaterally (i.e. without first |
| replicating so as to establish consensus)? |
| 2. When making such decisions, does the master at least consult past replicated |
| state first? If it does, would "stale" state yield incorrect decisions? |
| 3. If it doesn't (or can't) consult replicated state, are the external actions |
| performed as a result of the decision safe? |
| |
| We identified the set of potentially problematic external actions as those taken |
| by the master during tablet reports. |
| |
| We ruled out ChangeConfig; it is safe due to the use of CAS on the last change |
| config opid (protects against two leader masters both trying to add a server), |
| and because if the master somehow added a redundant server, in the worst case |
| the new replica will be deleted the next time it heartbeats. |
| |
| That left DeleteReplica, which is called under the following circumstances: |
| |
| 1. When the master can't find any record of the replica's tablet or its table, |
| it is deleted. |
| 2. When the persistent state says that the replica (or its table) has been |
| deleted, it is deleted. |
| 3. When the persistent state says that the replica is no longer part of the Raft |
| config, it is deleted. |
| 4. When the persistent state includes replicas that aren't in the latest Raft |
| config, they are deleted. |
| |
| Like ChangeConfig, cases 3 and 4 are protected with a CAS. Cases 1 and 2 are |
| not, but 2 falls into category #2 from earlier: if persistent state is consulted |
| and the decision is made to delete a replica, that decision is correct and |
| cannot become incorrect (i.e. under no circumstance would a tablet become |
| "undeleted"). |
| |
| That leaves case 1 as the only instance that needs additional fencing. We could |
| implement leader leases as described earlier or "current term" checking to |
| protect against it. |
| |
| Or, we could 1) continue our current policy of retaining persistent state of |
| deleted tables/tablets forever, and 2) change the master not to delete |
| tablets for which it has no records. If we always have the persistent state |
| for deleted tables, all instances of case 1 become case 2 unless there's |
| some drastic problem (e.g. tservers are heartbeating to the wrong master), |
| in which case not deleting the tablets is probably the right thing to do. |
| |
| ## Appendix B: Master operation audit and analysis |
| |
| The following are detailed audits of how each master operation works today. |
| The description of each operation is followed by a list of potential |
| improvements, all of which were already incorporated into the above |
| plan. These audits may be useful in understanding the plan.
| |
| ### Creating a new table |
| |
| #### CreateTable() RPC |
| |
| 1. create table in state UNKNOWN, begin mutation to state PREPARING |
| 2. create tablets in state UNKNOWN, begin mutation to state PREPARING |
| 3. update in-memory maps (table by id/name, tablet by id) |
| - new table and tablets are now visible with UNKNOWN state and no |
| useful metadata (table schema, name, tablet partitions, etc.) |
| 4. replicate new tablets |
| 5. change table from state PREPARING to RUNNING |
| 6. replicate new table |
| 7. commit mutation for table |
|    - new table and tablets are visible, table with full metadata in RUNNING
| state but tablets still in UNKNOWN state without metadata |
| 8. commit mutations for tablets |
| - new table and tablets are visible with full metadata, table in |
| RUNNING state, tablets in PREPARING state (without consensus state) |
| 9. wake up bg_tasks thread |
| |
| Potential improvements: |
| |
| 1. The two replications can be safely combined into one replication |
| 2. Consumers of table and tablet in-memory state must be prepared for |
| UNKNOWN intermediate state without metadata, as well as lack of consensus |
| state. This can be addressed by: |
| 1. Adding a new global in-memory unordered_set to "lock" a table's name |
| while creating it. When the create is finished (regardless of |
| success), the name is unlocked. We could even reuse the tablet |
| LockManager for this, as it has no real tablet dependencies |
| 2. Moving step #3 to after step #7 |
| |
| #### Background task scanning |
| |
| 1. for each tablet t in state PREPARING: |
| - change t from state PREPARING to CREATING |
| 2. for each tablet t_old in state CREATING: |
| - if t_old timed out: |
| 1. create tablet t_new in state UNKNOWN, begin mutation to state PREPARING |
| 2. add t_new to table's tablet_map |
| - t_new is now visible when operating on all of table's tablets, but |
| is in UNKNOWN state and has no useful metadata |
| 3. update in-memory tablet_by_id map with t_new |
|       - t_new is now visible to tablet_by_id lookups, still UNKNOWN and
| no useful metadata |
| 4. change t_old from state CREATING to REPLACED |
| 5. change t_new from state PREPARING to CREATING |
| 3. for each tablet t in state CREATING: |
| - reset t committed_consensus_state: |
| 1. reset term to min term |
| 2. reset consensus type to local or distributed |
| 3. reset index to invalid op index |
| 4. select peers |
| 4. replicate new and existing tablets |
| 5. if error in replica selection or replication: |
| 1. update each table's in-memory tablet_map to remove t_new tablets (step 2) |
|       - t_new tablets still visible to tablet_by_id lookups
| 2. update tablet_by_id map to remove t_new tablets (step 2) |
| - t_new tablets no longer visible |
| 6. send DeleteTablet() RPCs for all t_old tablets (step 2) |
| 7. send CreateTablet() RPCs for all created tablets |
| 8. commit mutations for new and existing tablets |
| - replacement tablets from step #2 now have full metadata, all tablets now |
| have visible consensus state |
| |
| Potential improvements: |
| |
| 1. All replication is already atomic; nothing to change here |
| 2. Steps 6 and 7 should probably be reordered after Step 8 |
| 3. t_new can expose intermediate state to consumers, much like in
|    CreateTable(). Is this safe? If not, it could be addressed as follows:
| 1. Remove step #5 |
| 2. After step #8 (but before RPCs), insert steps #2b and #2c |
| |
| ### Deleting a table |
| |
| #### DeleteTable() RPC |
| |
| 1. change table from state RUNNING (or ALTERING) to REMOVED |
| 2. replicate table |
| 3. commit mutation for table |
| - new state (REMOVED) is now visible |
| 4. for each tablet t: |
| 1. Send DeleteTablet() RPC for t |
| 2. change t from state RUNNING to DELETED |
| 3. replicate t |
| 4. commit mutation for t |
| - new state (DELETED) is now visible |
| |
| Potential improvements: |
| |
| 1. Table and tablet replications can be safely combined, provided we invert |
|    the commit order and make sure the rest of the master is OK with that
| 2. DeleteTablet() RPC should come after tablet mutation is committed |
| |
| ### Altering a table |
| |
| #### AlterTable() RPC |
| |
| 1. if requested change to table name: |
| - change name |
| 2. if requested change to table schema: |
| 1. reset fully_applied_schema with current schema |
| 2. reset current schema |
| 3. increment schema version |
| 4. increment next column id |
| 5. change table from state RUNNING to ALTERING |
| 6. replicate table |
| 7. commit mutation to table |
| - new state (ALTERING), schema, etc. are now visible |
| 8. Send AlterTablet() RPCs for each tablet |
| |
| No potential improvements |
| |
| ### Heartbeating |
| |
| #### Heartbeat for tablet t |
| |
| 1. if t fails by_id lookup: |
| - send DeleteTablet() RPC for t |
| 2. if t's table does not exist: |
| - send DeleteTablet() RPC for t |
| 3. if t is REMOVED or REPLACED, or t's table is DELETED: |
| - send DeleteTablet() RPC for t |
| 4. if t is no longer in the list of peers: |
| - send DeleteTablet() RPC for t |
| 5. if t CREATING and has leader: |
| - change t from CREATING to RUNNING |
| 6. change consensus state |
| 1. reset committed_consensus_state (if it exists in report) |
| 2. update tablet replica locations |
| - new replica locations now immediately visible |
| 3. if t is no longer in the list of peers: |
| - send DeleteTablet() RPC for t |
|    4. if tablet is under-replicated:
| - send ConfigChange() RPC for t |
| 7. replicate t
| 8. update t
| 9. commit mutation for t |
| - new state and committed consensus state are now visible |
| 10. if t reported schema version isn't latest: |
| - send AlterTablet() RPC for t |
| 11. else: |
| 1. update tablet schema version |
| - new schema version is immediately visible |
| 2. if table is in state ALTERING and all tablets have latest schema version: |
| 1. clear fully_applied_schema |
| 2. change table from state ALTERING to RUNNING |
| 3. replicate table |
| 4. commit mutation for table |
| - new state (RUNNING) and empty fully_applied_schema now visible |
| |
| No potential improvements. One replication per tablet (as written) is OK |
| because: |
| |
| 1. If replication fails but master still alive: tserver will retry same |
| kind of heartbeat, giving master a chance to replicate again |
| 2. If replication fails and master dies: with proposed "send full heartbeat |
| on new leader master" change, failed replications will be retried by new |
| leader master |