The Catalog Manager keeps track of the Kudu tables and tablets defined by the user in the cluster.
All the table and tablet information is stored in-memory in copy-on-write TableInfo / TabletInfo objects, as well as on-disk, in the “sys.catalog” Kudu system table hosted only on the Masters. This system table is loaded into memory on Master startup. At the time of this writing, the “sys.catalog” table consists of only a single tablet in order to provide strong consistency for the metadata under Raft replication (as currently, each tablet has its own log).
To add or modify a table or tablet, the Master writes, but does not yet commit the changes to memory, then writes and flushes the system table to disk, and then makes the changes visible in-memory (commits them) if the disk write (and, in a distributed master setup, config-based replication) is successful. This allows readers to access the in-memory state in a consistent way, even while a write is in-progress.
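The commit-after-flush sequence above can be sketched as follows. This is a simplified illustration, not Kudu's actual CowObject API; the type and function names are hypothetical, and a single writer is assumed:

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical, simplified copy-on-write cell: readers always see the last
// committed state; the writer mutates a private "dirty" copy and swaps it in
// only after the on-disk write succeeds.
struct TableMetadata {
  std::string name;
  int version = 0;
};

class CowTableInfo {
 public:
  // Readers briefly take the lock and copy the last committed state.
  TableMetadata Read() const {
    std::lock_guard<std::mutex> l(lock_);
    return committed_;
  }

  // Start a mutation: the committed state is copied into a private dirty copy.
  TableMetadata* StartMutation() {
    std::lock_guard<std::mutex> l(lock_);
    dirty_ = committed_;
    return &dirty_;
  }

  // Called only after the sys.catalog disk write (and replication) succeeded.
  void CommitMutation() {
    std::lock_guard<std::mutex> l(lock_);
    committed_ = dirty_;
  }

 private:
  mutable std::mutex lock_;
  TableMetadata committed_;  // what readers see
  TableMetadata dirty_;      // the single writer's in-progress copy
};

// Write path: mutate the dirty copy, persist to disk, then commit in-memory.
bool RenameTable(CowTableInfo* info, const std::string& new_name,
                 bool disk_write_ok) {
  TableMetadata* dirty = info->StartMutation();
  dirty->name = new_name;
  dirty->version++;
  if (!disk_write_ok) return false;  // dirty copy abandoned; never visible
  info->CommitMutation();
  return true;
}
```

Note that a failed disk write leaves the committed state untouched, so concurrent readers never observe the half-applied change.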
This design prevents having to go through the whole scan path to service tablet location calls, which would be more expensive, and allows for easily keeping “soft” state in the Master for every Table and Tablet.
The catalog manager maintains three hash maps for looking up info in the sys table:
The TableInfo has a map [tablet-start-key] -> TabletInfo used to provide the tablet locations to the user based on a key-range request.
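The start-key map lookup can be sketched with an ordered map: the tablet covering a row key is the entry with the greatest start key less than or equal to that key, found with upper_bound() and one step back. Names here are illustrative, not Kudu's actual types:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the per-table tablet map described above.
struct TabletInfo {
  std::string tablet_id;
};

// Ordered by tablet start key; the first tablet conventionally starts at "".
using TabletMap = std::map<std::string, TabletInfo>;

// Returns the tablet whose key range contains 'key', or nullptr if the key
// precedes the first tablet's start key.
const TabletInfo* FindTabletForKey(const TabletMap& tablets,
                                   const std::string& key) {
  auto it = tablets.upper_bound(key);  // first start key strictly > key
  if (it == tablets.begin()) return nullptr;
  --it;                                // greatest start key <= key
  return &it->second;
}
```

A key-range request is served by iterating from the entry covering the range's start key up to the first entry whose start key passes the range's end.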
The following corresponds to the code in CatalogManager::CreateTable().
After this point in time, the table is reported as created, which means that if the cluster is shut down, when it starts back up the table will still exist. However, the tablets are not yet created (see Table Assignment, below).
When the user sends a DeleteTable request for table T, table T is marked as deleted by writing a “deleted” flag in the state field in T's record in the “sys.catalog” table, table T is removed from the in-memory “table names” map on the Master, and the table is marked as being “deleted” in the in-memory TableInfo / TabletInfo “state” field on the Master. TODO: Could this race with table deletion / creation??
At this point, the table is no longer externally visible to clients via Master RPC calls, but the tablet configs that make up the table may still be up and running. New clients trying to open the table will get a NotFound error, while clients that already have the tablet locations cached may still be able to read and write to the tablet configs, as long as the corresponding tablet servers are online and their respective tablets have not yet been deleted. In some ways, this is similar to the design of FS unlink.
The Master will asynchronously send a DeleteTablet RPC request to each tablet (one RPC request per tablet server in the config, for each tablet), and the tablets will therefore be deleted in parallel in some unspecified order. If the Master or tablet server goes offline before a particular DeleteTablet operation successfully completes, the Master will send a new DeleteTablet request at the time that the next heartbeat is received from the tablet that is to be deleted.
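The heartbeat-driven retry can be sketched as below. The structures and names are hypothetical simplifications: the master remembers which tablets it still wants deleted, and any report of such a tablet triggers another DeleteTablet attempt rather than assuming the first one landed:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch of the master's delete-retry bookkeeping.
struct DeleteTracker {
  std::set<std::string> tablets_pending_delete;
  std::vector<std::string> outbound_delete_rpcs;  // stand-in for async RPCs

  // A heartbeat reported this tablet as still alive: if we want it gone,
  // queue another DeleteTablet RPC.
  void OnTabletReported(const std::string& tablet_id) {
    if (tablets_pending_delete.count(tablet_id)) {
      outbound_delete_rpcs.push_back(tablet_id);
    }
  }

  // A DeleteTablet RPC completed successfully: stop retrying.
  void OnDeleteTabletAck(const std::string& tablet_id) {
    tablets_pending_delete.erase(tablet_id);
  }
};
```

Because the retry is keyed off the tablet report, a master or tablet server crash between attempts is harmless: the next heartbeat re-surfaces the undeleted tablet.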
Once a table is created, the tablets must be created on a set of replicas. In order to do that, the master has to select the replicas and associate them with the tablet.
For each tablet not yet created, we select a set of replicas and a leader and send the “create tablet” request. On the next TS heartbeat from the leader we can mark the tablet as “running”, if reported. If we don't receive a “tablet created” report within ASSIGNMENT-TIMEOUT-MSEC, we replace the tablet with a new one, following these same steps for the new tablet.
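The per-tablet assignment decision can be sketched as a tiny state check; the enum, struct, and timeout constant below are illustrative, not the actual Kudu definitions:

```cpp
#include <cassert>

// Hypothetical sketch of a tablet's assignment state on the master.
enum class TabletState { kCreating, kRunning };

struct AssignedTablet {
  TabletState state = TabletState::kCreating;
  long create_sent_ms = 0;  // when the "create tablet" request went out
};

// Stand-in for ASSIGNMENT-TIMEOUT-MSEC (illustrative value).
constexpr long kAssignmentTimeoutMs = 30000;

// Checked by the background-tasks thread for each tablet still in kCreating:
// either keep waiting, or give up and replace the tablet with a new one.
bool ShouldReplace(const AssignedTablet& t, long now_ms) {
  return t.state == TabletState::kCreating &&
         now_ms - t.create_sent_ms > kAssignmentTimeoutMs;
}

// Called when the selected leader reports the tablet as created.
void MarkRunning(AssignedTablet* t) { t->state = TabletState::kRunning; }
```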
The Assignment is processed by the “CatalogManagerBgTasks” thread. This thread is waiting for an event that can be:
This is the current control flow:
Process Pending Assignments:
Cleanup deleted tables & tablets (FIXME: is this implemented?):
When the TS receives a CreateTablet() RPC, it will attempt to create the tablet replica locally. Once it is successful, it will be added to the next tablet report. When the tablet is reported, the master-side ProcessTabletReport() function is called.
If we find at this point that the reported tablet is in “creating” state, and the TS reporting the tablet is the leader selected during the assignment process (see CatalogManagerBgTasksThread above), the tablet will be marked as running and committed to disk, completing the assignment process.
When the user sends an alter request, which may contain changes to the schema, table name, or attributes, the Master will send an AlterTable() RPC to each TS hosting one of the table's currently running tablets. The Master will keep retrying in case of error.
If a TS is down or goes down during an AlterTable request, on restart it will report the schema version that it is using, and if it is out of date, the Master will send an AlterTable request to that TS at that time.
When the Master first comes online after being restarted, a full tablet report will be requested from each TS, and the tablet schema version sent on the next heartbeat will be used to determine if a given TS needs an AlterTable() call.
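The decision in the two paragraphs above reduces to a schema-version comparison; the helper name below is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a TS needs an AlterTable() call whenever the schema
// version it reports on a heartbeat lags the version committed in the
// master's catalog.
bool NeedsAlterTable(uint32_t reported_schema_version,
                     uint32_t committed_schema_version) {
  return reported_schema_version < committed_schema_version;
}
```

Because the check runs on every report, it covers both the steady-state case (a TS that missed an alter while down) and the master-restart case (the full tablet report after the master comes back online).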
Heartbeats are sent by the TS to the master. Per master.proto, a heartbeat contains:
Node instance information: permanent uuid, node sequence number (which is incremented each time the node is started).
(Optional) registration. Sent either at TS startup or if the master responded to a previous heartbeat with “needs register” (see ‘Handling heartbeats’ below for an explanation of when this response will be sent).
(Optional) tablet report. Sent either when tablet information has changed, or if the master responded to a previous heartbeat with “needs a full tablet report” (see “Handling heartbeats” below for an explanation of when this response will be sent).
Upon receiving a heartbeat from a TS, the master will:
Check if the heartbeat has registration info. If so, register the TS instance with TSManager (see “TSManager” below for more details).
Retrieve a TSDescriptor from TSManager. If the TSDescriptor is not found, reply to the TS with “need re-register” field set to true, and return early.
Update the heartbeat time (see “TSManager” below) in the registration object.
If the heartbeat contains a tablet report, the Catalog Manager will process the report and update its cache as well as the system tables (see “Catalog Manager” above). Otherwise, the master will respond to the TS requesting a full tablet report.
Send a success response to the TS.
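The steps above can be sketched as one handler function. All the types here are hypothetical simplifications of the master's actual structures (the real heartbeat messages are defined in master.proto):

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical, simplified heartbeat request/response shapes.
struct Heartbeat {
  std::string uuid;
  bool has_registration = false;
  bool has_tablet_report = false;
};

struct HeartbeatResponse {
  bool needs_reregister = false;
  bool needs_full_tablet_report = false;
  bool ok = false;
};

struct MasterState {
  // Stand-in for TSManager's uuid -> TSDescriptor map.
  std::unordered_map<std::string, long> last_heartbeat_ms;
};

HeartbeatResponse HandleHeartbeat(MasterState* m, const Heartbeat& hb,
                                  long now_ms) {
  HeartbeatResponse resp;
  // 1. Register the TS if the heartbeat carries registration info.
  if (hb.has_registration) m->last_heartbeat_ms[hb.uuid] = now_ms;
  // 2. Unknown TS: ask it to re-register and return early.
  auto it = m->last_heartbeat_ms.find(hb.uuid);
  if (it == m->last_heartbeat_ms.end()) {
    resp.needs_reregister = true;
    return resp;
  }
  // 3. Update the heartbeat time.
  it->second = now_ms;
  // 4. Process the tablet report if present; otherwise ask for a full one.
  if (!hb.has_tablet_report) resp.needs_full_tablet_report = true;
  // 5. Success.
  resp.ok = true;
  return resp;
}
```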
TSManager provides in-memory storage for information sent by the tablet server to the master (tablet servers that have been heard from, heartbeats, tablet reports, etc.). The information is stored in a map, where the key is the permanent uuid of a tablet server and the value is (a pointer to) a TSDescriptor.
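A minimal sketch of that registry, assuming a mutex-protected map of shared descriptor pointers (the class shape is illustrative, not the actual TSManager interface):

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical, simplified per-TS descriptor.
struct TSDescriptor {
  std::string uuid;
  long last_heartbeat_ms = 0;
};

class TSManager {
 public:
  void RegisterTS(const std::string& uuid) {
    std::lock_guard<std::mutex> l(lock_);
    auto desc = std::make_shared<TSDescriptor>();
    desc->uuid = uuid;
    descriptors_[uuid] = desc;
  }

  // Returns nullptr if the TS has never registered; shared ownership keeps
  // the descriptor valid even across a concurrent re-registration.
  std::shared_ptr<TSDescriptor> LookupTS(const std::string& uuid) const {
    std::lock_guard<std::mutex> l(lock_);
    auto it = descriptors_.find(uuid);
    return it == descriptors_.end() ? nullptr : it->second;
  }

 private:
  mutable std::mutex lock_;
  std::unordered_map<std::string, std::shared_ptr<TSDescriptor>> descriptors_;
};
```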
Besides tables' metadata, the system table contains the root CA certificate and corresponding private key when Kudu is configured to use its own IPKI (Internal Private Key Infrastructure). The root CA certificate and the private key are used to
When a Kudu master server starts up and becomes leader, it generates and stores the root CA certificate and corresponding private key if no such information is present in the system table. If the internal root CA information is already present in the system table, the leader master loads that information into memory and uses it appropriately.
The system table contains entries with TSKs used for authn/authz token signing. The leader master generates and stores those in the system table. Upon start-up or on the change of master leadership, a new leader master loads existing TSK entries from the system table and populates in-memory structures necessary for token signing. Expired keys are lazily purged from the system table by the leader master.
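The lazy purge of expired TSKs can be sketched as a filter over an ordered key set; the structures below are hypothetical stand-ins for the actual TSK entries:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical token-signing-key entry, keyed by sequence number.
struct TokenSigningKey {
  int64_t seq_num;
  int64_t expire_at_ms;
};

using TskMap = std::map<int64_t, TokenSigningKey>;

// Drop keys whose expiration time has passed; returns how many were purged.
// In the sketch this runs whenever the leader next touches the key set,
// rather than on a dedicated schedule ("lazily").
int PurgeExpiredTsks(TskMap* tsks, int64_t now_ms) {
  int purged = 0;
  for (auto it = tsks->begin(); it != tsks->end();) {
    if (it->second.expire_at_ms <= now_ms) {
      it = tsks->erase(it);
      ++purged;
    } else {
      ++it;
    }
  }
  return purged;
}
```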