This document describes user-level and component-level cluster lifecycles and their mutual interaction.
A node maintains its' local state in the local persistent key-value storage named vault. The data stored in the vault is semantically divided in the following categories:
The vault is created during the first node startup and optionally populated with the paremeters from the configuration file passed in to the ignite node start
[command](TODO link to CLI readme). Only user-level properties can be written via the provided file.
System-level properties are written to the storage during the first vault initialization by the node start process. Projected properties are not initialized during the initial node startup because at this point the local node is not aware of the distributed metastorage. The node remains in a ‘zombie’ state until after it learns that there is an initialized metastorage (either via the ignite cluster init
[command](TODO link to CLI readme) during the initial cluster initialization) or from the group membershup service via gossip (implying that group membership protocol is working at this point).
For testability purposes, we require that component dependencies are defined upfront and provided at the construction time. This additionaly requires that component dependencies form no cycles. Therefore, components form an acyclic directed graph that is constructed in topological sort order wrt root.
Components created and initialized also in an order consistent with a topological sort of the components graph. This enforces serveral rules related to the components interaction:
B
depdends on a component A
, then A
receives watch notification prior to B
.The diagram above shows the component dependency diagram and provides an order in which compomnents may be initialized.
For a cluster to become operational, the metastorage instance must be initialized first. The initialization command chooses a set of nodes (normally, 3 - 5 nodes) to host the distributed metastorage Raft group. When a node receives the initialization command, it either creates a bootstrapped Raft instance with the given members (if this is a metastorage group node), or writes the metastorage group member IDs to the vault as a private system-level property.
After the metastorage is initialized, components start to receive and process watch events, updating the local state according to the changes received from the watch.
An entry point to user-initiated cluster state changes is cluster configuration. Configuration module provides convenient ways for managing configuration both as Java API, as well as from ignite
command line utility.
Any configuration change is translated to a metastorage multi-update and has a single configuration change ID. This ID is used to enforce CAS-style configuration changes and to ensure no duplicate configuration changes are executed during the cluster runtime. To reliably process cluster configuration changes, we introduce an additional metastorage key
internal.configuration.applied=<change ID>
that indicates the configuration change ID that was already processed and corresponding changes are written to the metastorage. Whenever a node processes a configuration change, it must also conditionally update the internal.configuration.applied
value checking that the previous value is smaller than the change ID being applied. This prevents configuration changes being processed more than once. Any metastorage update that processes configuration change must update this key to indicate that this configuraion change has been already processed. It is safe to process the same configuration change more than once since only one update will be applied.
All cluster state is written and maintained in the metastorage. Nodes may update some state in the metastorage, which may require a recomputation of some other metastorage properties (for example, when cluster baseline changes, Ignite needs to recalculate table affinity assignments). In other words, some properties in the metastore are dependent on each other and we may need to reliably update one property in response to an update to another.
To facilitate this pattern, Ignite uses the metastorage ability to replay metastorage changes from a certain revision called [watch](TODO link to metastorage watch). To process watch updates reliably, we associate a special persistent value called applied revision
(stored in the vault) with each watch. We rely on the following assumptions about the reliable watch processing:
applied revision
is propagated only after the metastorage confirmed that the proposed change is committed (note that in this case it does not matter whether this particular multi-update succeeds or not: we know that the first multi-update will succeed, and all further updates are idempotent, so we need to make sure that at least one multi-update is committed).applied revision
value.In a case of a crash, each watch is restared from the revision stored in the corresponding applied revision
variable of the watch, and not processed events are replayed.
CREATE TABLE
flowWe require that each Ignite table is assigned a globally unique ID (the ID must not repeat even after the table is dropped, so we use the metastorage key revision to assign table IDs).
To create a table, a user makes a change in the configuration tree by introducing the corresponding configuration object. This can be done either via public [configuration API](TODO link to configuration API) or via the ignite
[configuration command](TODO link to CLI readme). Configuration validator checks that a table with the same name does not exist (and performs other necessary checks) and writes the change to the metastorage. If the update succeeds, Ignite considers the table created and completes user call.
After the configuration change is applied, the table manager receives configuration change notification (essentially, a transformed watch) on metastorage group nodes. Table manager uses configuration keys update counters (not revision) as table IDs and attempts to create the following keys (updating the internal.configuration.applied
key as was described above):
internal.tables.<ID>=<name>
In order to process affinity calculations and assignments, the affinity manager creates a reliable watch for the following keys on metastorage group members:
internal.tables.* internal.baseline
Whenever a watch is fired, the affinity manager checks which key was updated. If the watch is triggered for internal.tables.<ID>
key, it calculates a new affinity for the table with the given ID. If the watch is triggered for internal.baseline
key, the manager recalculates affinity for all tables exsiting at the watch revision (this can be done using the metastorage range(keys, upperBound)
method providing the watch event revision as the upper bound). The calculated affinity is written to the internal.tables.affinity.<ID>
key.
Note that ideally the watch should only be processed on metastorage group leader, thus eliminating unnecessary network trips. Theoretically, we could have embedded this logic to the state machine, but this would enormously complicate the cluster updates and put metastorage consistency at risk.
To handle partition assignments, partition manager creates a reliable watch for the affinity assignment key on all nodes:
internal.tables.affinity.<ID>
Whenever a watch is fired, the node checks whether there exist new partitions assigned to the local node, and if there are, the node bootstraps corresponding Raft partition servers (i.e. allocates paths to Raft logs and storage files). The allocation information is written to projected vault keys:
local.tables.partition.<ID>.<PARTITION_ID>.logpath=/path/to/raft/log local.tables.partition.<ID>.<PARTITION_ID>.storagepath=/path/to/storage/file
Once the projected keys are synced to the vault, the partition manager can create partition Raft servers (initialize the log and storage, write hard state, register message handlers, etc). Upon startup, the node checks the existing projected keys (finishing the raft log and storage initialization in case it crashed in the midst) and starts the Raft servers.