Since the last rebalance design, we have made some significant decisions and architecture updates:
These changes suggest that we need to revise the current rebalance flow: on the one hand, it no longer suits the new architecture; on the other, it doesn't use the power of the new abstractions.
The simplest way to start the journey to the new design is to look at the real cases and try to draw the whole picture.
We still have a number of triggers that start a rebalance:
Let's take the first one to draw the whole rebalance picture.
The update of the zoneId.assignments.* keys can be expressed by the following pseudo-code:
var newAssignments = calculateZoneAssignments()

metastoreInvoke: // atomic metastore call through multi-invoke api
    if empty(zoneId.assignments.change) || zoneId.assignments.change < configurationUpdate.timestamp:
        if empty(zoneId.assignments.pending) && zoneId.assignments.stable != newAssignments:
            zoneId.assignments.pending = newAssignments
            zoneId.assignments.change = configurationUpdate.timestamp
        else:
            if zoneId.assignments.pending != newAssignments:
                zoneId.assignments.planned = newAssignments
                zoneId.assignments.change = configurationUpdate.timestamp
            else:
                remove(zoneId.assignments.planned)
    else:
        skip
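To make the key transitions easier to follow, here is a minimal, self-contained Java sketch that models the zoneId.assignments.* keys as plain fields and applies one trigger. The AssignmentsKeys class and applyTrigger method are illustrative only and stand in for the real conditional meta-storage invoke.

import java.util.List;
import java.util.Objects;

// Illustrative in-memory model of the zoneId.assignments.* keys; in the real flow
// this logic runs as a single conditional meta-storage invoke, not as local code.
final class AssignmentsKeys {
    List<String> stable;   // zoneId.assignments.stable
    List<String> pending;  // zoneId.assignments.pending
    List<String> planned;  // zoneId.assignments.planned
    Long change;           // zoneId.assignments.change (timestamp of the last applied trigger)

    // Applies one configuration-update trigger, mirroring the pseudo-code above.
    void applyTrigger(List<String> newAssignments, long triggerTimestamp) {
        if (change != null && change >= triggerTimestamp) {
            return; // stale trigger: skip
        }
        if (pending == null && !Objects.equals(stable, newAssignments)) {
            pending = newAssignments;     // start a new rebalance
            change = triggerTimestamp;
        } else if (!Objects.equals(pending, newAssignments)) {
            planned = newAssignments;     // queue it after the one in progress
            change = triggerTimestamp;
        } else {
            planned = null;               // pending already matches: drop the planned step
        }
    }
}

For example, starting from stable = [A, B, C] with an empty pending key, a trigger with newAssignments = [A, B, D] fills pending; a second trigger with [A, D, E] that arrives while that rebalance is still pending goes to planned instead.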
It looks like we can reuse the mechanism of AwaitReplicaRequest:
Let's zoom in on the details of PrimaryReplica and replication group communication for the RAFT case:
The current rebalance algorithm is based on metastore invokes and local rebalance listeners.
But for the new one, we have an idea that doesn't need the metastore at all:
Within the single atomic metastore invoke we must update the keys according to the following pseudo-code:
metastoreInvoke: // atomic
    zoneId.assignment.stable = newPeers
    remove(zoneId.assignment.cancel)
    if empty(zoneId.assignment.planned):
        zoneId.assignment.pending = empty
    else:
        zoneId.assignment.pending = zoneId.assignment.planned
        remove(zoneId.assignment.planned)
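Continuing the same illustrative Java model (again an assumption, not the real meta-storage API), the single atomic invoke that runs once the pending rebalance has finished could be modelled like this:

import java.util.List;

// Continuation of the illustrative model above: what the single atomic invoke does
// once the ReplicationGroup reports that the pending rebalance has finished.
final class RebalanceDoneHandler {
    static void onRebalanceDone(AssignmentsKeys keys, List<String> newPeers) {
        keys.stable = newPeers;           // zoneId.assignment.stable = newPeers
        // remove(zoneId.assignment.cancel) is issued in the same invoke (not modelled here)
        if (keys.planned == null) {
            keys.pending = null;          // nothing queued: clear pending
        } else {
            keys.pending = keys.planned;  // promote the queued assignments to pending
            keys.planned = null;          // remove(zoneId.assignment.planned)
        }
    }
}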
You can read about the *.cancel key below.
Here we need to:
The main idea of the failover process: every rebalance request and cancel rebalance request, whether PlacementDriver->PrimaryReplica or PrimaryReplica->ReplicationGroup, must be idempotent. So, in the worst case a redundant request should simply be answered positively (just as if the rebalance were already done).
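As a minimal sketch of what such idempotent handling could look like on the receiving side (the ReplicationGroupStub class and its string answers are assumptions made for illustration, not the real message handlers), a redundant RebalanceRequest is simply acknowledged:

import java.util.Set;

// Illustrative only: a replication group stub that treats a repeated RebalanceRequest
// as a no-op instead of an error, so retries after a failover are always safe.
final class ReplicationGroupStub {
    private Set<String> currentPeers;  // configuration that is already applied
    private Set<String> targetPeers;   // non-null while a configuration change is running

    // Answers positively both for a brand-new request and for a redundant retry.
    synchronized String handleRebalanceRequest(Set<String> requestedPeers) {
        if (requestedPeers.equals(currentPeers)) {
            return "done";             // already rebalanced to this configuration
        }
        if (requestedPeers.equals(targetPeers)) {
            return "in-progress";      // the same change is already running: just acknowledge
        }
        targetPeers = requestedPeers;  // start the configuration change
        return "started";
    }
}

The same rule applies to CancelRebalanceRequest: cancelling a rebalance that is already cancelled or already finished is also answered positively.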
After that we can prepare the following logic:
- PlacementDriver reads the zoneId.assignment.pending / zoneId.assignment.cancel keys (the last one always wins, if it exists), sends RebalanceRequest/CancelRebalanceRequest to the needed PrimaryReplicas, and then listens for updates of these keys starting from the last seen revision.
- PlacementDriver sends RebalanceRequest/CancelRebalanceRequest to a PrimaryReplica again if the pending/cancel key (cancel always wins, if filled) is not empty; see the recovery sketch below.

Moreover, RebalanceRequest/CancelRebalanceRequest must include the revision of its trigger; if a request was built from a stale revision (currentRevision-1), PlacementDriver must send the *Request again with the current actual revision.
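A possible recovery pass for this logic is sketched below. MetastoreView, Entry and ReplicaClient are hypothetical interfaces invented for the illustration (not real Ignite APIs), and using the key entry's revision as the trigger revision is an assumption.

import java.util.List;

// Illustrative recovery pass for PlacementDriver; MetastoreView, Entry and ReplicaClient
// are hypothetical interfaces used only for this sketch.
final class RebalanceRecovery {
    interface Entry { boolean empty(); long revision(); List<String> peers(); }
    interface MetastoreView { Entry get(String key); }
    interface ReplicaClient {
        void sendRebalanceRequest(String zoneId, List<String> peers, long triggerRevision);
        void sendCancelRebalanceRequest(String zoneId, List<String> peers, long triggerRevision);
    }

    // Re-sends the request derived from the persisted keys; the cancel key always wins.
    static void recover(String zoneId, MetastoreView metastore, ReplicaClient client) {
        Entry cancel = metastore.get(zoneId + ".assignment.cancel");
        Entry pending = metastore.get(zoneId + ".assignment.pending");

        if (!cancel.empty()) {
            // assumption: the key entry's revision is used as the trigger revision
            client.sendCancelRebalanceRequest(zoneId, cancel.peers(), cancel.revision());
        } else if (!pending.empty()) {
            client.sendRebalanceRequest(zoneId, pending.peers(), pending.revision());
        }
        // After this pass, PlacementDriver keeps listening to the keys starting from
        // the revisions it has just observed.
    }
}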
Sometimes we must cancel the ongoing rebalance. To persist the cancel intent, we must save the (oldTopology, newTopology) pair of peer lists to the zoneId.assignment.cancel key. Also, every invoke that updates the *.cancel key must be enriched with the revision of the pending key which must be cancelled:
if zoneId.assignment.pending.revision == inputRevision:
    zoneId.assignment.cancel = cancelValue
    return true
else:
    return false
This is needed to prevent a race between rebalance completion and cancel persisting; otherwise we could try to cancel the wrong rebalance process.
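A minimal Java sketch of this guard, assuming the comparison and the write happen atomically inside one conditional metastore invoke (the CancelIntent class and TopologyPair record are invented for the illustration):

import java.util.List;

// Illustrative revision guard for persisting the cancel intent; in the real flow the
// check and the write happen inside a single conditional metastore invoke.
final class CancelIntent {
    record TopologyPair(List<String> oldTopology, List<String> newTopology) { }

    private long pendingRevision;  // revision of zoneId.assignment.pending
    private TopologyPair cancel;   // zoneId.assignment.cancel, once persisted

    // Persists the cancel intent only if it still targets the rebalance we want to stop.
    synchronized boolean tryPersistCancel(long inputRevision, TopologyPair value) {
        if (pendingRevision != inputRevision) {
            return false;          // pending already changed: refuse the stale cancel
        }
        cancel = value;            // zoneId.assignment.cancel = cancelValue
        return true;
    }
}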
When PrimaryReplica sends CancelRebalanceRequest(oldTopology, newTopology) to the ReplicationGroup, the following cases are possible: