set key -> newvalue [txid] // Inserted row has a special metadata containing transaction id it's enlisted in.
set txid + key -> oldvalue (for aborting purposes)
on commit:
set txid -> commited
release exclusive lock
async clear garbage
on abort:
set txid -> aborted
remove key -> newvalue
set key -> oldvalue
release exclusive lock
async clear garbage
We assume only row level locking in the first approach, which gives us a repeatable_read isolation.
When the SQL query is executed, it acquires locks while the the data is collected on data nodes.
Locks are acquired lazily as result set is consumed.
The locking rules are same as for get/put operations.
Then values are removed from indexes on step 2, they are written as tombstones to avoid read inconsistency and should be cleaned up after tx finish.
Failover protocol is similar to Ignite 2 with a main difference: until tx is sure it can commit or rollback, it holds its locks. This means in the case of split-brain, some keys will be locked until split-brain situation is resolved and tx recovery protocol will converge.
If a tx is not started to COMMIT, the coordinator reverts a transaction on remaining leaseholders. Then a new leasholder is elected, it check for its pending transactions and asks a coordinator if it's possible to commit.
Broadcast recovery (various strategies are possible: via gossip or dedicated node) is necessary (because we don‘t have full tx topology on each enlisted node - because it’s unknown until commit). All nodes are requested about local txs state. If at least one is commiting, it's safe to commit.
[1] CockroachDB: The Resilient Geo-Distributed SQL Database, 2020 ↩
[2] Concurrency Control in Distributed Database Systems, 1981 ↩