Let's begin with a simple case, written in a syntax similar to the isolation2 test framework:
    A: begin;
    A: update foo set val=val where id=1;
    B: begin;
    B: update foo set val=val where id=2;
    B&: update foo set val=val where id=1;
    A&: update foo set val=val where id=2;
In this case there are two concurrent transactions, A and B. At first A holds a Xid lock on the tuple with id=1 and B holds a Xid lock on the tuple with id=2, without blocking. But then B gets blocked by A when trying to update the tuple with id=1, and A gets blocked by B when trying to update the tuple with id=2, so a deadlock is formed. In this case the tuple with id=1 and the tuple with id=2 are on the same segment, so the deadlock happens locally.
On a single segment this deadlock can be automatically detected. The detection is based on a cycle detection algorithm:
In the above case there are two vertices, A and B, and two directed edges, A==>B and B==>A. These two edges form a directed cycle, so the deadlock can be detected.
    +---------------------------+
    | seg 0                     |
    |                           |
    |  +-----+   id=2  +-----+  |
    |  |     | ======> |     |  |
    |  |  A  |         |  B  |  |
    |  |     | <====== |     |  |
    |  +-----+   id=1  +-----+  |
    |                           |
    +---------------------------+
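As an illustration of that local check, below is a minimal Python sketch of cycle detection over such a wait-for graph. The function name and graph representation are invented for this document and are not the actual lock-manager implementation.

    def has_cycle(wait_for):
        """Detect a cycle in a wait-for graph given as {waiter: set_of_holders}."""
        visiting, done = set(), set()

        def visit(v):
            if v in done:
                return False
            if v in visiting:
                return True              # found a back edge, i.e. a wait cycle
            visiting.add(v)
            if any(visit(h) for h in wait_for.get(v, ())):
                return True
            visiting.remove(v)
            done.add(v)
            return False

        return any(visit(v) for v in list(wait_for))

    # The local graph above: A waits for B (id=2), B waits for A (id=1).
    assert has_cycle({"A": {"B"}, "B": {"A"}})

A depth-first search that reports a back edge is enough here; the point is only that, within one segment, the full graph is available, so the cycle is visible.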
However, on a distributed cluster (e.g. Apache Cloudberry), rows with different ids are distributed to different segments. Suppose the rows with id=1 are on seg 0 and the rows with id=2 are on seg 1; then each segment sees only a part of the graph.
    +---------------------------+    +---------------------------+
    | seg 0                     |    | seg 1                     |
    |                           |    |                           |
    |  +-----+         +-----+  |    |  +-----+         +-----+  |
    |  |     |  id=1   |     |  |    |  |     |  id=2   |     |  |
    |  |  B  | ======> |  A  |  |    |  |  A  | ======> |  B  |  |
    |  |     |         |     |  |    |  |     |         |     |  |
    |  +-----+         +-----+  |    |  +-----+         +-----+  |
    |                           |    |                           |
    +---------------------------+    +---------------------------+
This forms a global / distributed deadlock.
To prevent global deadlocks, Apache Cloudberry holds an exclusive table lock for UPDATE and DELETE commands, so concurrent updates are effectively disabled.
This policy does prevent global deadlocks, but at the cost of poor performance for concurrent updates.
If we could gather all these edges onto one segment, we could build the same graph as in the local case and detect the deadlock.
However, if we run the detection directly on these gathered edges, some false deadlocks can be detected. A transaction can be waiting for at most one other transaction on any single host, but in a distributed system it can be waiting for multiple other transactions globally. Some waiting relations change automatically when the holding object changes, but some do not. An example of an unchanging waiting relation is waiting for a Xid lock (it can only be released after the holding transaction is over). An example of a changing waiting relation is waiting for a tuple lock (the tuple lock can be released before the holding transaction is over). For concrete examples, please refer to the test cases.
A proper global deadlock detector must be designed with these distributed-system-specific characteristics in mind. The basic requirements are: if there is a deadlock, find it and break it; and never raise false alarms.
Our global deadlock detection algorithm is based on the local version, with extra logic to distinguish false deadlocks from real ones.
The detection is based on one basic assumption:
And we know some facts:
Based on these, we can make the following deductions:
- Rule 1: delete a vertex if its global out degree is 0;
- Rule 2: delete a vertex if its global in degree is 0;
- Rule 3: delete a vertex's dotted in edges on a segment if its local out degree is 0.

Rules 1 & 2 are straightforward, as they are also the rules for local deadlock detection. Rule 3 needs further deduction:
Suppose:
For A:
For B:
With these three rules we can construct an algorithm like this:
    # reduce edges and vertices
    while true:
        for vertex in vertices_with_zero_global_out_degree():
            # delete vertex and its solid/dotted in edges on all segments
            delete_vertex(vertex)

        for vertex in vertices_with_zero_global_in_degree():
            # delete vertex and its solid/dotted out edges on all segments
            delete_vertex(vertex)

        for segment in all_segments():
            for vertex in vertices_with_zero_local_out_degree(segment):
                # delete vertex's dotted in edges on segment,
                # related vertices' global in/out degrees are also updated
                delete_dotted_in_edges_on_segment(segment, vertex)

        if nothing_deleted_in_current_loop():
            # no more vertex or edge to delete
            break

    # detect for deadlock cycle
    if count_vertices_not_deleted() > 0:
        # detected at least one cycle
        # the cycles are only reliable if all the contained vertices
        # are valid transactions on master
        if all_vertices_are_valid():
            # choose a vertex and cancel it
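As a complement to the pseudocode, here is a self-contained Python sketch of the reduction loop. The Edge type, the solid flag and the function name are made up for illustration and do not mirror the actual implementation; the sketch simply re-applies the three rules until nothing changes.

    from collections import namedtuple

    # One waiting relation on one segment: waiter ==> holder.
    # solid=False marks a dotted edge (e.g. waiting on a tuple lock).
    Edge = namedtuple("Edge", "segid waiter holder solid")

    def gdd_reduce(edges):
        """Apply Rules 1-3 repeatedly and return the edges that survive."""
        edges = set(edges)
        while True:
            waiters = {e.waiter for e in edges}   # vertices with out edges
            holders = {e.holder for e in edges}   # vertices with in edges

            # Rule 1: a holder with zero global out degree is deleted,
            #         which removes all of its in edges.
            # Rule 2: a waiter with zero global in degree is deleted,
            #         which removes all of its out edges.
            keep = {e for e in edges
                    if e.holder in waiters and e.waiter in holders}

            # Rule 3: if a vertex has zero local out degree on a segment,
            #         its dotted in edges on that segment are deleted.
            local_out = {(e.segid, e.waiter) for e in keep}
            keep = {e for e in keep
                    if e.solid or (e.segid, e.holder) in local_out}

            if keep == edges:
                return edges          # fixed point reached
            edges = keep

If the returned set is non-empty, the remaining vertices form at least one cycle; as in the pseudocode above, they still have to be validated against the master before a victim is cancelled.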
In this algorithm we need to maintain the following information:
We maintain a local graph on each segment, just as in the local deadlock detection algorithm; the difference is that we also apply extra rules to reduce edges/vertices based on global information. There is no need to maintain a separate global graph.
Edges can be collected with the pg_locks view.
    select a.*, b.*
      from pg_locks a
      join pg_locks b
        on a.locktype = b.locktype
       and a.pid <> b.pid
       and a.granted = 'f'
       and a.locktype in ('transactionid', 'relation', 'tuple')
       and ((a.locktype = 'transactionid'
             and a.transactionid = b.transactionid)
            or (a.locktype = 'relation'
                and a.database = b.database
                and a.relation = b.relation)
            or (a.locktype = 'tuple'
                and a.database = b.database
                and a.relation = b.relation
                and a.page = b.page
                and a.tuple = b.tuple));
However, pg_locks lacks some necessary information:
So we have to create a new UDF, gp_dist_wait_status, to collect all the necessary information. This UDF is similar to pg_locks, but with the missing information included. An example output:
    select * from gp_dist_wait_status();

     segid | waiter_dxid | holder_dxid | holdTillEndXact
    -------+-------------+-------------+-----------------
        -1 |          29 |          28 | t
         0 |          28 |          26 | t
         0 |          27 |          29 | t
         1 |          26 |          27 | t
    (4 rows)
The local wait-relationship graphs collected from the segments are generated asynchronously; we have to take the impact of this asynchrony into account.
Our algorithm reduces all the edges that can possibly change. On each segment, the reduced graph's end vertices (vertices whose local out degree is 0) must have only solid in edges. We can then validate these transactions on the master: if they are all valid, the solid edges pointing to them are also valid, and so are the transactions waiting for them directly or indirectly, so the graphs are valid and the detected deadlock cycles are real; if any of these transactions is invalid on the master, the graphs are not reliable and we should retry later.
So our algorithm only requires the edge information to be consistent at the segment level; different segments can report their information at different times.
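A sketch of that validation step, under the assumption that the caller can obtain the set of distributed transaction ids still running on the master (a hypothetical input here):

    def deadlock_is_reliable(remaining_vertices, live_xids_on_master):
        """A surviving cycle is trusted only if every vertex in it is still a
        valid (running) distributed transaction as seen by the master."""
        return all(v in live_xids_on_master for v in remaining_vertices)

    # If transactions 26..29 survive the reduction but 27 has already finished
    # on the master, the snapshots were stale: skip this round and retry later.
    print(deadlock_is_reliable({26, 27, 28, 29}, {26, 28, 29}))   # -> False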
Edge types are detected locally by gp_dist_wait_status on each segment; the policies are:
- Relation locks that have been acquired in non NO_LOCK mode at least once are solid edges; otherwise they are dotted edges.

For relation locks there are ref counts in both Relation and LOCK. If a relation lock is opened in non NO_LOCK mode but closed in NO_LOCK mode, the ref count in LOCK can no longer reach 0, so the lock can no longer be released until end of transaction; that is why we can take it as a solid edge.
Let's run the new algorithm on an actual case to understand how it works.
The testing cluster has 3 segments: seg -1, seg 0 and seg 1; the master is on seg -1. A table t1 is created as below:
    create table t1 (id int, val int);
    insert into t1 values (1,1), (2,2), (3,3), (4,4);
Its data distribution is as below:
    select gp_segment_id, * from t1 order by id;

     gp_segment_id | id | val
    ---------------+----+-----
                 0 |  1 |   1
                 1 |  2 |   2
                 0 |  3 |   3
                 1 |  4 |   4
    (4 rows)
The sessions are like below:
    C: begin;
    C: update t1 set val=30 where id=2;

    A: begin;
    A: update t1 set val=10 where val=3;

    B: begin;
    B: update t1 set val=20 where id=4;
    B&: update t1 set val=20 where val=3 or id=2;
        -- B ==> A: xid lock on seg 0
        -- B ==> C: xid lock on seg 1

    A&: update t1 set val=10 where val=2;
        -- A ~~> B: tuple lock on seg 1

    D: begin;
    D&: update t1 set val=40 where id=4;
        -- D ==> B: xid lock on seg 1
With the collected edges we form the original graphs below on each segment:
    +-----------------------+    +-------------------------------------+
    | seg 0                 |    | seg 1                               |
    |                       |    |  +-----+                            |
    |  +-----+     +-----+  |    |  |  A  | ~~~~>  +-----+     +-----+ |
    |  |     |     |     |  |    |  +-----+        |     |     |     | |
    |  |  B  |====>|  A  |  |    |                 |  B  |====>|  C  | |
    |  |     |     |     |  |    |  +-----+        |     |     |     | |
    |  +-----+     +-----+  |    |  |  D  | ====>  +-----+     +-----+ |
    |                       |    |  +-----+                            |
    +-----------------------+    +-------------------------------------+
Edges are reduced in this order:
- B1 ==> C1: as C's global out degree is 0 (Rule 1);
- D1 ==> B1: as D's global in degree is 0 (Rule 2);
- A1 ~~> B1: as B1's local out degree is 0 and it's a dotted edge (Rule 3);
- B0 ==> A0: as A's global out degree is 0 (Rule 1).

As all edges are removed, there is no deadlock cycle.
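The same conclusion can be reproduced with a small standalone script that encodes the four collected edges and re-applies the three rules until a fixed point; the tuple layout (segid, waiter, holder, solid) is, again, only an illustration:

    # Edges collected in the example above; solid=False marks the dotted edge.
    edges = {
        (0, "B", "A", True),    # B ==> A: xid lock on seg 0
        (1, "B", "C", True),    # B ==> C: xid lock on seg 1
        (1, "A", "B", False),   # A ~~> B: tuple lock on seg 1
        (1, "D", "B", True),    # D ==> B: xid lock on seg 1
    }

    while True:
        waiters = {w for _, w, _, _ in edges}
        holders = {h for _, _, h, _ in edges}
        local_out = {(s, w) for s, w, _, _ in edges}
        keep = {(s, w, h, solid) for s, w, h, solid in edges
                if h in waiters                       # Rule 1
                and w in holders                      # Rule 2
                and (solid or (s, h) in local_out)}   # Rule 3
        if keep == edges:
            break
        edges = keep

    print(edges)   # -> set(): all edges are reduced away, so no deadlock

The fixed point does not depend on the order in which the rules fire: whichever rule is applied first, every edge in this example is eventually removed.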
vi:cc=80:tw=78:ai:et: