Let's begin with a simple case, written in a syntax similar to the isolation2 test framework:
    A: begin;
    A: update foo set val=val where id=1;
    B: begin;
    B: update foo set val=val where id=2;
    B&: update foo set val=val where id=1;
    A&: update foo set val=val where id=2;
In this case there are two concurrent transactions, A and B. At first A holds a Xid lock on the tuple with id=1 and B holds a Xid lock on the tuple with id=2, without blocking. But then B gets blocked by A when trying to update the tuple with id=1, and A gets blocked by B when trying to update the tuple with id=2, so a deadlock is formed. In this case the tuple with id=1 and the tuple with id=2 are on the same segment, so the deadlock happens locally.
On a single segment this deadlock can be automatically detected. The detection is based on a cycle detection algorithm:
In the above case there are two vertices, A and B, and two directed edges, A==>B and B==>A. These two edges form a directed cycle, so the deadlock can be detected.
    +---------------------------+
    | seg 0                     |
    |                           |
    |  +-----+   id=2  +-----+  |
    |  |     | ======> |     |  |
    |  |  A  |         |  B  |  |
    |  |     | <====== |     |  |
    |  +-----+   id=1  +-----+  |
    |                           |
    +---------------------------+
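As an illustration of that local check, below is a minimal Python sketch of cycle detection over such a wait-for graph. The function name and graph representation are invented for this document and are not the actual lock-manager implementation.

    def has_cycle(wait_for):
        """Detect a cycle in a wait-for graph given as {waiter: set_of_holders}."""
        visiting, done = set(), set()

        def visit(v):
            if v in done:
                return False
            if v in visiting:
                return True              # found a back edge, i.e. a wait cycle
            visiting.add(v)
            if any(visit(h) for h in wait_for.get(v, ())):
                return True
            visiting.remove(v)
            done.add(v)
            return False

        return any(visit(v) for v in list(wait_for))

    # The local graph above: A waits for B (id=2), B waits for A (id=1).
    assert has_cycle({"A": {"B"}, "B": {"A"}})

A depth-first search that reports a back edge is enough here; the point is only that, within one segment, the full graph is available, so the cycle is visible.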
However, on a distributed cluster (e.g. Apache Cloudberry), rows with different ids are distributed to different segments. Suppose the rows with id=1 are on seg 0 and the rows with id=2 are on seg 1; then each segment sees only a part of the graph.
    +---------------------------+    +---------------------------+
    | seg 0                     |    | seg 1                     |
    |                           |    |                           |
    |  +-----+         +-----+  |    |  +-----+         +-----+  |
    |  |     |  id=1   |     |  |    |  |     |  id=2   |     |  |
    |  |  B  | ======> |  A  |  |    |  |  A  | ======> |  B  |  |
    |  |     |         |     |  |    |  |     |         |     |  |
    |  +-----+         +-----+  |    |  +-----+         +-----+  |
    |                           |    |                           |
    +---------------------------+    +---------------------------+
This forms a global / distributed deadlock.
To prevent global deadlocks, Apache Cloudberry holds an exclusive table lock for UPDATE and DELETE commands, so concurrent updates are effectively disabled.
This policy does prevent global deadlocks, but at the cost of poor performance for concurrent updates.
If we could gather all these edges onto one segment, we could build the same graph as in the local case and detect the deadlock.
However, if we run the detection directly on these gathered edges, some false deadlocks can be detected. A transaction can be waiting for at most one other transaction on any single host, but in a distributed system it can be waiting for multiple other transactions globally. Some waiting relations change automatically when the holding object changes, but some do not. An example of an unchanging waiting relation is waiting for a Xid lock (it can only be released after the holding transaction is over). An example of a changing waiting relation is waiting for a tuple lock (the tuple lock can be released before the holding transaction is over). For concrete examples, please refer to the test cases.
A proper global deadlock detector must be designed with these distributed-system-specific characteristics in mind. The basic requirements are: if there is a deadlock, find it and break it; and never raise false alarms.
Our global deadlock detection algorithm is based on the local version, with extra logic to distinguish false deadlocks from real ones.
The detection is based on one basic assumption:
And we know some facts:
Based on these, we can make the following deductions:
- Rule 1: delete a vertex if its global out degree is 0;
- Rule 2: delete a vertex if its global in degree is 0;
- Rule 3: delete a vertex's dotted in edges on a segment if its local out degree is 0.

Rules 1 & 2 are straightforward, as they are also the rules for local deadlock detection. Rule 3 needs further deduction:
Suppose:
For A:
For B:
With these three rules we can construct an algorithm like this:
    # reduce edges and vertices
    while true:
        for vertex in vertices_with_zero_global_out_degree():
            # delete vertex and its solid/dotted in edges on all segments
            delete_vertex(vertex)

        for vertex in vertices_with_zero_global_in_degree():
            # delete vertex and its solid/dotted out edges on all segments
            delete_vertex(vertex)

        for segment in all_segments():
            for vertex in vertices_with_zero_local_out_degree(segment):
                # delete vertex's dotted in edges on segment,
                # related vertices' global in/out degrees are also updated
                delete_dotted_in_edges_on_segment(segment, vertex)

        if nothing_deleted_in_current_loop():
            # no more vertex or edge to delete
            break

    # detect for deadlock cycle
    if count_vertices_not_deleted() > 0:
        # detected at least one cycle
        # the cycles are only reliable if all the contained vertices
        # are valid transactions on master
        if all_vertices_are_valid():
            # choose a vertex and cancel it
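As a complement to the pseudocode, here is a self-contained Python sketch of the reduction loop. The Edge type, the solid flag and the function name are made up for illustration and do not mirror the actual implementation; the sketch simply re-applies the three rules until nothing changes.

    from collections import namedtuple

    # One waiting relation on one segment: waiter ==> holder.
    # solid=False marks a dotted edge (e.g. waiting on a tuple lock).
    Edge = namedtuple("Edge", "segid waiter holder solid")

    def gdd_reduce(edges):
        """Apply Rules 1-3 repeatedly and return the edges that survive."""
        edges = set(edges)
        while True:
            waiters = {e.waiter for e in edges}   # vertices with out edges
            holders = {e.holder for e in edges}   # vertices with in edges

            # Rule 1: a holder with zero global out degree is deleted,
            #         which removes all of its in edges.
            # Rule 2: a waiter with zero global in degree is deleted,
            #         which removes all of its out edges.
            keep = {e for e in edges
                    if e.holder in waiters and e.waiter in holders}

            # Rule 3: if a vertex has zero local out degree on a segment,
            #         its dotted in edges on that segment are deleted.
            local_out = {(e.segid, e.waiter) for e in keep}
            keep = {e for e in keep
                    if e.solid or (e.segid, e.holder) in local_out}

            if keep == edges:
                return edges          # fixed point reached
            edges = keep

If the returned set is non-empty, the remaining vertices form at least one cycle; as in the pseudocode above, they still have to be validated against the master before a victim is cancelled.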
In this algorithm we need to maintain the following information:
We maintain a local graph on each segment, just as in the local deadlock detection algorithm; the difference is that we also apply extra rules to reduce edges/vertices based on global information. There is no need to maintain a separate global graph.
Edges can be collected with the pg_locks view.
    select a.*, b.*
      from pg_locks a
      join pg_locks b
        on a.locktype = b.locktype
       and a.pid <> b.pid
       and a.granted = 'f'
       and a.locktype in ('transactionid', 'relation', 'tuple')
       and ((a.locktype = 'transactionid'
             and a.transactionid = b.transactionid)
            or (a.locktype = 'relation'
                and a.database = b.database
                and a.relation = b.relation)
            or (a.locktype = 'tuple'
                and a.database = b.database
                and a.relation = b.relation
                and a.page = b.page
                and a.tuple = b.tuple));
However, pg_locks lacks some necessary information:
So we have to create a new UDF, gp_dist_wait_status, to collect all the necessary information. This UDF is similar to pg_locks, but with the missing information included. An example output:
    select * from gp_dist_wait_status();

     segid | waiter_dxid | holder_dxid | holdTillEndXact
    -------+-------------+-------------+-----------------
        -1 |          29 |          28 | t
         0 |          28 |          26 | t
         0 |          27 |          29 | t
         1 |          26 |          27 | t
    (4 rows)
The local wait-relationship graphs collected from the segments are generated asynchronously; we have to take the impact of this asynchrony into account.
Our algorithm reduces all the edges that can possibly change. On each segment, the reduced graph's end vertices (vertices whose local out degree is 0) must have only solid in edges. We can then validate these transactions on the master: if they are all valid, the solid edges pointing to them are also valid, and so are the transactions waiting for them directly or indirectly, so the graphs are valid and the detected deadlock cycles are real; if any of these transactions is invalid on the master, the graphs are not reliable and we should retry later.
So our algorithm only requires the edge information to be consistent at the segment level; different segments can report their information at different times.
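A sketch of that validation step, under the assumption that the caller can obtain the set of distributed transaction ids still running on the master (a hypothetical input here):

    def deadlock_is_reliable(remaining_vertices, live_xids_on_master):
        """A surviving cycle is trusted only if every vertex in it is still a
        valid (running) distributed transaction as seen by the master."""
        return all(v in live_xids_on_master for v in remaining_vertices)

    # If transactions 26..29 survive the reduction but 27 has already finished
    # on the master, the snapshots were stale: skip this round and retry later.
    print(deadlock_is_reliable({26, 27, 28, 29}, {26, 28, 29}))   # -> False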
Edge types are detected locally by gp_dist_wait_status on each segment; the policies are:
- Relation locks that have been acquired in non NO_LOCK mode at least once are solid edges; otherwise they are dotted edges.

For relation locks there are ref counts in both Relation and LOCK. If a relation lock is opened in non NO_LOCK mode but closed in NO_LOCK mode, the ref count in LOCK can no longer reach 0, so the lock can no longer be released until end of transaction; that is why we can take it as a solid edge.
Let's run the new algorithm on an actual case to understand how it works.
The testing cluster has 3 segments: seg -1, seg 0 and seg 1; the master is on seg -1. A table t1 is created as below:
    create table t1 (id int, val int);
    insert into t1 values (1,1), (2,2), (3,3), (4,4);
Its data distribution is as below:
    select gp_segment_id, * from t1 order by id;

     gp_segment_id | id | val
    ---------------+----+-----
                 0 |  1 |   1
                 1 |  2 |   2
                 0 |  3 |   3
                 1 |  4 |   4
    (4 rows)
The sessions are like below:
    C: begin;
    C: update t1 set val=30 where id=2;

    A: begin;
    A: update t1 set val=10 where val=3;

    B: begin;
    B: update t1 set val=20 where id=4;
    B&: update t1 set val=20 where val=3 or id=2;
        -- B ==> A: xid lock on seg 0
        -- B ==> C: xid lock on seg 1

    A&: update t1 set val=10 where val=2;
        -- A ~~> B: tuple lock on seg 1

    D: begin;
    D&: update t1 set val=40 where id=4;
        -- D ==> B: xid lock on seg 1
With the collected edges we form the original graphs below on each segment:
    +-----------------------+    +-------------------------------------+
    | seg 0                 |    | seg 1                               |
    |                       |    |  +-----+                            |
    |  +-----+     +-----+  |    |  |  A  | ~~~~>  +-----+     +-----+ |
    |  |     |     |     |  |    |  +-----+        |     |     |     | |
    |  |  B  |====>|  A  |  |    |                 |  B  |====>|  C  | |
    |  |     |     |     |  |    |  +-----+        |     |     |     | |
    |  +-----+     +-----+  |    |  |  D  | ====>  +-----+     +-----+ |
    |                       |    |  +-----+                            |
    +-----------------------+    +-------------------------------------+
Edges are reduced in this order:
- B1 ==> C1: as C's global out degree is 0 (Rule 1);
- D1 ==> B1: as D's global in degree is 0 (Rule 2);
- A1 ~~> B1: as B1's local out degree is 0 and it's a dotted edge (Rule 3);
- B0 ==> A0: as A's global out degree is 0 (Rule 1).

As all edges are removed, there is no deadlock cycle.
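The same conclusion can be reproduced with a small standalone script that encodes the four collected edges and re-applies the three rules until a fixed point; the tuple layout (segid, waiter, holder, solid) is, again, only an illustration:

    # Edges collected in the example above; solid=False marks the dotted edge.
    edges = {
        (0, "B", "A", True),    # B ==> A: xid lock on seg 0
        (1, "B", "C", True),    # B ==> C: xid lock on seg 1
        (1, "A", "B", False),   # A ~~> B: tuple lock on seg 1
        (1, "D", "B", True),    # D ==> B: xid lock on seg 1
    }

    while True:
        waiters = {w for _, w, _, _ in edges}
        holders = {h for _, _, h, _ in edges}
        local_out = {(s, w) for s, w, _, _ in edges}
        keep = {(s, w, h, solid) for s, w, h, solid in edges
                if h in waiters                       # Rule 1
                and w in holders                      # Rule 2
                and (solid or (s, h) in local_out)}   # Rule 3
        if keep == edges:
            break
        edges = keep

    print(edges)   # -> set(): all edges are reduced away, so no deadlock

The fixed point does not depend on the order in which the rules fire: whichever rule is applied first, every edge in this example is eventually removed.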
vi:cc=80:tw=78:ai:et: