# Global Deadlock Detection
## Background
### Local Deadlock
Let's begin with a simple case, written in a syntax similar to that of the
isolation2 test framework:
```sql
A: begin;
A: update foo set val=val where id=1;
B: begin;
B: update foo set val=val where id=2;
B&: update foo set val=val where id=1;
A&: update foo set val=val where id=2;
```
In this case there are two concurrent transactions, A and B. At first A
holds an Xid lock on the tuple with `id=1` and B holds an Xid lock on the
tuple with `id=2`, without blocking. But then B gets blocked by A when
trying to update the tuple with `id=1`, and A gets blocked by B when trying
to update the tuple with `id=2`, so a deadlock is formed. In this case the
tuples with `id=1` and `id=2` are on the same segment, so the deadlock
happens locally.
### Local Deadlock Detection
On a single segment this deadlock can be automatically
detected. The detection is based on a cycle detection algorithm:
- each transaction is a vertex;
- each waiting relation is a directed edge;
- if there is a directed cycle in the graph then there is deadlock.
In the above case there are two vertices, A and B, and two directed edges,
`A==>B` and `B==>A`. These two edges obviously form a directed cycle, so the
deadlock can be detected.
```
+---------------------------+
|           seg 0           |
|                           |
|  +-----+   id=2   +-----+ |
|  |     | =======> |     | |
|  |  A  |          |  B  | |
|  |     | <======= |     | |
|  +-----+   id=1   +-----+ |
|                           |
+---------------------------+
```
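For illustration, here is a minimal sketch of such a cycle check in Python
(a plain depth-first search; the `edges` mapping and the function name are
ours, not the actual lock-manager code):
```python
# A minimal sketch of wait-for-graph cycle detection (illustrative only,
# not the actual lock manager implementation).
def has_deadlock(edges):
    """edges: dict mapping each transaction to those it waits for."""
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited / on the DFS stack / done
    color = {}

    def visit(v):
        color[v] = GRAY
        for w in edges.get(v, ()):
            if color.get(w, WHITE) == GRAY:   # back edge => directed cycle
                return True
            if color.get(w, WHITE) == WHITE and visit(w):
                return True
        color[v] = BLACK
        return False

    return any(color.get(v, WHITE) == WHITE and visit(v) for v in edges)

# The local example above: A waits for B (id=2), B waits for A (id=1).
assert has_deadlock({"A": ["B"], "B": ["A"]})
```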
### Global Deadlock
However, on a distributed cluster (e.g. Apache Cloudberry), rows with
different ids are distributed to different segments. Suppose rows with
`id=1` are on seg 0 and rows with `id=2` are on seg 1; then each segment
sees only a part of the graph:
```
+---------------------------+     +---------------------------+
|           seg 0           |     |           seg 1           |
|                           |     |                           |
|  +-----+        +-----+   |     |  +-----+        +-----+   |
|  |     |  id=1  |     |   |     |  |     |  id=2  |     |   |
|  |  B  | =====> |  A  |   |     |  |  A  | =====> |  B  |   |
|  |     |        |     |   |     |  |     |        |     |   |
|  +-----+        +-----+   |     |  +-----+        +-----+   |
|                           |     |                           |
+---------------------------+     +---------------------------+
```
This forms a global / distributed deadlock.
### Global Deadlock Prevention
To prevent global deadlocks, Apache Cloudberry holds an exclusive table
lock for `UPDATE` and `DELETE` commands, so concurrent updates are in
effect disabled.
This policy does prevent global deadlocks, but at the cost of poor
performance for concurrent updates.
## Global Deadlock Detection
If we can gather all these edges to one segment then we can form the same
graph as the [local one](#local-deadlock-detection) and detect the deadlock.
However, if we run the detection directly on the gathered edges, fake
deadlocks can be reported. A transaction can be waiting for at most one
other transaction on any single segment, but it can be waiting for multiple
other transactions globally in a distributed system. Some waiting relations
change automatically when the holder's status changes, but some do not. An
example of an unchanging waiting relation is waiting for an Xid lock (it can
only be released after the holding transaction ends). An example of a
changing waiting relation is waiting for a tuple lock (the tuple lock can be
released before the holding transaction ends). For concrete examples, please
refer to the test cases.
A proper global deadlock detector must be designed with these distributed
system specific characteristics in mind. The basic requirements are: if
there is a deadlock, find it and break it; and never raise false alarms.
Our global deadlock detection algorithm is based on the local version, with
extra logic to distinguish false deadlocks from real ones.
### Abstraction and Definition
- **vertex**: each transaction is a vertex;
- **edge**: if transaction A is waiting for transaction B then there is a
directed edge from vertex A to vertex B;
- **out edge**: an edge from A to B is an out edge of A;
- **in edge**: an edge from A to B is an in edge of B;
- **out degree**: the number of a vertex's out edges is its out degree;
- **in degree**: the number of a vertex's in edges is its in degree;
- **solid edge**: if an edge from A to B cannot be released until B
  COMMITs/ABORTs, then this edge is a solid edge;
- **dotted edge**: if an edge from A to B can be released when or before B's
current query is done, then this edge is a dotted edge;
- **local graph**: each segment has a local graph formed by all the local
edges on this segment;
- **global graph**: the edges on all the segments form a global graph;
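To make these definitions concrete, below is a hedged sketch of the
bookkeeping they imply; all names (`Edge`, `WaitGraph`, etc.) are
illustrative, not taken from the implementation:
```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    waiter: int     # the waiting transaction (vertex with the out edge)
    holder: int     # the holding transaction (vertex with the in edge)
    segid: int      # the segment this waiting relation is observed on
    solid: bool     # True: released only at COMMIT/ABORT; False: dotted

@dataclass
class WaitGraph:
    # local graphs: segid -> edges observed on that segment
    local_edges: dict = field(default_factory=lambda: defaultdict(set))
    # global degrees: transaction -> degree summed over all segments
    global_out: dict = field(default_factory=lambda: defaultdict(int))
    global_in: dict = field(default_factory=lambda: defaultdict(int))

    def add(self, edge: Edge):
        self.local_edges[edge.segid].add(edge)
        self.global_out[edge.waiter] += 1
        self.global_in[edge.holder] += 1
```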
### Rules and Policies
The detection is based on one basic assumption:
- if some vertices (transactions) form a cycle (deadlock), then they and
  their edges (waiting relations) will no longer change their status;
And we know one fact:
- an edge is either a dotted edge or a solid edge; there is no other
  possibility;
According to these, we have below deductions:
- if the status of a vertex or an edge is possible to change, then this
vertex/edge is not part of a cycle (yet);
- if a vertex has no out edge on any segment then its status is possible to
change;
- --> `Rule 1: delete a vertex if its global out degree is 0`;
- if a vertex has no in edge on any segment then it's not part of a cycle;
- --> `Rule 2: delete a vertex if its global in degree is 0`;
- if a vertex has no out edge on segment X but has an out edge on segment Y:
- its solid in edges on X are not possible to change;
- its dotted in edges on X are possible to change;
- --> `Rule 3: delete a vertex's dotted in edges on a segment if its local
out degree is 0`;
Rules 1 & 2 are straightforward, as they are also the rules for local
deadlock detection. Rule 3 needs further deduction:
Suppose:
- vertex C has no out edge on segment X, but has an out edge on segment Y;
- there is a solid edge from A to C on X;
- there is a dotted edge from B to C on X;
For A:
- A is waiting for a transaction-level lock/object held by C on X; C will
  not release it until the end of C's transaction;
- C's transaction can only end after it has ended on all of C's segments;
- as we know C has an out edge on Y, C will not end on Y, so C's transaction
  will not end, and thus the lock/object held by C on X will not be
  released;
- in summary, if a vertex has global out edge(s), then none of its solid in
  edges on any segment can be removed;
For B:
- B is waiting for a query-level lock/object held by C on X;
- as C has no out edge on X, C's own status is still possible to change;
- so C may release any query-level lock/object it currently holds;
- so this dotted edge from B to C on X is possible to change;
- in summary, if a vertex has no local out edge on a segment, then all its
  dotted in edges on that segment should be removed;
### Algorithm
With these 3 rules we can build an algorithm like this:
```python
# reduce edges and vertices
while True:
for vertex in vertices_with_zero_global_out_degree():
# delete vertex and its solid/dotted in edges on all segments
delete_vertex(vertex)
for vertex in vertices_with_zero_global_in_degree():
# delete vertex and its solid/dotted out edges on all segments
delete_vertex(vertex)
for segment in all_segments():
for vertex in vertices_with_zero_local_out_degree(segment):
# delete vertex's dotted in edges on segment,
# related vertices' global in/out degrees are also updated
delete_dotted_in_edges_on_segment(segment, vertex)
if nothing_deleted_in_current_loop():
# no more vertex or edge to delete
break
# detect for deadlock cycle
if count_vertices_not_deleted() > 0:
# detected at least one cycle
# the cycles are only reliable if all the contained vertices
# are valid transactions on master
if all_vertices_are_valid():
        # choose a victim vertex and cancel it
        cancel_some_vertex()  # placeholder helper, like the ones above
```
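The pseudo code above leaves the helpers abstract; the following is a
runnable simplification of the same reduction loop, representing each edge
as a `(waiter, holder, segid, solid)` tuple (our own encoding, chosen for
this sketch rather than taken from the implementation):
```python
# A runnable simplification of the pseudo code above; every name here is
# ours, chosen for this sketch.
def reduce_edges(edges):
    """edges: set of (waiter, holder, segid, solid) tuples.
    Returns the edges surviving Rules 1-3; a non-empty result means the
    remaining vertices may form deadlock cycle(s)."""
    edges = set(edges)
    changed = True
    while changed:
        changed = False
        waiters = {w for (w, h, s, solid) in edges}
        holders = {h for (w, h, s, solid) in edges}
        # Rule 1: a vertex with global out degree 0 is not in `waiters`;
        # Rule 2: a vertex with global in degree 0 is not in `holders`;
        # deleting a vertex deletes all of its in/out edges.
        doomed = {v for v in waiters | holders
                  if v not in waiters or v not in holders}
        if doomed:
            edges = {e for e in edges
                     if e[0] not in doomed and e[1] not in doomed}
            changed = True
            continue
        # Rule 3: drop a dotted in edge on a segment where its holder
        # has local out degree 0.
        local_out = {(w, s) for (w, h, s, solid) in edges}
        dotted = {e for e in edges
                  if not e[3] and (e[1], e[2]) not in local_out}
        if dotted:
            edges -= dotted
            changed = True
    return edges
```
A non-empty result is only a candidate deadlock; as the pseudo code notes,
the surviving vertices must still be validated on master before anything is
cancelled.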
In this algorithm we need to maintain the following information:
- local in/out edges of all the vertices on each segment;
- global in/out degrees of all the vertices;
We maintain a local graph for each segment, just like the local deadlock
detection algorithm; the difference is that we also introduce extra rules
to reduce edges/vertices based on global information. There is no need to
maintain a global graph.
### Edge Collection
Edges can be collected with the `pg_locks` view.
```sql
select a.*, b.*
from pg_locks a
join pg_locks b on
    a.locktype=b.locktype and
    a.pid<>b.pid and
    a.granted='f' and
    b.granted='t' and
    a.locktype in ('transactionid', 'relation', 'tuple') and
    ((a.locktype='transactionid' and
      a.transactionid=b.transactionid) or
     (a.locktype='relation' and
      a.database=b.database and
      a.relation=b.relation) or
     (a.locktype='tuple' and
      a.database=b.database and
      a.relation=b.relation and
      a.page=b.page and
      a.tuple=b.tuple));
```
However `pg_locks` lacks some necessary information:
- the waiter's xid;
- the edge type: dotted or solid.
So we created a new UDF, `gp_dist_wait_status()`, to collect all the
necessary information. This UDF is similar to `pg_locks`, but with this
information included. An example output:
```sql
select * from gp_dist_wait_status();
 segid | waiter_dxid | holder_dxid | holdTillEndXact
-------+-------------+-------------+-----------------
    -1 |          29 |          28 | t
     0 |          28 |          26 | t
     0 |          27 |          29 | t
     1 |          26 |          27 | t
(4 rows)
```
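Rows like these map directly onto the edge tuples used in the sketch above;
`holdTillEndXact = t` marks a solid edge. Note that these four sample edges
form the cycle `29 ==> 28 ==> 26 ==> 27 ==> 29`, so none of them would be
reduced:
```python
# Hypothetical glue: convert gp_dist_wait_status() rows into the
# (waiter, holder, segid, solid) tuples used by reduce_edges() above.
rows = [
    # (segid, waiter_dxid, holder_dxid, holdTillEndXact)
    (-1, 29, 28, True),
    ( 0, 28, 26, True),
    ( 0, 27, 29, True),
    ( 1, 26, 27, True),
]
edges = {(waiter, holder, segid, solid)
         for (segid, waiter, holder, solid) in rows}
assert len(reduce_edges(edges)) == 4   # all four edges survive: a cycle
```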
### Impact of the Async Edge Collection
The local wait-relationship graphs collected from the segments are generated
asynchronously, so we have to take the impact of this into account:
- if a transaction A is valid on master then it's valid on all segments;
- transaction A's solid in edges are all valid as long as A is valid on master;
- if there is a solid edge from B to A on a segment, then B is valid as long as
A is valid;
- furthermore, if there is a solid edge from B to A on a segment, then B is
  valid on all segments as long as A is valid on master;
- if there is a dotted edge from B to A on a segment, then this edge is
  valid unless A's status has changed;
- if there is a solid edge from B to A, then all the vertices waiting for B
  directly or indirectly are valid on all segments;
Our algorithm reduces all the edges that are possible to change. On each
segment, the end vertices of the reduced graph (vertices whose local out
degree is 0) must have only solid in edges. We then validate these
transactions on master: if they are all valid, then the solid edges to them
are also valid, so the transactions waiting for them directly or indirectly
are valid, so the graphs are valid and the detected deadlock cycles are
real; if any of the transactions is invalid on master, then the graphs are
not reliable and we should retry later.
So our algorithm only requires the edge information to be consistent at the
segment level; different segments can report their information at different
times.
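Under that observation the final check can be sketched as below;
`is_active_on_master` is a hypothetical callback standing in for the
master's transaction-status lookup, and `reduce_edges()` is the sketch from
the [Algorithm](#algorithm) section:
```python
# Hedged sketch of the final validity check; names are ours, not the
# actual implementation.
def detect(edges, is_active_on_master):
    remaining = reduce_edges(edges)
    vertices = {v for e in remaining for v in (e[0], e[1])}
    if not vertices:
        return None                   # no deadlock
    if all(is_active_on_master(v) for v in vertices):
        return vertices               # true deadlock: cancel one of these
    return None                       # stale snapshot: retry later
```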
### Edge Classifying
Edge types are determined locally by `gp_dist_wait_status` on each segment;
the policies are:
- Xid waits are solid edges;
- relation waits on a lock that has ever been closed in `NO_LOCK` mode at
  least once are solid edges, otherwise they are dotted edges;
- all other waits are dotted edges.
For relation locks, there are reference counts in both `Relation` and
`LOCK`. If a relation lock is opened in a non-`NO_LOCK` mode but closed in
`NO_LOCK` mode, then the ref count in `LOCK` can no longer reach 0, so the
lock can no longer be released until the end of the transaction; that's why
we can treat it as a solid edge.
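The policy fits in a few lines; the function and parameter names here are
our own illustration of the rules above, not the actual
`gp_dist_wait_status` internals:
```python
# Our own illustration of the edge classification policy.
def is_solid_edge(locktype, ever_closed_with_no_lock=False):
    if locktype == 'transactionid':   # Xid waits: solid
        return True
    if locktype == 'relation':
        # solid once the lock has ever been closed in NO_LOCK mode,
        # since its ref count in LOCK can no longer drop to zero
        return ever_closed_with_no_lock
    return False                      # everything else is dotted
```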
### Case Analysis
Let's run the new algorithm on an actual case to understand how it works.
The testing cluster has 3 segments: seg -1, seg 0 and seg 1. Master is on seg
-1. A table `t1` is created as below:
```sql
create table t1 (id int, val int);
insert into t1 values (1,1), (2,2), (3,3), (4,4);
```
Its data distribution is as below:
```sql
select gp_segment_id, * from t1 order by id;
gp_segment_id | id | val
---------------+----+-----
0 | 1 | 1
1 | 2 | 2
0 | 3 | 3
1 | 4 | 4
(4 rows)
```
The sessions are like below:
```sql
C: begin;
C: update t1 set val=30 where id=2;
A: begin;
A: update t1 set val=10 where val=3;
B: begin;
B: update t1 set val=20 where id=4;
B&: update t1 set val=20 where val=3 or id=2;
-- B ==> A: xid lock on seg 0
-- B ==> C: xid lock on seg 1
A&: update t1 set val=10 where val=2;
-- A ~~> B: tuple lock on seg 1
D: begin;
D&: update t1 set val=40 where id=4;
-- D ==> B: xid lock on seg 1
```
With the collected edges we form the following graphs on each segment:
```
+-------------------------+     +--------------------------------------+
|          seg 0          |     |                seg 1                 |
|                         |     |  +-----+                             |
|  +-----+       +-----+  |     |  |  A  | ~~~~> +-----+       +-----+ |
|  |     |       |     |  |     |  +-----+       |     |       |     | |
|  |  B  | ====> |  A  |  |     |                |  B  | ====> |  C  | |
|  |     |       |     |  |     |  +-----+       |     |       |     | |
|  +-----+       +-----+  |     |  |  D  | ====> +-----+       +-----+ |
|                         |     |  +-----+                             |
+-------------------------+     +--------------------------------------+
```
Edges are reduced in this order:
- `B1 ==> C1`: as C's global out degree is 0 (Rule 1);
- `D1 ==> B1`: as D's global in degree is 0 (Rule 2);
- `A1 ~~> B1`: as B1's local out degree is 0 and it's a dotted edge (Rule 3);
- `B0 ==> A0`: as A's global out degree is now 0 (Rule 1);
As all edges are removed, there is no deadlock cycle.
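Feeding these edges into the `reduce_edges()` sketch from the
[Algorithm](#algorithm) section reproduces this result:
```python
# Edges collected in this example, as (waiter, holder, segid, solid):
edges = {
    ('B', 'A', 0, True),    # B ==> A: xid lock on seg 0
    ('A', 'B', 1, False),   # A ~~> B: tuple lock on seg 1
    ('B', 'C', 1, True),    # B ==> C: xid lock on seg 1
    ('D', 'B', 1, True),    # D ==> B: xid lock on seg 1
}
assert reduce_edges(edges) == set()   # everything reduced: no deadlock
```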
vi:cc=80:tw=78:ai:et: