src/backend/access/heap/README.tuplock - cloudberry - Git at Google

 Locking tuples
 --------------

 Locking tuples is not as easy as locking tables or other database objects.
 The problem is that transactions might want to lock large numbers of tuples at
 any one time, so it's not possible to keep the locks objects in shared memory.
 To work around this limitation, we use a two-level mechanism.  The first level
 is implemented by storing locking information in the tuple header: a tuple is
 marked as locked by setting the current transaction's XID as its XMAX, and
 setting additional infomask bits to distinguish this case from the more normal
 case of having deleted the tuple.  When multiple transactions concurrently
 lock a tuple, a MultiXact is used; see below.  This mechanism can accommodate
 arbitrarily large numbers of tuples being locked simultaneously.

 When it is necessary to wait for a tuple-level lock to be released, the basic
 delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
 the tuple's XMAX.  However, that mechanism will release all waiters
 concurrently, so there would be a race condition as to which waiter gets the
 tuple, potentially leading to indefinite starvation of some waiters.  The
 possibility of share-locking makes the problem much worse --- a steady stream
 of share-lockers can easily block an exclusive locker forever.  To provide
 more reliable semantics about who gets a tuple-level lock first, we use the
 standard lock manager, which implements the second level mentioned above.  The
 protocol for waiting for a tuple-level lock is really

      LockTuple()
      XactLockTableWait()
      mark tuple as locked by me
      UnlockTuple()

 When there are multiple waiters, arbitration of who is to get the lock next
 is provided by LockTuple().  However, at most one tuple-level lock will
 be held or awaited per backend at any time, so we don't risk overflow
 of the lock table.  Note that incoming share-lockers are required to
 do LockTuple as well, if there is any conflict, to ensure that they don't
 starve out waiting exclusive-lockers.  However, if there is not any active
 conflict for a tuple, we don't incur any extra overhead.

 We make an exception to the above rule for those lockers that already hold
 some lock on a tuple and attempt to acquire a stronger one on it.  In that
 case, we skip the LockTuple() call even when there are conflicts, provided
 that the target tuple is being locked, updated or deleted by multiple sessions
 concurrently.  Failing to skip the lock would risk a deadlock, e.g., between a
 session that was first to record its weaker lock in the tuple header and would
 be waiting on the LockTuple() call to upgrade to the stronger lock level, and
 another session that has already done LockTuple() and is waiting for the first
 session transaction to release its tuple header-level lock.

 We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
 exclusive lock which prevents any kind of modification of the tuple. This is
 the lock level that is implicitly taken by DELETE operations, and also by
 UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
 KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
 and modifications which might alter the tuple's key. This is the lock that is
 implicitly taken by UPDATE operations which leave all key fields unchanged.
 SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
 modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
 prevents tuple removal and modifications of key fields. This lock level is
 just strong enough to implement RI checks, i.e. it ensures that tuples do not
 go away from under a check, without blocking transactions that want to update
 the tuple without changing its key.

 The conflict table is:

                   UPDATE       NO KEY UPDATE    SHARE        KEY SHARE
 UPDATE           conflict        conflict      conflict      conflict
 NO KEY UPDATE    conflict        conflict      conflict
 SHARE            conflict        conflict
 KEY SHARE        conflict

 When there is a single locker in a tuple, we can just store the locking info
 in the tuple itself.  We do this by storing the locker's Xid in XMAX, and
 setting infomask bits specifying the locking strength.  There is one exception
 here: since infomask space is limited, we do not provide a separate bit
 for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
 that case.  (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
 presumably more commonly used due to being the standards-mandated locking
 mechanism, or heavily used by the RI code, so we want to provide fast paths
 for those.)

 MultiXacts
 ----------

 A tuple header provides very limited space for storing information about tuple
 locking and updates: there is room only for a single Xid and a small number of
 infomask bits.  Whenever we need to store more than one lock, we replace the
 first locker's Xid with a new MultiXactId.  Each MultiXact provides extended
 locking data; it comprises an array of Xids plus some flags bits for each one.
 The flags are currently used to store the locking strength of each member
 transaction.  (The flags also distinguish a pure locker from an updater.)

 In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
 locked in shared mode by multiple transactions.  This is no longer the case; a
 MultiXact may contain an update or delete Xid.  (Keep in mind that tuple locks
 in a transaction do not conflict with other tuple locks in the same
 transaction, so it's possible to have otherwise conflicting locks in a
 MultiXact if they belong to the same transaction).

 Note that each lock is attributed to the subtransaction that acquires it.
 This means that a subtransaction that aborts is seen as though it releases the
 locks it acquired; concurrent transactions can then proceed without having to
 wait for the main transaction to finish.  It also means that a subtransaction
 can upgrade to a stronger lock level than an earlier transaction had, and if
 the subxact aborts, the earlier, weaker lock is kept.

 The possibility of having an update within a MultiXact means that they must
 persist across crashes and restarts: a future reader of the tuple needs to
 figure out whether the update committed or aborted.  So we have a requirement
 that pg_multixact needs to retain pages of its data until we're certain that
 the MultiXacts in them are no longer of interest.

 VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
 The lower bound used by vacuum (that is, the value below which all multixacts
 are removed) is stored as pg_class.relminmxid for each table; the minimum of
 all such values is stored in pg_database.datminmxid.  The minimum across
 all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
 removes pg_multixact/ segments older than that value once the checkpoint
 record has been flushed.

 Infomask Bits
 -------------

 The following infomask bits are applicable:

 - HEAP_XMAX_INVALID
   Any tuple with this bit set does not have a valid value stored in XMAX.

 - HEAP_XMAX_IS_MULTI
   This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
   regular TransactionId).

 - HEAP_XMAX_LOCK_ONLY
   This bit is set when the XMAX is a locker only; that is, if it's a
   multixact, it does not contain an update among its members.  It's set when
   the XMAX is a plain Xid that locked the tuple, as well.

 - HEAP_XMAX_KEYSHR_LOCK
 - HEAP_XMAX_SHR_LOCK
 - HEAP_XMAX_EXCL_LOCK
   These bits indicate the strength of the lock acquired; they are useful when
   the XMAX is not a MultiXactId.  If it's a multi, the info is to be found in
   the member flags.  If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
   is set, then one of these *must* be set as well.

   Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
   FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

 - HEAP_KEYS_UPDATED
   This bit lives in t_infomask2.  If set, indicates that the operation(s) done
   by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE
   that modifies the columns of the key, or a DELETE.  It's set regardless of
   whether the XMAX is a TransactionId or a MultiXactId.

 We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
 is set.
	Locking tuples
	--------------

	Locking tuples is not as easy as locking tables or other database objects.
	The problem is that transactions might want to lock large numbers of tuples at
	any one time, so it's not possible to keep the locks objects in shared memory.
	To work around this limitation, we use a two-level mechanism. The first level
	is implemented by storing locking information in the tuple header: a tuple is
	marked as locked by setting the current transaction's XID as its XMAX, and
	setting additional infomask bits to distinguish this case from the more normal
	case of having deleted the tuple. When multiple transactions concurrently
	lock a tuple, a MultiXact is used; see below. This mechanism can accommodate
	arbitrarily large numbers of tuples being locked simultaneously.

	When it is necessary to wait for a tuple-level lock to be released, the basic
	delay is provided by XactLockTableWait or MultiXactIdWait on the contents of
	the tuple's XMAX. However, that mechanism will release all waiters
	concurrently, so there would be a race condition as to which waiter gets the
	tuple, potentially leading to indefinite starvation of some waiters. The
	possibility of share-locking makes the problem much worse --- a steady stream
	of share-lockers can easily block an exclusive locker forever. To provide
	more reliable semantics about who gets a tuple-level lock first, we use the
	standard lock manager, which implements the second level mentioned above. The
	protocol for waiting for a tuple-level lock is really

	LockTuple()
	XactLockTableWait()
	mark tuple as locked by me
	UnlockTuple()

	When there are multiple waiters, arbitration of who is to get the lock next
	is provided by LockTuple(). However, at most one tuple-level lock will
	be held or awaited per backend at any time, so we don't risk overflow
	of the lock table. Note that incoming share-lockers are required to
	do LockTuple as well, if there is any conflict, to ensure that they don't
	starve out waiting exclusive-lockers. However, if there is not any active
	conflict for a tuple, we don't incur any extra overhead.

	We make an exception to the above rule for those lockers that already hold
	some lock on a tuple and attempt to acquire a stronger one on it. In that
	case, we skip the LockTuple() call even when there are conflicts, provided
	that the target tuple is being locked, updated or deleted by multiple sessions
	concurrently. Failing to skip the lock would risk a deadlock, e.g., between a
	session that was first to record its weaker lock in the tuple header and would
	be waiting on the LockTuple() call to upgrade to the stronger lock level, and
	another session that has already done LockTuple() and is waiting for the first
	session transaction to release its tuple header-level lock.

	We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an
	exclusive lock which prevents any kind of modification of the tuple. This is
	the lock level that is implicitly taken by DELETE operations, and also by
	UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO
	KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal
	and modifications which might alter the tuple's key. This is the lock that is
	implicitly taken by UPDATE operations which leave all key fields unchanged.
	SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple
	modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only
	prevents tuple removal and modifications of key fields. This lock level is
	just strong enough to implement RI checks, i.e. it ensures that tuples do not
	go away from under a check, without blocking transactions that want to update
	the tuple without changing its key.

	The conflict table is:

	UPDATE NO KEY UPDATE SHARE KEY SHARE
	UPDATE conflict conflict conflict conflict
	NO KEY UPDATE conflict conflict conflict
	SHARE conflict conflict
	KEY SHARE conflict

	When there is a single locker in a tuple, we can just store the locking info
	in the tuple itself. We do this by storing the locker's Xid in XMAX, and
	setting infomask bits specifying the locking strength. There is one exception
	here: since infomask space is limited, we do not provide a separate bit
	for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in
	that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are
	presumably more commonly used due to being the standards-mandated locking
	mechanism, or heavily used by the RI code, so we want to provide fast paths
	for those.)

	MultiXacts
	----------

	A tuple header provides very limited space for storing information about tuple
	locking and updates: there is room only for a single Xid and a small number of
	infomask bits. Whenever we need to store more than one lock, we replace the
	first locker's Xid with a new MultiXactId. Each MultiXact provides extended
	locking data; it comprises an array of Xids plus some flags bits for each one.
	The flags are currently used to store the locking strength of each member
	transaction. (The flags also distinguish a pure locker from an updater.)

	In earlier PostgreSQL releases, a MultiXact always meant that the tuple was
	locked in shared mode by multiple transactions. This is no longer the case; a
	MultiXact may contain an update or delete Xid. (Keep in mind that tuple locks
	in a transaction do not conflict with other tuple locks in the same
	transaction, so it's possible to have otherwise conflicting locks in a
	MultiXact if they belong to the same transaction).

	Note that each lock is attributed to the subtransaction that acquires it.
	This means that a subtransaction that aborts is seen as though it releases the
	locks it acquired; concurrent transactions can then proceed without having to
	wait for the main transaction to finish. It also means that a subtransaction
	can upgrade to a stronger lock level than an earlier transaction had, and if
	the subxact aborts, the earlier, weaker lock is kept.

	The possibility of having an update within a MultiXact means that they must
	persist across crashes and restarts: a future reader of the tuple needs to
	figure out whether the update committed or aborted. So we have a requirement
	that pg_multixact needs to retain pages of its data until we're certain that
	the MultiXacts in them are no longer of interest.

	VACUUM is in charge of removing old MultiXacts at the time of tuple freezing.
	The lower bound used by vacuum (that is, the value below which all multixacts
	are removed) is stored as pg_class.relminmxid for each table; the minimum of
	all such values is stored in pg_database.datminmxid. The minimum across
	all databases, in turn, is recorded in checkpoint records, and CHECKPOINT
	removes pg_multixact/ segments older than that value once the checkpoint
	record has been flushed.

	Infomask Bits
	-------------

	The following infomask bits are applicable:

	- HEAP_XMAX_INVALID
	Any tuple with this bit set does not have a valid value stored in XMAX.

	- HEAP_XMAX_IS_MULTI
	This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a
	regular TransactionId).

	- HEAP_XMAX_LOCK_ONLY
	This bit is set when the XMAX is a locker only; that is, if it's a
	multixact, it does not contain an update among its members. It's set when
	the XMAX is a plain Xid that locked the tuple, as well.

	- HEAP_XMAX_KEYSHR_LOCK
	- HEAP_XMAX_SHR_LOCK
	- HEAP_XMAX_EXCL_LOCK
	These bits indicate the strength of the lock acquired; they are useful when
	the XMAX is not a MultiXactId. If it's a multi, the info is to be found in
	the member flags. If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY
	is set, then one of these must be set as well.

	Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from
	FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit.

	- HEAP_KEYS_UPDATED
	This bit lives in t_infomask2. If set, indicates that the operation(s) done
	by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE
	that modifies the columns of the key, or a DELETE. It's set regardless of
	whether the XMAX is a TransactionId or a MultiXactId.

	We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit
	is set.