| Locking tuples |
| -------------- |
| |
| Locking tuples is not as easy as locking tables or other database objects. |
| The problem is that transactions might want to lock large numbers of tuples at |
| any one time, so it's not possible to keep the locks objects in shared memory. |
| To work around this limitation, we use a two-level mechanism. The first level |
| is implemented by storing locking information in the tuple header: a tuple is |
| marked as locked by setting the current transaction's XID as its XMAX, and |
| setting additional infomask bits to distinguish this case from the more normal |
| case of having deleted the tuple. When multiple transactions concurrently |
| lock a tuple, a MultiXact is used; see below. This mechanism can accommodate |
| arbitrarily large numbers of tuples being locked simultaneously. |
| |
| When it is necessary to wait for a tuple-level lock to be released, the basic |
| delay is provided by XactLockTableWait or MultiXactIdWait on the contents of |
| the tuple's XMAX. However, that mechanism will release all waiters |
| concurrently, so there would be a race condition as to which waiter gets the |
| tuple, potentially leading to indefinite starvation of some waiters. The |
| possibility of share-locking makes the problem much worse --- a steady stream |
| of share-lockers can easily block an exclusive locker forever. To provide |
| more reliable semantics about who gets a tuple-level lock first, we use the |
| standard lock manager, which implements the second level mentioned above. The |
| protocol for waiting for a tuple-level lock is really |
| |
| LockTuple() |
| XactLockTableWait() |
| mark tuple as locked by me |
| UnlockTuple() |
| |
| When there are multiple waiters, arbitration of who is to get the lock next |
| is provided by LockTuple(). However, at most one tuple-level lock will |
| be held or awaited per backend at any time, so we don't risk overflow |
| of the lock table. Note that incoming share-lockers are required to |
| do LockTuple as well, if there is any conflict, to ensure that they don't |
| starve out waiting exclusive-lockers. However, if there is not any active |
| conflict for a tuple, we don't incur any extra overhead. |
| |
| We make an exception to the above rule for those lockers that already hold |
| some lock on a tuple and attempt to acquire a stronger one on it. In that |
| case, we skip the LockTuple() call even when there are conflicts, provided |
| that the target tuple is being locked, updated or deleted by multiple sessions |
| concurrently. Failing to skip the lock would risk a deadlock, e.g., between a |
| session that was first to record its weaker lock in the tuple header and would |
| be waiting on the LockTuple() call to upgrade to the stronger lock level, and |
| another session that has already done LockTuple() and is waiting for the first |
| session transaction to release its tuple header-level lock. |
| |
| We provide four levels of tuple locking strength: SELECT FOR UPDATE obtains an |
| exclusive lock which prevents any kind of modification of the tuple. This is |
| the lock level that is implicitly taken by DELETE operations, and also by |
| UPDATE operations if they modify any of the tuple's key fields. SELECT FOR NO |
| KEY UPDATE likewise obtains an exclusive lock, but only prevents tuple removal |
| and modifications which might alter the tuple's key. This is the lock that is |
| implicitly taken by UPDATE operations which leave all key fields unchanged. |
| SELECT FOR SHARE obtains a shared lock which prevents any kind of tuple |
| modification. Finally, SELECT FOR KEY SHARE obtains a shared lock which only |
| prevents tuple removal and modifications of key fields. This lock level is |
| just strong enough to implement RI checks, i.e. it ensures that tuples do not |
| go away from under a check, without blocking transactions that want to update |
| the tuple without changing its key. |
| |
| The conflict table is: |
| |
| UPDATE NO KEY UPDATE SHARE KEY SHARE |
| UPDATE conflict conflict conflict conflict |
| NO KEY UPDATE conflict conflict conflict |
| SHARE conflict conflict |
| KEY SHARE conflict |
| |
| When there is a single locker in a tuple, we can just store the locking info |
| in the tuple itself. We do this by storing the locker's Xid in XMAX, and |
| setting infomask bits specifying the locking strength. There is one exception |
| here: since infomask space is limited, we do not provide a separate bit |
| for SELECT FOR SHARE, so we have to use the extended info in a MultiXact in |
| that case. (The other cases, SELECT FOR UPDATE and SELECT FOR KEY SHARE, are |
| presumably more commonly used due to being the standards-mandated locking |
| mechanism, or heavily used by the RI code, so we want to provide fast paths |
| for those.) |
| |
| MultiXacts |
| ---------- |
| |
| A tuple header provides very limited space for storing information about tuple |
| locking and updates: there is room only for a single Xid and a small number of |
| infomask bits. Whenever we need to store more than one lock, we replace the |
| first locker's Xid with a new MultiXactId. Each MultiXact provides extended |
| locking data; it comprises an array of Xids plus some flags bits for each one. |
| The flags are currently used to store the locking strength of each member |
| transaction. (The flags also distinguish a pure locker from an updater.) |
| |
| In earlier PostgreSQL releases, a MultiXact always meant that the tuple was |
| locked in shared mode by multiple transactions. This is no longer the case; a |
| MultiXact may contain an update or delete Xid. (Keep in mind that tuple locks |
| in a transaction do not conflict with other tuple locks in the same |
| transaction, so it's possible to have otherwise conflicting locks in a |
| MultiXact if they belong to the same transaction). |
| |
| Note that each lock is attributed to the subtransaction that acquires it. |
| This means that a subtransaction that aborts is seen as though it releases the |
| locks it acquired; concurrent transactions can then proceed without having to |
| wait for the main transaction to finish. It also means that a subtransaction |
| can upgrade to a stronger lock level than an earlier transaction had, and if |
| the subxact aborts, the earlier, weaker lock is kept. |
| |
| The possibility of having an update within a MultiXact means that they must |
| persist across crashes and restarts: a future reader of the tuple needs to |
| figure out whether the update committed or aborted. So we have a requirement |
| that pg_multixact needs to retain pages of its data until we're certain that |
| the MultiXacts in them are no longer of interest. |
| |
| VACUUM is in charge of removing old MultiXacts at the time of tuple freezing. |
| The lower bound used by vacuum (that is, the value below which all multixacts |
| are removed) is stored as pg_class.relminmxid for each table; the minimum of |
| all such values is stored in pg_database.datminmxid. The minimum across |
| all databases, in turn, is recorded in checkpoint records, and CHECKPOINT |
| removes pg_multixact/ segments older than that value once the checkpoint |
| record has been flushed. |
| |
| Infomask Bits |
| ------------- |
| |
| The following infomask bits are applicable: |
| |
| - HEAP_XMAX_INVALID |
| Any tuple with this bit set does not have a valid value stored in XMAX. |
| |
| - HEAP_XMAX_IS_MULTI |
| This bit is set if the tuple's Xmax is a MultiXactId (as opposed to a |
| regular TransactionId). |
| |
| - HEAP_XMAX_LOCK_ONLY |
| This bit is set when the XMAX is a locker only; that is, if it's a |
| multixact, it does not contain an update among its members. It's set when |
| the XMAX is a plain Xid that locked the tuple, as well. |
| |
| - HEAP_XMAX_KEYSHR_LOCK |
| - HEAP_XMAX_SHR_LOCK |
| - HEAP_XMAX_EXCL_LOCK |
| These bits indicate the strength of the lock acquired; they are useful when |
| the XMAX is not a MultiXactId. If it's a multi, the info is to be found in |
| the member flags. If HEAP_XMAX_IS_MULTI is not set and HEAP_XMAX_LOCK_ONLY |
| is set, then one of these *must* be set as well. |
| |
| Note that HEAP_XMAX_EXCL_LOCK does not distinguish FOR NO KEY UPDATE from |
| FOR UPDATE; this is implemented by the HEAP_KEYS_UPDATED bit. |
| |
| - HEAP_KEYS_UPDATED |
| This bit lives in t_infomask2. If set, indicates that the operation(s) done |
| by the XMAX compromise the tuple key, such as a SELECT FOR UPDATE, an UPDATE |
| that modifies the columns of the key, or a DELETE. It's set regardless of |
| whether the XMAX is a TransactionId or a MultiXactId. |
| |
| We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit |
| is set. |