| src/backend/access/transam/README |
| |
| The Transaction System |
| ====================== |
| |
| PostgreSQL's transaction system is a three-layer system. The bottom layer |
| implements low-level transactions and subtransactions, on top of which rests |
| the mainloop's control code, which in turn implements user-visible |
| transactions and savepoints. |
| |
| The middle layer of code is called by postgres.c before and after the |
| processing of each query, or after detecting an error: |
| |
| StartTransactionCommand |
| CommitTransactionCommand |
| AbortCurrentTransaction |
| |
| Meanwhile, the user can alter the system's state by issuing the SQL commands |
| BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE. The traffic cop |
| redirects these calls to the toplevel routines |
| |
| BeginTransactionBlock |
| EndTransactionBlock |
| UserAbortTransactionBlock |
| DefineSavepoint |
| RollbackToSavepoint |
| ReleaseSavepoint |
| |
| respectively. Depending on the current state of the system, these functions |
| call low level functions to activate the real transaction system: |
| |
| StartTransaction |
| CommitTransaction |
| AbortTransaction |
| CleanupTransaction |
| StartSubTransaction |
| CommitSubTransaction |
| AbortSubTransaction |
| CleanupSubTransaction |
| |
| Additionally, within a transaction, CommandCounterIncrement is called to |
| increment the command counter, which allows future commands to "see" the |
| effects of previous commands within the same transaction. Note that this is |
| done automatically by CommitTransactionCommand after each query inside a |
| transaction block, but some utility functions also do it internally to allow |
| some operations (usually in the system catalogs) to be seen by future |
| operations in the same utility command. (For example, in DefineRelation it is |
| done after creating the heap so the pg_class row is visible, to be able to |
| lock it.) |
| |
| |
| For example, consider the following sequence of user commands: |
| |
| 1) BEGIN |
| 2) SELECT * FROM foo |
| 3) INSERT INTO foo VALUES (...) |
| 4) COMMIT |
| |
| In the main processing loop, this results in the following function call |
| sequence: |
| |
|      /  StartTransactionCommand; |
|     /       StartTransaction; |
| 1) <    ProcessUtility;                 << BEGIN |
|     \       BeginTransactionBlock; |
|      \  CommitTransactionCommand; |
| |
|     /   StartTransactionCommand; |
| 2) /    PortalRunSelect;                << SELECT ... |
|    \    CommitTransactionCommand; |
|     \       CommandCounterIncrement; |
| |
|     /   StartTransactionCommand; |
| 3) /    ProcessQuery;                   << INSERT ... |
|    \    CommitTransactionCommand; |
|     \       CommandCounterIncrement; |
| |
|      /  StartTransactionCommand; |
|     /   ProcessUtility;                 << COMMIT |
| 4) <        EndTransactionBlock; |
|     \   CommitTransactionCommand; |
|      \      CommitTransaction; |
| |
| The point of this example is to demonstrate the need for |
| StartTransactionCommand and CommitTransactionCommand to be state smart -- they |
| should call CommandCounterIncrement between the calls to BeginTransactionBlock |
| and EndTransactionBlock and outside these calls they need to do normal start, |
| commit or abort processing. |
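| |
| As a rough illustration of what "state smart" means here, the following |
| standalone C sketch models a small subset of the block states. The Toy* and |
| toy_* names are hypothetical; the real state machine in xact.c has many more |
| states and does real start/commit work at the marked points. |
| |
```c
#include <assert.h>

/* Hypothetical, simplified subset of the transaction block states. */
typedef enum {
    TB_DEFAULT,     /* idle, no transaction open */
    TB_STARTED,     /* running a single-query transaction */
    TB_BEGIN,       /* BEGIN was just processed */
    TB_INPROGRESS,  /* inside a transaction block */
    TB_END          /* COMMIT was processed, commit still pending */
} ToyBlockState;

typedef struct {
    ToyBlockState state;
    int command_counter;  /* stands in for CommandCounterIncrement */
    int commits;          /* stands in for calls to CommitTransaction */
} ToyXact;

/* Model of StartTransactionCommand: only starts work when idle. */
void toy_start_command(ToyXact *x)
{
    if (x->state == TB_DEFAULT) {
        x->state = TB_STARTED;      /* StartTransaction would run here */
        x->command_counter = 0;
    }
    /* Inside a block there is nothing to start. */
}

/* Model of BeginTransactionBlock (the BEGIN utility command). */
void toy_begin_block(ToyXact *x)
{
    assert(x->state == TB_STARTED);
    x->state = TB_BEGIN;
}

/* Model of EndTransactionBlock (the COMMIT utility command). */
void toy_end_block(ToyXact *x)
{
    assert(x->state == TB_INPROGRESS);
    x->state = TB_END;
}

/* Model of CommitTransactionCommand: behavior depends on the state. */
void toy_commit_command(ToyXact *x)
{
    switch (x->state) {
    case TB_STARTED:    /* single query outside a block: really commit */
    case TB_END:        /* COMMIT seen: really commit now */
        x->commits++;
        x->state = TB_DEFAULT;
        break;
    case TB_BEGIN:      /* BEGIN done: we are now inside the block */
        x->state = TB_INPROGRESS;
        break;
    case TB_INPROGRESS: /* ordinary query inside the block */
        x->command_counter++;
        break;
    case TB_DEFAULT:    /* nothing open, nothing to do */
        break;
    }
}
```
| |
| Driving this model with the BEGIN / SELECT / INSERT / COMMIT sequence above |
| increments the toy command counter twice and commits exactly once. |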
| |
| Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In |
| this case AbortCurrentTransaction is called, and the transaction is put in |
| aborted state. In this state, any user input is ignored except for |
| transaction-termination statements, or ROLLBACK TO <savepoint> commands. |
| |
| Transaction aborts can occur in two ways: |
| |
| 1) system dies from some internal cause (syntax error, etc) |
| 2) user types ROLLBACK |
| |
| The reason we have to distinguish them is illustrated by the following two |
| situations: |
| |
|     case 1                          case 2 |
|     ------                          ------ |
|     1) user types BEGIN             1) user types BEGIN |
|     2) user does something          2) user does something |
|     3) user does not like what      3) system aborts for some reason |
|        she sees and types ABORT        (syntax error, etc) |
| |
| In case 1, we want to abort the transaction and return to the default state. |
| In case 2, there may be more commands coming our way which are part of the |
| same transaction block; we have to ignore these commands until we see a COMMIT |
| or ROLLBACK. |
| |
| Internal aborts are handled by AbortCurrentTransaction, while user aborts are |
| handled by UserAbortTransactionBlock. Both of them rely on AbortTransaction |
| to do all the real work. The only difference is what state we enter after |
| AbortTransaction does its work: |
| |
| * AbortCurrentTransaction leaves us in TBLOCK_ABORT, |
| * UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END |
| |
| Low-level transaction abort handling is divided into two phases: |
| * AbortTransaction executes as soon as we realize the transaction has |
| failed. It should release all shared resources (locks etc) so that we do |
| not delay other backends unnecessarily. |
| * CleanupTransaction executes when we finally see a user COMMIT |
| or ROLLBACK command; it cleans things up and gets us out of the transaction |
| completely. In particular, we mustn't destroy TopTransactionContext until |
| this point. |
| |
| Also, note that when a transaction is committed, we don't close it right away. |
| Rather it's put in TBLOCK_END state, which means that when |
| CommitTransactionCommand is called after the query has finished processing, |
| the transaction has to be closed. The distinction is subtle but important, |
| because it means that control will leave the xact.c code with the transaction |
| open, and the main loop will be able to keep processing inside the same |
| transaction. So, in a sense, transaction commit is also handled in two |
| phases, the first at EndTransactionBlock and the second at |
| CommitTransactionCommand (which is where CommitTransaction is actually |
| called). |
| |
| The rest of the code in xact.c consists of routines to support the creation and |
| finishing of transactions and subtransactions. For example, AtStart_Memory |
| takes care of initializing the memory subsystem at main transaction start. |
| |
| |
| Subtransaction Handling |
| ----------------------- |
| |
| Subtransactions are implemented using a stack of TransactionState structures, |
| each of which has a pointer to its parent transaction's struct. When a new |
| subtransaction is to be opened, PushTransaction is called, which creates a new |
| TransactionState, with its parent link pointing to the current transaction. |
| StartSubTransaction is in charge of initializing the new TransactionState to |
| sane values, and properly initializing other subsystems (AtSubStart routines). |
| |
| When closing a subtransaction, either CommitSubTransaction has to be called |
| (if the subtransaction is committing), or AbortSubTransaction and |
| CleanupSubTransaction (if it's aborting). In either case, PopTransaction is |
| called so the system returns to the parent transaction. |
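| |
| The push/pop discipline described above can be sketched as a linked stack. |
| This is a minimal standalone model, not the xact.c code: ToyTransactionState, |
| toy_push and toy_pop are hypothetical names, and the real PushTransaction and |
| PopTransaction manage far more state (memory contexts, resource owners, etc). |
| |
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for TransactionState. */
typedef struct ToyTransactionState {
    int nesting_level;                   /* 1 = top-level transaction */
    struct ToyTransactionState *parent;  /* NULL for the top level */
} ToyTransactionState;

/* Create a new state whose parent link points at the current one. */
ToyTransactionState *toy_push(ToyTransactionState *current)
{
    ToyTransactionState *s = malloc(sizeof(*s));

    if (s == NULL)
        abort();
    s->nesting_level = current ? current->nesting_level + 1 : 1;
    s->parent = current;
    return s;
}

/* Discard the current state and return to its parent. */
ToyTransactionState *toy_pop(ToyTransactionState *current)
{
    ToyTransactionState *parent;

    assert(current != NULL);
    parent = current->parent;
    free(current);
    return parent;
}
```
| |
| Closing several subtransactions for one user command then amounts to calling |
| the pop step repeatedly until the target nesting level is reached. |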
| |
| One important point regarding subtransaction handling is that several may need |
| to be closed in response to a single user command. That's because savepoints |
| have names, and we allow a savepoint to be committed or rolled back by name, which is |
| not necessarily the one that was last opened. Also a COMMIT or ROLLBACK |
| command must be able to close out the entire stack. We handle this by having |
| the utility command subroutine mark all the state stack entries as commit- |
| pending or abort-pending, and then when the main loop reaches |
| CommitTransactionCommand, the real work is done. The main point of doing |
| things this way is that if we get an error while popping state stack entries, |
| the remaining stack entries still show what we need to do to finish up. |
| |
| In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up |
| through the one identified by the savepoint name, and then re-create that |
| subtransaction level with the same name. So it's a completely new |
| subtransaction as far as the internals are concerned. |
| |
| Other subsystems are allowed to start "internal" subtransactions, which are |
| handled by BeginInternalSubTransaction. This is to allow implementing |
| exception handling, e.g. in PL/pgSQL. ReleaseCurrentSubTransaction and |
| RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said |
| subtransactions. The main difference between this and the savepoint/release |
| path is that we execute the complete state transition immediately in each |
| subroutine, rather than deferring some work until CommitTransactionCommand. |
| Another difference is that BeginInternalSubTransaction is allowed when no |
| explicit transaction block has been established, while DefineSavepoint is not. |
| |
| |
| Transaction and Subtransaction Numbering |
| ---------------------------------------- |
| |
| Transactions and subtransactions are assigned permanent XIDs only when/if |
| they first do something that requires one --- typically, insert/update/delete |
| a tuple, though there are a few other places that need an XID assigned. |
| If a subtransaction requires an XID, we always first assign one to its |
| parent. This maintains the invariant that child transactions have XIDs later |
| than their parents, which is assumed in a number of places. |
| |
| The subsidiary actions of obtaining a lock on the XID and entering it into |
| pg_subtrans and PGPROC are done at the time it is assigned. |
| |
| A transaction that has no XID still needs to be identified for various |
| purposes, notably holding locks. For this purpose we assign a "virtual |
| transaction ID" or VXID to each top-level transaction. VXIDs are formed from |
| two fields, the backendID and a backend-local counter; this arrangement allows |
| assignment of a new VXID at transaction start without any contention for |
| shared memory. To ensure that a VXID isn't re-used too soon after backend |
| exit, we store the last local counter value into shared memory at backend |
| exit, and initialize it from the previous value for the same backendID slot |
| at backend start. All these counters go back to zero at shared memory |
| re-initialization, but that's OK because VXIDs never appear anywhere on-disk. |
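| |
| To make the "no contention" point concrete, here is a small standalone model |
| of VXID assignment. The ToyVxid and ToyBackend types and the toy_* functions |
| are hypothetical; field names and widths are assumptions for illustration. |
| |
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-field virtual transaction ID. */
typedef struct {
    int32_t  backend_id;  /* which backend slot owns this VXID */
    uint32_t local_xid;   /* backend-local counter value */
} ToyVxid;

/* Per-backend private state: no shared memory is touched on assignment. */
typedef struct {
    int32_t  backend_id;
    uint32_t next_local_xid;
} ToyBackend;

/* Assign a new VXID using only the backend's own counter. */
ToyVxid toy_assign_vxid(ToyBackend *b)
{
    ToyVxid v;

    v.backend_id = b->backend_id;
    v.local_xid = b->next_local_xid++;
    return v;
}

/* Two VXIDs are the same only if both fields match. */
bool toy_vxid_equal(ToyVxid a, ToyVxid b)
{
    return a.backend_id == b.backend_id && a.local_xid == b.local_xid;
}
```
| |
| Two backends whose local counters happen to hold the same value still produce |
| distinct VXIDs, because the backend ID differs. |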
| |
| Internally, a backend needs a way to identify subtransactions whether or not |
| they have XIDs; but this need only lasts as long as the parent top transaction |
| endures. Therefore, we have SubTransactionId, which is somewhat like |
| CommandId in that it's generated from a counter that we reset at the start of |
| each top transaction. The top-level transaction itself has SubTransactionId 1, |
| and subtransactions have IDs 2 and up. (Zero is reserved for |
| InvalidSubTransactionId.) Note that subtransactions do not have their |
| own VXIDs; they use the parent top transaction's VXID. |
| |
| |
| Interlocking Transaction Begin, Transaction End, and Snapshots |
| -------------------------------------------------------------- |
| |
| We try hard to minimize the amount of overhead and lock contention involved |
| in the frequent activities of beginning/ending a transaction and taking a |
| snapshot. Unfortunately, we must have some interlocking for this, because |
| we must ensure consistency about the commit order of transactions. |
| For example, suppose an UPDATE in xact A is blocked by xact B's prior |
| update of the same row, and xact B is doing commit while xact C gets a |
| snapshot. Xact A can complete and commit as soon as B releases its locks. |
| If xact C's GetSnapshotData sees xact B as still running, then it had |
| better see xact A as still running as well, or it will be able to see two |
| tuple versions - one deleted by xact B and one inserted by xact A. Another |
| reason why this would be bad is that C would see (in the row inserted by A) |
| earlier changes by B, and it would be inconsistent for C not to see any |
| of B's changes elsewhere in the database. |
| |
| Formally, the correctness requirement is "if a snapshot A considers |
| transaction X as committed, and any of transaction X's snapshots considered |
| transaction Y as committed, then snapshot A must consider transaction Y as |
| committed". |
| |
| What we actually enforce is strict serialization of commits and rollbacks |
| with snapshot-taking: we do not allow any transaction to exit the set of |
| running transactions while a snapshot is being taken. (This rule is |
| stronger than necessary for consistency, but is relatively simple to |
| enforce, and it assists with some other issues as explained below.) The |
| implementation of this is that GetSnapshotData takes the ProcArrayLock in |
| shared mode (so that multiple backends can take snapshots in parallel), |
| but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode |
| while clearing the ProcGlobal->xids[] entry at transaction end (either |
| commit or abort). (To reduce context switching, when multiple transactions |
| commit nearly simultaneously, we have one backend take ProcArrayLock and |
| clear the XIDs of multiple processes at once.) |
| |
| ProcArrayEndTransaction also holds the lock while advancing the shared |
| latestCompletedXid variable. This allows GetSnapshotData to use |
| latestCompletedXid + 1 as xmax for its snapshot: there can be no |
| transaction >= this xid value that the snapshot needs to consider as |
| completed. |
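| |
| A much-simplified, single-threaded sketch of how a snapshot could be built |
| from the per-backend XID slots follows. It assumes the caller already holds |
| whatever lock is needed, ignores XID wraparound and subtransactions, and uses |
| hypothetical names (ToySnapshot, toy_get_snapshot, toy_xid_in_progress). |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROCS 8   /* hypothetical limit on backend slots */

typedef struct {
    uint32_t xmin;                /* all XIDs below this are finished */
    uint32_t xmax;                /* all XIDs at or above this are "running" */
    uint32_t xip[TOY_MAX_PROCS];  /* XIDs in progress between xmin and xmax */
    size_t   xcnt;
} ToySnapshot;

/* proc_xids[i] is backend i's XID, or 0 if it has none assigned. */
ToySnapshot toy_get_snapshot(const uint32_t *proc_xids, size_t nprocs,
                             uint32_t latest_completed_xid)
{
    ToySnapshot s;
    size_t i;

    assert(nprocs <= TOY_MAX_PROCS);
    s.xmax = latest_completed_xid + 1;
    s.xmin = s.xmax;
    s.xcnt = 0;
    for (i = 0; i < nprocs; i++) {
        uint32_t xid = proc_xids[i];   /* fetch each slot exactly once */

        if (xid == 0 || xid >= s.xmax)
            continue;                  /* no XID, or covered by xmax */
        if (xid < s.xmin)
            s.xmin = xid;
        s.xip[s.xcnt++] = xid;
    }
    return s;
}

/* Does this snapshot treat xid as still running? */
bool toy_xid_in_progress(const ToySnapshot *s, uint32_t xid)
{
    size_t i;

    if (xid >= s->xmax)
        return true;
    if (xid < s->xmin)
        return false;
    for (i = 0; i < s->xcnt; i++)
        if (s->xip[i] == xid)
            return true;
    return false;
}
```
| |
| With slots {0, 105, 0, 102} and latestCompletedXid = 107, this model yields |
| xmin = 102 and xmax = 108, and treats 102 and 105 as in progress. |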
| |
| In short, then, the rule is that no transaction may exit the set of |
| currently-running transactions between the time we fetch latestCompletedXid |
| and the time we finish building our snapshot. However, this restriction |
| only applies to transactions that have an XID --- read-only transactions |
| can end without acquiring ProcArrayLock, since they don't affect anyone |
| else's snapshot nor latestCompletedXid. |
| |
| Transaction start, per se, doesn't have any interlocking with these |
| considerations, since we no longer assign an XID immediately at transaction |
| start. But when we do decide to allocate an XID, GetNewTransactionId must |
| store the new XID into the shared ProcArray before releasing XidGenLock. |
| This ensures that all top-level XIDs <= latestCompletedXid are either |
| present in the ProcArray, or not running anymore. (This guarantee doesn't |
| apply to subtransaction XIDs, because of the possibility that there's not |
| room for them in the subxid array; instead we guarantee that they are |
| present or the overflow flag is set.) If a backend released XidGenLock |
| before storing its XID into ProcGlobal->xids[], then it would be possible for |
| another backend to allocate and commit a later XID, causing latestCompletedXid |
| to pass the first backend's XID, before that value became visible in the |
| ProcArray. That would break ComputeXidHorizons, as discussed below. |
| |
| We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the |
| subxid array) without taking ProcArrayLock. This was once necessary to |
| avoid deadlock; while that is no longer the case, it's still beneficial for |
| performance. We are thereby relying on fetch/store of an XID to be atomic, |
| else other backends might see a partially-set XID. This also means that |
| readers of the ProcArray xid fields must be careful to fetch a value only |
| once, rather than assume they can read it multiple times and get the same |
| answer each time. (Use volatile-qualified pointers when doing this, to |
| ensure that the C compiler does exactly what you tell it to.) |
| |
| Another important activity that uses the shared ProcArray is |
| ComputeXidHorizons, which must determine a lower bound for the oldest xmin |
| of any active MVCC snapshot, system-wide. Each individual backend |
| advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero |
| if it currently has no live snapshots (eg, if it's between transactions or |
| hasn't yet set a snapshot for a new transaction). ComputeXidHorizons takes |
| the MIN() of the valid xmin fields. It does this with only shared lock on |
| ProcArrayLock, which means there is a potential race condition against other |
| backends doing GetSnapshotData concurrently: we must be certain that a |
| concurrent backend that is about to set its xmin does not compute an xmin |
| less than what ComputeXidHorizons determines. We ensure that by including |
| all the active XIDs into the MIN() calculation, along with the valid xmins. |
| The rule that transactions can't exit without taking exclusive ProcArrayLock |
| ensures that concurrent holders of shared ProcArrayLock will compute the |
| same minimum of currently-active XIDs: no xact, in particular not the |
| oldest, can exit while we hold shared ProcArrayLock. So |
| ComputeXidHorizons's view of the minimum active XID will be the same as that |
| of any concurrent GetSnapshotData, and so it can't produce an overestimate. |
| If there is no active transaction at all, ComputeXidHorizons uses |
| latestCompletedXid + 1, which is a lower bound for the xmin that might |
| be computed by concurrent or later GetSnapshotData calls. (We know that no |
| XID less than this could be about to appear in the ProcArray, because of the |
| XidGenLock interlock discussed above.) |
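| |
| The MIN() computation described above can be sketched as follows. This is a |
| simplified model only: ToyProc and toy_oldest_xmin are hypothetical names, |
| XID wraparound is ignored, and the real ComputeXidHorizons computes several |
| horizons rather than one. |
| |
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-backend entry; 0 means "not set" for either field. */
typedef struct {
    uint32_t xid;   /* backend's own XID, if assigned */
    uint32_t xmin;  /* smallest xmin of the backend's live snapshots */
} ToyProc;

/*
 * Lower bound for the oldest xmin system-wide.  Both the advertised xmins
 * and the active XIDs go into the minimum, so a backend that is about to
 * set its xmin cannot end up with a smaller value than we return.
 */
uint32_t toy_oldest_xmin(const ToyProc *procs, size_t nprocs,
                         uint32_t latest_completed_xid)
{
    uint32_t result = latest_completed_xid + 1;  /* used if nothing is active */
    size_t i;

    for (i = 0; i < nprocs; i++) {
        uint32_t xid = procs[i].xid;    /* fetch each field once */
        uint32_t xmin = procs[i].xmin;

        if (xid != 0 && xid < result)
            result = xid;
        if (xmin != 0 && xmin < result)
            result = xmin;
    }
    return result;
}
```
| |
| With no active backends the function returns latestCompletedXid + 1, matching |
| the fallback described in the text. |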
| |
| As GetSnapshotData is performance critical, it does not perform an accurate |
| oldest-xmin calculation (it used to, until v14). The contents of a snapshot |
| only depend on the xids of other backends, not their xmin. As a backend's xmin |
| changes much more often than its xid, having GetSnapshotData look at xmins |
| can lead to a lot of unnecessary cacheline ping-pong. Instead |
| GetSnapshotData updates approximate thresholds (one that guarantees that all |
| deleted rows older than it can be removed, another determining that deleted |
| rows newer than it cannot be removed). The GlobalVisTest* routines use those |
| thresholds to make invisibility decisions, falling back to ComputeXidHorizons |
| if necessary. |
| |
| Note that while it is certain that two concurrent executions of |
| GetSnapshotData will compute the same xmin for their own snapshots, there is |
| no such guarantee for the horizons computed by ComputeXidHorizons. This is |
| because we allow XID-less transactions to clear their MyProc->xmin |
| asynchronously (without taking ProcArrayLock), so one execution might see |
| what had been the oldest xmin, and another not. This is OK since the |
| thresholds need only be a valid lower bound. As noted above, we are already |
| assuming that fetch/store of the xid fields is atomic, so assuming it for |
| xmin as well is no extra risk. |
| |
| |
| pg_xact and pg_subtrans |
| ----------------------- |
| |
| pg_xact and pg_subtrans are permanent (on-disk) storage of transaction-related |
| information. There is a limited number of pages of each kept in memory, so |
| in many cases there is no need to actually read from disk. However, if |
| there's a long running transaction or a backend sitting idle with an open |
| transaction, it may be necessary to be able to read and write this information |
| from disk. They also allow information to be permanent across server restarts. |
| |
| pg_xact records the commit status for each transaction that has been assigned |
| an XID. A transaction can be in progress, committed, aborted, or |
| "sub-committed". This last state means that it's a subtransaction that's no |
| longer running, but its parent has not updated its state yet. It is not |
| necessary to update a subtransaction's transaction status to subcommit, so we |
| can just defer it until main transaction commit. The main role of marking |
| transactions as sub-committed is to provide an atomic commit protocol when |
| transaction status is spread across multiple clog pages. As a result, whenever |
| transaction status spreads across multiple pages we must use a two-phase commit |
| protocol: the first phase is to mark the subtransactions as sub-committed, then |
| we mark the top level transaction and all its subtransactions committed (in |
| that order). Thus, subtransactions that have not aborted appear as in-progress |
| even when they have already finished, and the subcommit status appears as a |
| very short transitory state during main transaction commit. Subtransaction |
| abort is always marked in clog as soon as it occurs. When the transaction |
| statuses all fit in a single CLOG page, we atomically mark them all as committed |
| without bothering with the intermediate sub-commit state. |
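| |
| The page arithmetic behind the "single page versus multiple pages" decision |
| can be sketched as below. The constants (two status bits per transaction, an |
| 8192-byte page) and the numeric status codes are assumptions made for this |
| example, and all toy_*/TOY_* names are hypothetical. |
| |
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_PER_XACT   2
#define TOY_XACTS_PER_BYTE  4
#define TOY_PAGE_SIZE       8192   /* assumed block size */
#define TOY_XACTS_PER_PAGE  (TOY_PAGE_SIZE * TOY_XACTS_PER_BYTE)

/* Assumed encoding of the four states named in the text. */
enum {
    TOY_IN_PROGRESS   = 0,
    TOY_COMMITTED     = 1,
    TOY_ABORTED       = 2,
    TOY_SUB_COMMITTED = 3
};

/* Which status page holds this XID? */
uint32_t toy_xid_to_page(uint32_t xid)
{
    return xid / TOY_XACTS_PER_PAGE;
}

/* True if the whole transaction tree can be marked in one page update. */
bool toy_all_on_one_page(uint32_t top_xid, const uint32_t *subxids, size_t n)
{
    uint32_t page = toy_xid_to_page(top_xid);
    size_t i;

    for (i = 0; i < n; i++)
        if (toy_xid_to_page(subxids[i]) != page)
            return false;
    return true;
}

/* Store a 2-bit status for xid within the page that holds it. */
void toy_set_status(uint8_t *page, uint32_t xid, unsigned status)
{
    uint32_t slot = xid % TOY_XACTS_PER_PAGE;
    size_t byteno = slot / TOY_XACTS_PER_BYTE;
    unsigned shift = (slot % TOY_XACTS_PER_BYTE) * TOY_BITS_PER_XACT;

    assert(status <= 3);
    page[byteno] = (uint8_t) ((page[byteno] & ~(0x3u << shift)) |
                              (status << shift));
}

/* Read back the 2-bit status for xid. */
unsigned toy_get_status(const uint8_t *page, uint32_t xid)
{
    uint32_t slot = xid % TOY_XACTS_PER_PAGE;
    size_t byteno = slot / TOY_XACTS_PER_BYTE;
    unsigned shift = (slot % TOY_XACTS_PER_BYTE) * TOY_BITS_PER_XACT;

    return (page[byteno] >> shift) & 0x3u;
}
```
| |
| When toy_all_on_one_page() is false, the sub-committed state is what lets the |
| pages be updated one at a time without exposing a half-committed tree. |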
| |
| Savepoints are implemented using subtransactions. A subtransaction is a |
| transaction inside a transaction; its commit or abort status is not only |
| dependent on whether it committed itself, but also whether its parent |
| transaction committed. To implement multiple savepoints in a transaction we |
| allow unlimited transaction nesting depth, so any particular subtransaction's |
| commit state is dependent on the commit status of each and every ancestor |
| transaction. |
| |
| The "subtransaction parent" (pg_subtrans) mechanism records, for each |
| transaction with an XID, the TransactionId of its parent transaction. This |
| information is stored as soon as the subtransaction is assigned an XID. |
| Top-level transactions do not have a parent, so they leave their pg_subtrans |
| entries set to the default value of zero (InvalidTransactionId). |
| |
| pg_subtrans is used to check whether the transaction in question is still |
| running --- the main Xid of a transaction is recorded in ProcGlobal->xids[], |
| with a copy in PGPROC->xid, but since we allow arbitrary nesting of |
| subtransactions, we can't fit all Xids in shared memory, so we have to store |
| them on disk. Note, however, that for each transaction we keep a "cache" of |
| Xids that are known to be part of the transaction tree, so we can skip looking |
| at pg_subtrans unless we know the cache has been overflowed. See |
| storage/ipc/procarray.c for the gory details. |
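| |
| Conceptually, finding the top-level transaction of a subtransaction is a walk |
| up the parent links. The sketch below models pg_subtrans as a plain in-memory |
| array indexed by XID, which is an assumption for illustration only; |
| toy_topmost_xid is a hypothetical name. |
| |
```c
#include <stdint.h>

/*
 * parent_of[xid] holds the parent XID, or 0 for a top-level transaction.
 * The caller must ensure xid is a valid index.  The loop terminates
 * because a child's XID is always later than its parent's.
 */
uint32_t toy_topmost_xid(const uint32_t *parent_of, uint32_t xid)
{
    while (parent_of[xid] != 0)
        xid = parent_of[xid];
    return xid;
}
```
| |
| For a chain 12 -> 11 -> 10, where 10 is top-level, the walk from 12 ends at |
| 10. |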
| |
| slru.c is the supporting mechanism for both pg_xact and pg_subtrans. It |
| implements the LRU policy for in-memory buffer pages. The high-level routines |
| for pg_xact are implemented in transam.c, while the low-level functions are in |
| clog.c. pg_subtrans is contained completely in subtrans.c. |
| |
| |
| Write-Ahead Log Coding |
| ---------------------- |
| |
| The WAL subsystem (also called XLOG in the code) exists to guarantee crash |
| recovery. It can also be used to provide point-in-time recovery, as well as |
| hot-standby replication via log shipping. Here are some notes about |
| non-obvious aspects of its design. |
| |
| A basic assumption of a write AHEAD log is that log entries must reach stable |
| storage before the data-page changes they describe. This ensures that |
| replaying the log to its end will bring us to a consistent state where there |
| are no partially-performed transactions. To guarantee this, each data page |
| (either heap or index) is marked with the LSN (log sequence number --- in |
| practice, a WAL file location) of the latest XLOG record affecting the page. |
| Before the bufmgr can write out a dirty page, it must ensure that xlog has |
| been flushed to disk at least up to the page's LSN. This low-level |
| interaction improves performance by not waiting for XLOG I/O until necessary. |
| The LSN check exists only in the shared-buffer manager, not in the local |
| buffer manager used for temp tables; hence operations on temp tables must not |
| be WAL-logged. |
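| |
| The WAL-before-data rule can be illustrated with a tiny standalone model of a |
| page write. Everything here is hypothetical (ToyWal, ToyPage, toy_wal_flush, |
| toy_write_page); it only shows the ordering constraint, not real I/O. |
| |
```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef struct {
    ToyLsn flushed_upto;  /* WAL is durable up to this location */
    int    flush_calls;   /* how many real flushes were needed */
} ToyWal;

typedef struct {
    ToyLsn lsn;    /* LSN of the latest WAL record affecting the page */
    bool   dirty;
} ToyPage;

/* Flush WAL up to the given location, if it is not already durable. */
void toy_wal_flush(ToyWal *wal, ToyLsn upto)
{
    if (upto > wal->flushed_upto) {
        wal->flushed_upto = upto;
        wal->flush_calls++;
    }
}

/* Write out a dirty page, flushing WAL first as far as the page's LSN. */
void toy_write_page(ToyWal *wal, ToyPage *page)
{
    if (!page->dirty)
        return;
    toy_wal_flush(wal, page->lsn);   /* log must reach disk first */
    /* ... the data page itself would be written here ... */
    page->dirty = false;
}
```
| |
| Note that a page whose LSN is already covered by an earlier flush costs no |
| additional WAL I/O, which is the performance benefit mentioned above. |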
| |
| During WAL replay, we can check the LSN of a page to detect whether the change |
| recorded by the current log entry is already applied (it has been, if the page |
| LSN is >= the log entry's WAL location). |
| |
| Usually, log entries contain just enough information to redo a single |
| incremental update on a page (or small group of pages). This will work only |
| if the filesystem and hardware implement data page writes as atomic actions, |
| so that a page is never left in a corrupt partly-written state. Since that's |
| often an untenable assumption in practice, we log additional information to |
| allow complete reconstruction of modified pages. The first WAL record |
| affecting a given page after a checkpoint is made to contain a copy of the |
| entire page, and we implement replay by restoring that page copy instead of |
| redoing the update. (This is more reliable than the data storage itself would |
| be because we can check the validity of the WAL record's CRC.) We can detect |
| the "first change after checkpoint" by noting whether the page's old LSN |
| precedes the end of WAL as of the last checkpoint (the RedoRecPtr). |
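| |
| The two LSN comparisons described in this section can be written down as a |
| pair of small predicates. This is a sketch with hypothetical names; treating |
| an LSN equal to the redo pointer as "before the checkpoint" is an assumption |
| of this example. |
| |
```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/*
 * Insert side: is this the first change to the page since the last
 * checkpoint?  If so, the record should carry a full-page image.
 */
bool toy_needs_full_page_image(ToyLsn page_lsn, ToyLsn redo_rec_ptr)
{
    return page_lsn <= redo_rec_ptr;
}

/*
 * Replay side: the change is already on the page if the page LSN is at
 * or beyond the record's location, so redo only when it is older.
 */
bool toy_redo_should_apply(ToyLsn page_lsn, ToyLsn record_lsn)
{
    return page_lsn < record_lsn;
}
```
| |
| Both checks rely on the same invariant: the page LSN always names the latest |
| WAL record whose effects the page contains. |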
| |
| The general schema for executing a WAL-logged action is |
| |
| 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s) |
| to be modified. |
| |
| 2. START_CRIT_SECTION() (Any error during the next three steps must cause a |
| PANIC because the shared buffers will contain unlogged changes, which we |
| have to ensure don't get to disk. Obviously, you should check conditions |
| such as whether there's enough free space on the page before you start the |
| critical section.) |
| |
| 3. Apply the required changes to the shared buffer(s). |
| |
| 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This must |
| happen before the WAL record is inserted; see notes in SyncOneBuffer().) |
| Note that marking a buffer dirty with MarkBufferDirty() must happen if and |
| only if you write a WAL record; see Writing Hints below. |
| |
| 5. If the relation requires WAL-logging, build a WAL record using |
| XLogBeginInsert and XLogRegister* functions, and insert it. (See |
| "Constructing a WAL record" below). Then update the page's LSN using the |
| returned XLOG location. For instance, |
| |
| XLogBeginInsert(); |
| XLogRegisterBuffer(...); |
| XLogRegisterData(...); |
| recptr = XLogInsert(rmgr_id, info); |
| |
| PageSetLSN(dp, recptr); |
| |
| 6. END_CRIT_SECTION() |
| |
| 7. Unlock and unpin the buffer(s). |
| |
| Complex changes (such as a multilevel index insertion) normally need to be |
| described by a series of atomic-action WAL records. The intermediate states |
| must be self-consistent, so that if the replay is interrupted between any |
| two actions, the system is fully functional. In btree indexes, for example, |
| a page split requires a new page to be allocated, and an insertion of a new |
| key in the parent btree level, but for locking reasons this has to be |
| reflected by two separate WAL records. Replaying the first record, to |
| allocate the new page and move tuples to it, sets a flag on the page to |
| indicate that the key has not been inserted to the parent yet. Replaying the |
| second record clears the flag. This intermediate state is never seen by |
| other backends during normal operation, because the lock on the child page |
| is held across the two actions, but will be seen if the operation is |
| interrupted before writing the second WAL record. The search algorithm works |
| with the intermediate state as normal, but if an insertion encounters a page |
| with the incomplete-split flag set, it will finish the interrupted split by |
| inserting the key to the parent, before proceeding. |
| |
| |
| Constructing a WAL record |
| ------------------------- |
| |
| A WAL record consists of a header common to all WAL record types, |
| record-specific data, and information about the data blocks modified. Each |
| modified data block is identified by an ID number, and can optionally have |
| more record-specific data associated with the block. If XLogInsert decides |
| that a full-page image of a block needs to be taken, the data associated |
| with that block is not included. |
| |
| The API for constructing a WAL record consists of five functions: |
| XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData, |
| and XLogInsert. First, call XLogBeginInsert(). Then register all the buffers |
| modified, and data needed to replay the changes, using XLogRegister* |
| functions. Finally, insert the constructed record to the WAL by calling |
| XLogInsert(). |
| |
| XLogBeginInsert(); |
| |
| /* register buffers modified as part of this WAL-logged action */ |
| XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD); |
| XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD); |
| |
| /* register data that is always included in the WAL record */ |
| XLogRegisterData((char *) &xlrec, SizeOfFictionalAction); |
| |
| /* |
| * register data associated with a buffer. This will not be included |
| * in the record if a full-page image is taken. |
| */ |
| XLogRegisterBufData(0, tuple->data, tuple->len); |
| |
| /* more data associated with the buffer */ |
| XLogRegisterBufData(0, data2, len2); |
| |
| /* |
| * Ok, all the data and buffers to include in the WAL record have |
| * been registered. Insert the record. |
| */ |
| recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF); |
| |
| Details of the API functions: |
| |
| void XLogBeginInsert(void) |
| |
| Must be called before XLogRegisterBuffer and XLogRegisterData. |
| |
| void XLogResetInsertion(void) |
| |
| Clear any currently registered data and buffers from the WAL record |
| construction workspace. This is only needed if you have already called |
| XLogBeginInsert(), but decide to not insert the record after all. |
| |
| void XLogEnsureRecordSpace(int max_block_id, int ndatas) |
| |
| Normally, the WAL record construction buffers have the following limits: |
| |
| * highest block ID that can be used is 4 (allowing five block references) |
| * Max 20 chunks of registered data |
| |
| These default limits are enough for most record types that change some |
| on-disk structures. For the odd case that requires more data, or needs to |
| modify more buffers, these limits can be raised by calling |
| XLogEnsureRecordSpace(). XLogEnsureRecordSpace() must be called before |
| XLogBeginInsert(), and outside a critical section. |
| |
| void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags); |
| |
| XLogRegisterBuffer adds information about a data block to the WAL record. |
| block_id is an arbitrary number used to identify this page reference in |
| the redo routine. The information needed to re-find the page at redo - |
| relfilenode, fork, and block number - is included in the WAL record. |
| |
| XLogInsert will automatically include a full copy of the page contents, if |
| this is the first modification of the buffer since the last checkpoint. |
| It is important to register every buffer modified by the action with |
| XLogRegisterBuffer, to avoid torn-page hazards. |
| |
| The flags control when and how the buffer contents are included in the |
| WAL record. Normally, a full-page image is taken only if the page has not |
| been modified since the last checkpoint, and only if full_page_writes=on |
| or an online backup is in progress. The REGBUF_FORCE_IMAGE flag can be |
| used to force a full-page image to always be included; that is useful |
| e.g. for an operation that rewrites most of the page, so that tracking the |
| details is not worth it. For the rare case where it is not necessary to |
| protect from torn pages, the REGBUF_NO_IMAGE flag can be used to prevent a |
| full-page image from being taken. REGBUF_WILL_INIT also suppresses a full |
| page image, but the redo routine must re-generate the page from scratch, |
| without looking at the old page contents. Re-initializing the page |
| protects from torn page hazards like a full page image does. |
| |
| The REGBUF_STANDARD flag can be specified together with the other flags to |
| indicate that the page follows the standard page layout. It causes the |
| area between pd_lower and pd_upper to be left out from the image, reducing |
| WAL volume. |
| |
| If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with |
| XLogRegisterBufData() is included in the WAL record even if a full-page |
| image is taken. |
| |
| void XLogRegisterData(char *data, int len); |
| |
| XLogRegisterData is used to include arbitrary data in the WAL record. If |
| XLogRegisterData() is called multiple times, the data are appended, and |
| will be made available to the redo routine as one contiguous chunk. |
| |
| void XLogRegisterBufData(uint8 block_id, char *data, int len); |
| |
| XLogRegisterBufData is used to include data associated with a particular |
| buffer that was registered earlier with XLogRegisterBuffer(). If |
| XLogRegisterBufData() is called multiple times with the same block ID, the |
| data are appended, and will be made available to the redo routine as one |
| contiguous chunk. |
| |
| If a full-page image of the buffer is taken at insertion, the data is not |
| included in the WAL record, unless the REGBUF_KEEP_DATA flag is used. |
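
The append-and-maybe-drop behaviour of per-buffer data can be modelled in a
few lines. This is a toy model, assuming a fixed-size area per buffer; the
struct and function names are hypothetical and do not correspond to the
real XLogRegisterBufData() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX_DATA 256         /* arbitrary limit for this sketch */

/* Toy model of the data attached to one registered buffer. */
struct sk_buf_data
{
    char    bytes[SK_MAX_DATA];
    size_t  len;
};

/* Append one chunk; repeated calls build a single contiguous area. */
void
sk_register_buf_data(struct sk_buf_data *bd, const char *data, size_t len)
{
    assert(bd->len + len <= SK_MAX_DATA);
    memcpy(bd->bytes + bd->len, data, len);
    bd->len += len;
}

/*
 * Number of per-buffer data bytes that end up in the record: dropped when
 * a full-page image is taken, unless the caller asked to keep the data.
 */
size_t
sk_data_bytes_in_record(const struct sk_buf_data *bd, bool image_taken,
                        bool keep_data)
{
    if (image_taken && !keep_data)
        return 0;
    return bd->len;
}
```

Registering "abc" and then "de" yields the five contiguous bytes "abcde";
they are dropped from the record only when an image is taken and keep_data
is false.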
| |
| |
| Writing a REDO routine |
| ---------------------- |
| |
| A REDO routine uses the data and page references included in the WAL record |
| to reconstruct the new state of the page. The record decoding functions |
| and macros in xlogreader.c/h can be used to extract the data from the record. |
| |
| When replaying a WAL record that describes changes on multiple pages, you |
| must be careful to lock the pages properly to prevent concurrent Hot Standby |
| queries from seeing an inconsistent state. If this requires that two |
| or more buffer locks be held concurrently, you must lock the pages in |
| appropriate order, and not release the locks until all the changes are done. |
| |
| Note that we must only use PageSetLSN()/PageGetLSN() when we know the action |
| is serialised. Only the Startup process may modify data blocks during |
| recovery, so the Startup process may execute PageGetLSN() without fear of |
| serialisation problems. All other processes must only call PageSetLSN() or |
| PageGetLSN() when holding either an exclusive buffer lock or a shared lock |
| plus buffer header lock, or when writing the data block directly rather than |
| through shared buffers while holding AccessExclusiveLock on the relation. |
| |
| |
| Writing Hints |
| ------------- |
| |
| In some cases, we write additional information to data blocks without |
| writing a preceding WAL record. This should happen only if the data can |
| be reconstructed later following a crash and the action is simply a way |
| of optimising for performance. When a hint is written we use |
| MarkBufferDirtyHint() to mark the block dirty. |
| |
| If the buffer is clean and checksums are in use then MarkBufferDirtyHint() |
| inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image |
| that includes the hint. We do this to guard against a partial page write when |
| the dirtied page is written out. WAL is not written during recovery, so we |
| simply skip dirtying blocks because of hints when in recovery. |
| |
| If you do decide to optimise away a WAL record, then any calls to |
| MarkBufferDirty() must be replaced by MarkBufferDirtyHint(); |
| otherwise you risk partial page writes. |
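
The hint-writing rules in this section reduce to a three-way decision. The
sketch below is a simplified model of the text only; the enum, the function
name, and the three boolean inputs are hypothetical, and the real
MarkBufferDirtyHint() considers more state than is shown here.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_hint_action
{
    SK_HINT_SKIP,               /* leave the buffer alone */
    SK_HINT_DIRTY,              /* mark dirty, no WAL record */
    SK_HINT_DIRTY_WITH_FPI      /* log a full-page image, then mark dirty */
};

/*
 * Decide what to do when a hint is written to a buffer. Per the text, a
 * full-page image is only called for when the buffer is clean and
 * checksums are in use; in recovery no WAL can be written, so that case
 * is skipped instead.
 */
enum sk_hint_action
sk_hint_decision(bool buffer_dirty, bool checksums, bool in_recovery)
{
    if (buffer_dirty || !checksums)
        return SK_HINT_DIRTY;
    if (in_recovery)
        return SK_HINT_SKIP;
    return SK_HINT_DIRTY_WITH_FPI;
}
```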
| |
| In GPDB, the gp_disable_tuple_hints GUC controls whether a hint bit change |
| marks a buffer dirty. If the GUC is on, hint bit changes do not mark a buffer |
| dirty. |
| |
| |
| Write-Ahead Logging for Filesystem Actions |
| ------------------------------------------ |
| |
| The previous section described how to WAL-log actions that only change page |
| contents within shared buffers. For that type of action it is generally |
| possible to check all likely error cases (such as insufficient space on the |
| page) before beginning to make the actual change. Therefore we can make |
| the change and the creation of the associated WAL log record "atomic" by |
| wrapping them into a critical section --- the odds of failure partway |
| through are low enough that PANIC is acceptable if it does happen. |
| |
| Clearly, that approach doesn't work for cases where there's a significant |
| probability of failure within the action to be logged, such as creation |
| of a new file or database. We don't want to PANIC, and we especially don't |
| want to PANIC after having already written a WAL record that says we did |
| the action --- if we did, replay of the record would probably fail again |
| and PANIC again, making the failure unrecoverable. This means that the |
| ordinary WAL rule of "write WAL before the changes it describes" doesn't |
| work, and we need a different design for such cases. |
| |
| There are several basic types of filesystem actions that have this |
| issue. Here is how we deal with each: |
| |
| 1. Adding a disk page to an existing table. |
| |
| This action isn't WAL-logged at all. We extend a table by writing a page |
| of zeroes at its end. We must actually do this write so that we are sure |
| the filesystem has allocated the space. If the write fails we can just |
| error out normally. Once the space is known allocated, we can initialize |
| and fill the page via one or more normal WAL-logged actions. Because it's |
| possible that we crash between extending the file and writing out the WAL |
| entries, we have to treat discovery of an all-zeroes page in a table or |
| index as being a non-error condition. In such cases we can just reclaim |
| the space for re-use. |
| |
| 2. Creating a new table, which requires a new file in the filesystem. |
| |
| We try to create the file, and if successful we make a WAL record saying |
| we did it. If not successful, we can just throw an error. Notice that |
| there is a window where we have created the file but not yet written any |
| WAL about it to disk. If we crash during this window, the file remains |
| on disk as an "orphan". It would be possible to clean up such orphans |
| by having database restart search for files that don't have any committed |
| entry in pg_class, but that currently isn't done because of the possibility |
| of deleting data that is useful for forensic analysis of the crash. |
| Orphan files are harmless --- at worst they waste a bit of disk space --- |
| because we check for on-disk collisions when allocating new relfilenode |
| OIDs. So cleaning up isn't really necessary. |
| |
| 3. Deleting a table, which requires an unlink() that could fail. |
| |
| Our approach here is to WAL-log the operation first, but to treat failure |
| of the actual unlink() call as a warning rather than error condition. |
| Again, this can leave an orphan file behind, but that's cheap compared to |
| the alternatives. Since we can't actually do the unlink() until after |
| we've committed the DROP TABLE transaction, throwing an error would be out |
| of the question anyway. (It may be worth noting that the WAL entry about |
| the file deletion is actually part of the commit record for the dropping |
| transaction.) |
| |
| 4. Creating and deleting databases and tablespaces, which requires creating |
| and deleting directories and entire directory trees. |
| |
| These cases are handled similarly to creating individual files, ie, we |
| try to do the action first and then write a WAL entry if it succeeded. |
| The potential amount of wasted disk space is rather larger, of course. |
| In the creation case we try to delete the directory tree again if creation |
| fails, so as to reduce the risk of wasted space. Failure partway through |
| a deletion operation results in a corrupt database: the DROP failed, but |
| some of the data is gone anyway. There is little we can do about that, |
| though, and in any case it was presumably data the user no longer wants. |
| |
| In all of these cases, if WAL replay fails to redo the original action |
| we must panic and abort recovery. The DBA will have to manually clean up |
| (for instance, free up some disk space or fix directory permissions) and |
| then restart recovery. This is part of the reason for not writing a WAL |
| entry until we've successfully done the original action. |
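
The two orderings used above (act first and log on success, versus log
first and downgrade failure to a warning) can be contrasted in a small
sketch. Everything here is illustrative: the callback type, the outcome
struct, and the two demo actions are invented stand-ins for real
filesystem calls and real WAL insertion.

```c
#include <assert.h>
#include <stdbool.h>

/* A filesystem action that reports success (true) or failure (false). */
typedef bool (*sk_fs_action) (void);

struct sk_outcome
{
    int     wal_records;        /* WAL records written */
    int     warnings;           /* warnings emitted */
    bool    error;              /* ordinary error raised to the caller */
};

/* Demo actions standing in for creat()/unlink() and friends. */
bool sk_action_ok(void)   { return true; }
bool sk_action_fail(void) { return false; }

/* Cases 2 and 4: try the action first, write WAL only if it succeeded. */
struct sk_outcome
sk_act_then_log(sk_fs_action action)
{
    struct sk_outcome out = {0, 0, false};

    if (!action())
    {
        out.error = true;       /* nothing was logged, so no PANIC on replay */
        return out;
    }
    out.wal_records = 1;
    return out;
}

/* Case 3: WAL-log first, then treat a failed unlink() as a warning. */
struct sk_outcome
sk_log_then_unlink(sk_fs_action do_unlink)
{
    struct sk_outcome out = {0, 0, false};

    out.wal_records = 1;
    if (!do_unlink())
        out.warnings = 1;       /* orphan file left behind */
    return out;
}
```

A failed creation leaves no WAL record behind, while a failed deletion
leaves one WAL record and one warning.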
| |
| |
| Skipping WAL for New RelFileNode |
| -------------------------------- |
| |
| Under wal_level=minimal, if a change modifies a relfilenode that ROLLBACK |
| would unlink, in-tree access methods write no WAL for that change. Code that |
| writes WAL without calling RelationNeedsWAL() must check for this case. This |
| skipping is mandatory. If a WAL-writing change preceded a WAL-skipping change |
| for the same block, REDO could overwrite the WAL-skipping change. If a |
| WAL-writing change followed a WAL-skipping change for the same block, a |
| related problem would arise. When a WAL record contains no full-page image, |
| REDO expects the page to match its contents from just before record insertion. |
| A WAL-skipping change may not reach disk at all, violating REDO's expectation |
| under full_page_writes=off. For any access method, CommitTransaction() writes |
| and fsyncs affected blocks before recording the commit. |
| |
| Prefer to do the same in future access methods. However, two other approaches |
| can work. First, an access method can irreversibly transition a given fork |
| from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and |
| smgrimmedsync(). Second, an access method can opt to write WAL |
| unconditionally for permanent relations. Under these approaches, the access |
| method callbacks must not call functions that react to RelationNeedsWAL(). |
| |
| This applies only to WAL records whose replay would modify bytes stored in the |
| new relfilenode. It does not apply to other records about the relfilenode, |
| such as XLOG_SMGR_CREATE. Because it operates at the level of individual |
| relfilenodes, RelationNeedsWAL() can differ for tightly-coupled relations. |
| Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which |
| ALTER TABLE adds a TOAST relation. The TOAST relation will skip WAL, while |
| the table owning it will not. ALTER TABLE SET TABLESPACE will cause a table |
| to skip WAL, but that won't affect its indexes. |
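
As a rough model of the per-relfilenode decision, consider the sketch
below. It is not the RelationNeedsWAL() macro itself; the struct fields,
the enum, and the function name are assumptions chosen to mirror the
wording of this section.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_wal_level { SK_WAL_MINIMAL, SK_WAL_REPLICA };

struct sk_relation
{
    bool    permanent;                  /* not a temp or unlogged relation */
    bool    relfilenode_new_in_xact;    /* ROLLBACK would unlink it */
};

/* Would changes to this relation's blocks be WAL-logged? */
bool
sk_relation_needs_wal(const struct sk_relation *rel, enum sk_wal_level level)
{
    if (!rel->permanent)
        return false;
    if (level != SK_WAL_MINIMAL)
        return true;
    /* wal_level=minimal: skip WAL for relfilenodes created in this xact. */
    return !rel->relfilenode_new_in_xact;
}
```

In the CREATE TABLE / ALTER TABLE example, table t would be modelled as
{true, false} and its new TOAST relation as {true, true}: under
SK_WAL_MINIMAL the first writes WAL and the second skips it.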
| |
| |
| Asynchronous Commit |
| ------------------- |
| |
| As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e., |
| we don't wait while the WAL record for the commit is fsync'ed. |
| We perform an asynchronous commit when synchronous_commit = off. Instead |
| of performing an XLogFlush() up to the LSN of the commit, we merely note |
| the LSN in shared memory. The backend then continues with other work. |
| We record the LSN only for an asynchronous commit, not an abort; there's |
| never any need to flush an abort record, since the presumption after a |
| crash would be that the transaction aborted anyway. |
| |
| We always force synchronous commit when the transaction is deleting |
| relations, to ensure the commit record is down to disk before the relations |
| are removed from the filesystem. Also, certain utility commands that have |
| non-roll-backable side effects (such as filesystem changes) force sync |
| commit to minimize the window in which the filesystem change has been made |
| but the transaction isn't guaranteed committed. |
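
The conditions under which a backend waits for the WAL flush at transaction
end can be written as one predicate. This is a sketch of the rules stated
in the two paragraphs above only; the function and parameter names are
hypothetical, and the real commit path has further conditions not modelled
here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Does the backend have to wait for WAL to be flushed at transaction end?
 *   is_commit          - false for an abort
 *   synchronous_commit - the synchronous_commit setting is on
 *   deletes_relations  - the transaction drops relation files
 *   forced_sync        - a utility command requested a sync commit
 */
bool
sk_must_flush_at_end(bool is_commit, bool synchronous_commit,
                     bool deletes_relations, bool forced_sync)
{
    if (!is_commit)
        return false;           /* abort records never need flushing */
    return synchronous_commit || deletes_relations || forced_sync;
}
```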
| |
| The walwriter regularly wakes up (via wal_writer_delay) or is woken up |
| (via its latch, which is set by backends committing asynchronously) and |
| performs an XLogBackgroundFlush(). This checks the location of the last |
| completely filled WAL page. If that has moved forwards, then we write all |
| the changed buffers up to that point, so that under full load we write |
| only whole buffers. If there has been a break in activity and the current |
| WAL page is the same as before, then we find out the LSN of the most |
| recent asynchronous commit, and write up to that point, if required (i.e. |
| if it's in the current WAL page). If more than wal_writer_delay has |
| passed, or more than wal_writer_flush_after blocks have been written, since |
| the last flush, WAL is also flushed up to the current location. This |
| arrangement in itself would guarantee that an async commit record reaches |
| disk after at most two times wal_writer_delay after the transaction |
| completes. However, we also allow XLogFlush to write/flush full buffers |
| "flexibly" (ie, not wrapping around at the end of the circular WAL buffer |
| area), so as to minimize the number of writes issued under high load when |
| multiple WAL pages are filled per walwriter cycle. This makes the worst-case |
| delay three wal_writer_delay cycles. |
| |
| There are some other subtle points to consider with asynchronous commits. |
| First, for each page of CLOG we must remember the LSN of the latest commit |
| affecting the page, so that we can enforce the same flush-WAL-before-write |
| rule that we do for ordinary relation pages. Otherwise the record of the |
| commit might reach disk before the WAL record does. Again, abort records |
| need not factor into this consideration. |
| |
| In fact, we store more than one LSN for each clog page. This relates to |
| the way we set transaction status hint bits during visibility tests. |
| We must not set a transaction-committed hint bit on a relation page and |
| have that record make it to disk prior to the WAL record of the commit. |
| Since visibility tests are normally made while holding buffer share locks, |
| we do not have the option of changing the page's LSN to guarantee WAL |
| synchronization. Instead, we defer the setting of the hint bit if we have |
| not yet flushed WAL as far as the LSN associated with the transaction. |
| This requires tracking the LSN of each unflushed async commit. It is |
| convenient to associate this data with clog buffers: because we will flush |
| WAL before writing a clog page, we know that we do not need to remember a |
| transaction's LSN longer than the clog page holding its commit status |
| remains in memory. However, the naive approach of storing an LSN for each |
| clog position is unattractive: the LSNs are 32x bigger than the two-bit |
| commit status fields, and so we'd need 256K of additional shared memory for |
| each 8K clog buffer page. We choose instead to store a smaller number of |
| LSNs per page, where each LSN is the highest LSN associated with any |
| transaction commit in a contiguous range of transaction IDs on that page. |
| This saves storage at the price of some possibly-unnecessary delay in |
| setting transaction hint bits. |
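
A minimal sketch of the grouped-LSN idea follows. The group size and page
size are deliberately tiny so the behaviour is easy to trace, and all names
are invented for this example; the real clog code differs in layout and
locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_XACTS_PER_GROUP 4    /* tiny group size, for the demo only */
#define SK_GROUPS_PER_PAGE 8    /* so one toy page covers 32 XIDs */

struct sk_clog_page
{
    /* Highest commit LSN seen for any XID in each contiguous XID range. */
    uint64_t    group_lsn[SK_GROUPS_PER_PAGE];
};

/* Which LSN slot on the page does this XID share? */
unsigned
sk_group_of(uint32_t xid)
{
    return (xid % (SK_XACTS_PER_GROUP * SK_GROUPS_PER_PAGE))
        / SK_XACTS_PER_GROUP;
}

/* Remember an async commit: keep only the maximum LSN per group. */
void
sk_note_async_commit(struct sk_clog_page *page, uint32_t xid,
                     uint64_t commit_lsn)
{
    unsigned    g = sk_group_of(xid);

    if (commit_lsn > page->group_lsn[g])
        page->group_lsn[g] = commit_lsn;
}

/* A committed hint bit may be set only once WAL is flushed this far. */
bool
sk_can_set_commit_hint(const struct sk_clog_page *page, uint32_t xid,
                       uint64_t flushed_lsn)
{
    return page->group_lsn[sk_group_of(xid)] <= flushed_lsn;
}
```

If XID 5 commits at LSN 100 and XID 6 (same group) at LSN 150, then with
WAL flushed to 120 the hint for XID 5 is still deferred. That is the
"possibly-unnecessary delay" the text mentions.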
| |
| How many transactions should share the same cached LSN (N)? If the |
| system's workload consists only of small async-commit transactions, then |
| it's reasonable to have N similar to the number of transactions per |
| walwriter cycle, since that is the granularity with which transactions will |
| become truly committed (and thus hintable) anyway. The worst case is where |
| a sync-commit xact shares a cached LSN with an async-commit xact that |
| commits a bit later; even though we paid to sync the first xact to disk, |
| we won't be able to hint its outputs until the second xact is sync'd, up to |
| three walwriter cycles later. This argues for keeping N (the group size) |
| as small as possible. For the moment we are setting the group size to 32, |
| which makes the LSN cache space the same size as the actual clog buffer |
| space (independently of BLCKSZ). |
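
The space figures quoted above can be checked with a few lines of
arithmetic. The helper below is only a worked calculation, assuming 8-byte
LSNs and two status bits per transaction as the text states; the names are
made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SK_LSN_BYTES      8     /* a 64-bit LSN */
#define SK_XACTS_PER_BYTE 4     /* two status bits per transaction */

/* Bytes of LSN cache needed per clog page for a given group size N. */
uint64_t
sk_lsn_bytes_per_page(uint64_t blcksz, uint64_t group_size)
{
    uint64_t    xacts_per_page = blcksz * SK_XACTS_PER_BYTE;

    assert(group_size > 0 && xacts_per_page % group_size == 0);
    return (xacts_per_page / group_size) * SK_LSN_BYTES;
}
```

With blcksz = 8192, a group size of 1 gives 262144 bytes (the 256K of the
naive approach), while a group size of 32 gives 8192 bytes, the same as the
clog page itself. The same equality holds for any block size, because
4 * 8 / 32 = 1.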
| |
| It is useful that we can run both synchronous and asynchronous commit |
| transactions concurrently, but the safety of this is perhaps not |
| immediately obvious. Assume we have two transactions, T1 and T2. The Log |
| Sequence Number (LSN) is the point in the WAL sequence where a transaction |
| commit is recorded, so LSN1 and LSN2 are the commit records of those |
| transactions. If T2 can see changes made by T1 then when T2 commits it |
| must be true that LSN2 follows LSN1. Thus when T2 commits it is certain |
| that all of the changes made by T1 are also now recorded in the WAL. This |
| is true whether T1 was asynchronous or synchronous. As a result, it is |
| safe for asynchronous commits and synchronous commits to work concurrently |
| without endangering data written by synchronous commits. Sub-transactions |
| are not important here since the final write to disk only occurs at the |
| commit of the top level transaction. |
| |
| Changes to data blocks cannot reach disk unless WAL is flushed up to the |
| point of the LSN of the data blocks. Any attempt to write unsafe data to |
| disk will first trigger a WAL flush, which ensures the safety of all data |
| written by that and prior transactions. Data blocks and clog pages are both |
| protected by LSNs. |
| |
| Changes to a temp table are not WAL-logged, hence could reach disk in |
| advance of T1's commit, but we don't care since temp table contents don't |
| survive crashes anyway. |
| |
| Database writes that skip WAL for new relfilenodes are also safe. In these |
| cases it's entirely possible for the data to reach disk before T1's commit, |
| because T1 will fsync it down to disk without any sort of interlock. However, |
| all these paths are designed to write data that no other transaction can see |
| until after T1 commits. The situation is thus not different from ordinary |
| WAL-logged updates. |
| |
| |
| Transaction Emulation during Recovery |
| ------------------------------------- |
| |
| During Recovery we replay transaction changes in the order they occurred. |
| As part of this replay we emulate some transactional behaviour, so that |
| read only backends can take MVCC snapshots. We do this by maintaining an |
| array of XIDs belonging to transactions that are being replayed, so that |
| each transaction that has written WAL records for database writes exists |
| in the array until it commits. Further details are given in comments in |
| procarray.c. |
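
The bookkeeping described above can be pictured with a toy array. This is
not the data structure in procarray.c; it is a deliberately simple,
unsorted array with hypothetical names, meant only to show when an XID
enters and leaves the set that snapshots treat as running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_MAX_XIDS 64          /* arbitrary capacity for this sketch */

struct sk_known_xids
{
    uint32_t    xids[SK_MAX_XIDS];
    int         count;
};

/* Return the array index of xid, or -1 if it is not tracked. */
int
sk_find_xid(const struct sk_known_xids *k, uint32_t xid)
{
    for (int i = 0; i < k->count; i++)
        if (k->xids[i] == xid)
            return i;
    return -1;
}

/* Replaying a WAL record for a database write: start tracking its XID. */
void
sk_replay_write(struct sk_known_xids *k, uint32_t xid)
{
    if (sk_find_xid(k, xid) < 0)
    {
        assert(k->count < SK_MAX_XIDS);
        k->xids[k->count++] = xid;
    }
}

/* Replaying the commit record: the XID is no longer running. */
void
sk_replay_commit(struct sk_known_xids *k, uint32_t xid)
{
    int         i = sk_find_xid(k, xid);

    if (i >= 0)
        k->xids[i] = k->xids[--k->count];   /* move last entry into the gap */
}

/* What a snapshot taken on the standby would ask. */
bool
sk_xid_in_progress(const struct sk_known_xids *k, uint32_t xid)
{
    return sk_find_xid(k, xid) >= 0;
}
```

A read-only transaction never appears in the array, since it writes no WAL
record that would trigger sk_replay_write().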
| |
| Many actions write no WAL records at all, for example read only transactions. |
| These have no effect on MVCC in recovery and we can pretend they never |
| occurred at all. Subtransaction commit does not write a WAL record either |
| and has very little effect, since lock waiters need to wait for the |
| parent transaction to complete. |
| |
| Not all transactional behaviour is emulated, for example we do not insert |
| a transaction entry into the lock table, nor do we maintain the transaction |
| stack in memory. Clog, multixact and commit_ts entries are made normally. |
| Subtrans is maintained during recovery but the details of the transaction |
| tree are ignored and all subtransactions reference the top-level TransactionId |
| directly. Since commit is atomic this provides correct lock wait behaviour |
| yet simplifies emulation of subtransactions considerably. |
| |
| Further details on locking mechanics in recovery are given in comments |
| with the Lock rmgr code. |
| |
| |
| Distributed Transaction Emulation during Recovery |
| ------------------------------------------------- |
| |
| In GPDB, the MVCC snapshot also includes distributed transactions (aka dtx). |
| Accordingly, on a hot standby we also emulate running dtx. We do that by |
| re-using shmCommittedGxidArray, which the primary uses for dtx recovery: it |
| tracks every 2PC dtx whose PREPARE phase is done but whose COMMIT phase |
| hasn't finished (i.e. the window between the XLOG_XACT_DISTRIBUTED_COMMIT |
| record being written and the XLOG_XACT_DISTRIBUTED_FORGET record being written |
| on the QD). On a hot standby, any dtx in that array is regarded as in-progress. |
| The MVCC snapshot does not need to account for a dtx that is not in the array: |
| for a dtx that hasn't done PREPARE, we know no segment has committed any data |
| yet; for a dtx that has finished COMMIT, we know all segments have committed |
| their data. |
| |
| Note: dtxes that are preparing will not be tracked in this array, and thus will |
| not be included in this snapshot. This is slightly different from a primary QD, |
| where such transactions would have been included in the distributed snapshot's |
| inProgressXidArray (as we construct the inProgressXidArray from the PGXACTs that |
| would contain the dummy entries for prepared transactions). However, as |
| mentioned in CreateDistributedSnapshot, including these is not a requirement for |
| correctness. |
| |
| Note: aborted/aborting dtxes are not accounted for by the standby either. Those |
| are the dtxes that encountered an error while preparing. As with the previous |
| point, the standby does not need to be aware of them for correctness. It is also |
| worth noting that if a dtx encounters an error after being prepared, it can no |
| longer be aborted and must be committed by the dtx recovery process. Until |
| committed, such a dtx is seen as in-progress by the standby. |
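
The cases discussed so far can be condensed into one classification. The
enum below is a hypothetical simplification of a 2PC dtx's life cycle as
described in this section; it is not a type that exists in the code.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified life cycle of a 2PC dtx, as seen through replayed WAL. */
enum sk_dtx_state
{
    SK_DTX_PREPARING,       /* PREPARE not yet done on all segments */
    SK_DTX_ABORTING,        /* hit an error while preparing */
    SK_DTX_COMMIT_PENDING,  /* DISTRIBUTED_COMMIT written, FORGET not yet */
    SK_DTX_FORGOTTEN        /* DISTRIBUTED_FORGET written */
};

/* Is the dtx in the array, i.e. in-progress for a standby snapshot? */
bool
sk_standby_sees_in_progress(enum sk_dtx_state state)
{
    return state == SK_DTX_COMMIT_PENDING;
}
```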
| |
| For 1PC dtx, however, there is a known limitation: the hot standby won't see |
| the last 1PC (or the last few, if the most recent dtxes are all 1PC). Because |
| a 1PC dtx writes no WAL on the QD, the standby QD won't advance its |
| latestCompletedGxid, so its distributed snapshot horizon does not include the |
| last 1PC - it views the last 1PC as not yet started or at best still in |
| progress. Only when a later 2PC dtx arrives does the standby advance its |
| latestCompletedGxid, so that its distributed snapshot includes the previous 1PC. |
| |
| We don't emulate the full architecture of "running transaction" for dtx because |
| that is unnecessary, at least for now. For example, we don't create a dtx-version |
| of XLOG_RUNNING_XACTS, because we already have that information as part of the |
| extended checkpoint (see TMGXACT_CHECKPOINT). We also don't need to emulate |
| other members in RunningTransactionsData, like subxid or xid-pruning related |
| variables because those do not apply to dtx. |