| src/backend/access/heap/README.HOT |
| |
| Heap Only Tuples (HOT) |
| ====================== |
| |
| The Heap Only Tuple (HOT) feature eliminates redundant index entries and |
| allows the re-use of space taken by DELETEd or obsoleted UPDATEd tuples |
| without performing a table-wide vacuum. It does this by allowing |
| single-page vacuuming, also called "defragmentation" or "pruning". |
| |
| Note: there is a Glossary at the end of this document that may be helpful |
| for first-time readers. |
| |
| |
| Technical Challenges |
| -------------------- |
| |
| Page-at-a-time vacuuming is normally impractical because of the costs of |
| finding and removing the index entries that link to the tuples to be |
| reclaimed. Standard vacuuming scans the indexes to ensure all such index |
| entries are removed, amortizing the index scan cost across as many dead |
| tuples as possible; this approach does not scale down well to the case of |
| reclaiming just a few tuples. In principle one could recompute the index |
| keys and do standard index searches to find the index entries, but this is |
| risky in the presence of possibly-buggy user-defined functions in |
| functional indexes. An allegedly immutable function that in fact is not |
| immutable might prevent us from re-finding an index entry (and we cannot |
| throw an error for not finding it, in view of the fact that dead index |
| entries are sometimes reclaimed early). That would lead to a seriously |
| corrupt index, in the form of entries pointing to tuple slots that by now |
| contain some unrelated content. In any case we would prefer to be able |
| to do vacuuming without invoking any user-written code. |
| |
| HOT solves this problem for two restricted but useful special cases: |
| |
| First, where a tuple is repeatedly updated in ways that do not change |
| its indexed columns. (Here, "indexed column" means any column referenced |
| at all in an index definition, including for example columns that are |
| tested in a partial-index predicate but are not stored in the index.) |
| |
| Second, where the modified columns are only used in indexes that do not |
| contain tuple IDs, but maintain summaries of the indexed data by block. |
| As these indexes don't contain references to individual tuples, they |
| can't remove tuple references in VACUUM, and thus don't need to get a new |
| and unique reference to a tuple. These indexes still need to be notified |
| of the new column data, but don't need a new HOT chain to be established. |
| |
| An additional property of HOT is that it reduces index size by avoiding |
| the creation of identically-keyed index entries. This improves search |
| speeds. |
| |
| |
| Update Chains With a Single Index Entry |
| --------------------------------------- |
| |
| Without HOT, every version of a row in an update chain has its own index |
| entries, even if all indexed columns are the same. With HOT, a new tuple |
| placed on the same page and with all indexed columns the same as its |
| parent row version does not get new index entries. This means there is |
| only one index entry for the entire update chain on the heap page. |
| An index-entry-less tuple is marked with the HEAP_ONLY_TUPLE flag. |
| The prior row version is marked HEAP_HOT_UPDATED, and (as always in an |
| update chain) its t_ctid field links forward to the newer version. |
| |
| For example: |
| |
| Index points to 1 |
| lp [1] [2] |
| |
| [111111111]->[2222222222] |
| |
| In the above diagram, the index points to line pointer 1, and tuple 1 is |
| marked as HEAP_HOT_UPDATED. Tuple 2 is a HOT tuple, meaning it has |
| no index entry pointing to it, and is marked as HEAP_ONLY_TUPLE. |
| Although tuple 2 is not directly referenced by the index, it can still be |
| found by an index search: after traversing from the index to tuple 1, |
| the index search proceeds forward to child tuples as long as it sees the |
| HEAP_HOT_UPDATED flag set. Since we restrict the HOT chain to lie within |
| a single page, this requires no additional page fetches and doesn't |
| introduce much performance penalty. |
| |
| Eventually, tuple 1 will no longer be visible to any transaction. |
| At that point its space could be reclaimed, but its line pointer cannot, |
| since the index still links to that line pointer and we still need to |
| be able to find tuple 2 in an index search. HOT handles this by turning |
| line pointer 1 into a "redirecting line pointer", which links to tuple 2 |
| but has no actual tuple attached. This state of affairs looks like |
| |
| Index points to 1 |
| lp [1]->[2] |
| |
| [2222222222] |
| |
| If now the row is updated again, to version 3, the page looks like this: |
| |
| Index points to 1 |
| lp [1]->[2] [3] |
| |
| [2222222222]->[3333333333] |
| |
| At some later time when no transaction can see tuple 2 in its snapshot, |
| tuple 2 and its line pointer can be pruned entirely: |
| |
| Index points to 1 |
| lp [1]------>[3] |
| |
| [3333333333] |
| |
| This is safe because no index entry points to line pointer 2. Subsequent |
| insertions into the page can now recycle both line pointer 2 and the |
| space formerly used by tuple 2. |
| |
| If an update changes any column indexed by a non-summarizing indexes, or |
| if there is not room on the same page for the new tuple, then the HOT |
| chain ends: the last member has a regular t_ctid link to the next version |
| and is not marked HEAP_HOT_UPDATED. (In principle we could continue a |
| HOT chain across pages, but this would destroy the desired property of |
| being able to reclaim space with just page-local manipulations. Anyway, |
| we don't want to have to chase through multiple heap pages to get from an |
| index entry to the desired tuple, so it seems better to create a new |
| index entry for the new tuple.) If further updates occur, the next |
| version could become the root of a new HOT chain. |
| |
| Line pointer 1 has to remain as long as there is any non-dead member of |
| the chain on the page. When there is not, it is marked "dead". |
| This lets us reclaim the last child line pointer and associated tuple |
| immediately. The next regular VACUUM pass can reclaim the index entries |
| pointing at the line pointer and then the line pointer itself. Since a |
| line pointer is small compared to a tuple, this does not represent an |
| undue space cost. |
| |
| Note: we can use a "dead" line pointer for any DELETEd tuple, |
| whether it was part of a HOT chain or not. This allows space reclamation |
| in advance of running VACUUM for plain DELETEs as well as HOT updates. |
| |
| The requirement for doing a HOT update is that indexes which point to |
| the root line pointer (and thus need to be cleaned up by VACUUM when the |
| tuple is dead) do not reference columns which are updated in that HOT |
| chain. Summarizing indexes (such as BRIN) are assumed to have no |
| references to individual tuples and thus are ignored when checking HOT |
| applicability. The updated columns are checked at execution time by |
| comparing the binary representation of the old and new values. We insist |
| on bitwise equality rather than using datatype-specific equality routines. |
| The main reason to avoid the latter is that there might be multiple |
| notions of equality for a datatype, and we don't know exactly which one |
| is relevant for the indexes at hand. We assume that bitwise equality |
| guarantees equality for all purposes. |
| |
| If any columns that are included by non-summarizing indexes are updated, |
| the HOT optimization is not applied, and the new tuple is inserted into |
| all indexes of the table. If none of the updated columns are included in |
| the table's indexes, the HOT optimization is applied and no indexes are |
| updated. If instead the updated columns are only indexed by summarizing |
| indexes, the HOT optimization is applied, but the update is propagated to |
| all summarizing indexes. (Realistically, we only need to propagate the |
| update to the indexes that contain the updated values, but that is yet to |
| be implemented.) |
| |
| Abort Cases |
| ----------- |
| |
| If a heap-only tuple's xmin is aborted, then it can be removed immediately: |
| it was never visible to any other transaction, and all descendant row |
| versions must be aborted as well. Therefore we need not consider it part |
| of a HOT chain. By the same token, if a HOT-updated tuple's xmax is |
| aborted, there is no need to follow the chain link. However, there is a |
| race condition here: the transaction that did the HOT update might abort |
| between the time we inspect the HOT-updated tuple and the time we reach |
| the descendant heap-only tuple. It is conceivable that someone prunes |
| the heap-only tuple before that, and even conceivable that the line pointer |
| is re-used for another purpose. Therefore, when following a HOT chain, |
| it is always necessary to be prepared for the possibility that the |
| linked-to line pointer is unused, dead, or redirected; and if it is a |
| normal line pointer, we still have to check that XMIN of the tuple matches |
| the XMAX of the tuple we left. Otherwise we should assume that we have |
| come to the end of the HOT chain. Note that this sort of XMIN/XMAX |
| matching is required when following ordinary update chains anyway. |
| |
| (Early versions of the HOT code assumed that holding pin on the page |
| buffer while following a HOT link would prevent this type of problem, |
| but checking XMIN/XMAX matching is a much more robust solution.) |
| |
| |
| Index/Sequential Scans |
| ---------------------- |
| |
| When doing an index scan, whenever we reach a HEAP_HOT_UPDATED tuple whose |
| xmax is not aborted, we need to follow its t_ctid link and check that |
| entry as well; possibly repeatedly until we reach the end of the HOT |
| chain. (When using an MVCC snapshot it is possible to optimize this a |
| bit: there can be at most one visible tuple in the chain, so we can stop |
| when we find it. This rule does not work for non-MVCC snapshots, though.) |
| |
| Sequential scans do not need to pay attention to the HOT links because |
| they scan every line pointer on the page anyway. The same goes for a |
| bitmap heap scan with a lossy bitmap. |
| |
| |
| Pruning |
| ------- |
| |
| HOT pruning means updating line pointers so that HOT chains are |
| reduced in length, by collapsing out line pointers for intermediate dead |
| tuples. Although this makes those line pointers available for re-use, |
| it does not immediately make the space occupied by their tuples available. |
| |
| |
| Defragmentation |
| --------------- |
| |
| Defragmentation centralizes unused space. After we have converted root |
| line pointers to redirected line pointers and pruned away any dead |
| intermediate line pointers, the tuples they linked to are free space. |
| But unless that space is adjacent to the central "hole" on the page |
| (the pd_lower-to-pd_upper area) it cannot be used by tuple insertion. |
| Defragmentation moves the surviving tuples to coalesce all the free |
| space into one "hole". This is done with the same PageRepairFragmentation |
| function that regular VACUUM uses. |
| |
| |
| When can/should we prune or defragment? |
| --------------------------------------- |
| |
| This is the most interesting question in HOT implementation, since there |
| is no simple right answer: we must use heuristics to determine when it's |
| most efficient to perform pruning and/or defragmenting. |
| |
| We cannot prune or defragment unless we can get a "buffer cleanup lock" |
| on the target page; otherwise, pruning might destroy line pointers that |
| other backends have live references to, and defragmenting might move |
| tuples that other backends have live pointers to. Thus the general |
| approach must be to heuristically decide if we should try to prune |
| or defragment, and if so try to acquire the buffer cleanup lock without |
| blocking. If we succeed we can proceed with our housekeeping work. |
| If we cannot get the lock (which should not happen often, except under |
| very heavy contention) then the housekeeping has to be postponed till |
| some other time. The worst-case consequence of this is only that an |
| UPDATE cannot be made HOT but has to link to a new tuple version placed on |
| some other page, for lack of centralized space on the original page. |
| |
| Ideally we would do defragmenting only when we are about to attempt |
| heap_update on a HOT-safe tuple. The difficulty with this approach |
| is that the update query has certainly got a pin on the old tuple, and |
| therefore our attempt to acquire a buffer cleanup lock will always fail. |
| (This corresponds to the idea that we don't want to move the old tuple |
| out from under where the query's HeapTuple pointer points. It might |
| be possible to finesse that, but it seems fragile.) |
| |
| Pruning, however, is potentially useful even when we are not about to |
| insert a new tuple, since shortening a HOT chain reduces the cost of |
| subsequent index searches. However it is unclear that this gain is |
| large enough to accept any extra maintenance burden for. |
| |
| The currently planned heuristic is to prune and defrag when first accessing |
| a page that potentially has prunable tuples (as flagged by the pd_prune_xid |
| page hint field) and that either has free space less than MAX(fillfactor |
| target free space, BLCKSZ/10) *or* has recently had an UPDATE fail to |
| find enough free space to store an updated tuple version. (These rules |
| are subject to change.) |
| |
| We have effectively implemented the "truncate dead tuples to just line |
| pointer" idea that has been proposed and rejected before because of fear |
| of line pointer bloat: we might end up with huge numbers of line pointers |
| and just a few actual tuples on a page. To limit the damage in the worst |
| case, and to keep various work arrays as well as the bitmaps in bitmap |
| scans reasonably sized, the maximum number of line pointers per page |
| is arbitrarily capped at MaxHeapTuplesPerPage (the most tuples that |
| could fit without HOT pruning). |
| |
| Effectively, space reclamation happens during tuple retrieval when the |
| page is nearly full (<10% free) and a buffer cleanup lock can be |
| acquired. This means that UPDATE, DELETE, and SELECT can trigger space |
| reclamation, but often not during INSERT ... VALUES because it does |
| not retrieve a row. |
| |
| |
| VACUUM |
| ------ |
| |
| There is little change to regular vacuum. It performs pruning to remove |
| dead heap-only tuples, and cleans up any dead line pointers as if they were |
| regular dead tuples. |
| |
| |
| Statistics |
| ---------- |
| |
| Currently, we count HOT updates the same as cold updates for statistics |
| purposes, though there is an additional per-table counter that counts |
| only HOT updates. When a page pruning operation is able to remove a |
| physical tuple by eliminating an intermediate heap-only tuple or |
| replacing a physical root tuple by a redirect pointer, a decrement in |
| the table's number of dead tuples is reported to pgstats, which may |
| postpone autovacuuming. Note that we do not count replacing a root tuple |
| by a DEAD line pointer as decrementing dead_tuples; we still want |
| autovacuum to run to clean up the index entries and DEAD item. |
| |
| This area probably needs further work ... |
| |
| |
| CREATE INDEX |
| ------------ |
| |
| CREATE INDEX presents a problem for HOT updates. While the existing HOT |
| chains all have the same index values for existing indexes, the columns |
| in the new index might change within a pre-existing HOT chain, creating |
| a "broken" chain that can't be indexed properly. |
| |
| To address this issue, regular (non-concurrent) CREATE INDEX makes the |
| new index usable only by new transactions and transactions that don't |
| have snapshots older than the CREATE INDEX command. This prevents |
| queries that can see the inconsistent HOT chains from trying to use the |
| new index and getting incorrect results. Queries that can see the index |
| can only see the rows that were visible after the index was created, |
| hence the HOT chains are consistent for them. |
| |
| Entries in the new index point to root tuples (tuples with current index |
| pointers) so that our index uses the same index pointers as all other |
| indexes on the table. However the row we want to index is actually at |
| the *end* of the chain, ie, the most recent live tuple on the HOT chain. |
| That is the one we compute the index entry values for, but the TID |
| we put into the index is that of the root tuple. Since queries that |
| will be allowed to use the new index cannot see any of the older tuple |
| versions in the chain, the fact that they might not match the index entry |
| isn't a problem. (Such queries will check the tuple visibility |
| information of the older versions and ignore them, without ever looking at |
| their contents, so the content inconsistency is OK.) Subsequent updates |
| to the live tuple will be allowed to extend the HOT chain only if they are |
| HOT-safe for all the indexes. |
| |
| Because we have ShareLock on the table, any DELETE_IN_PROGRESS or |
| INSERT_IN_PROGRESS tuples should have come from our own transaction. |
| Therefore we can consider them committed since if the CREATE INDEX |
| commits, they will be committed, and if it aborts the index is discarded. |
| An exception to this is that early lock release is customary for system |
| catalog updates, and so we might find such tuples when reindexing a system |
| catalog. In that case we deal with it by waiting for the source |
| transaction to commit or roll back. (We could do that for user tables |
| too, but since the case is unexpected we prefer to throw an error.) |
| |
| Practically, we prevent certain transactions from using the new index by |
| setting pg_index.indcheckxmin to TRUE. Transactions are allowed to use |
| such an index only after pg_index.xmin is below their TransactionXmin |
| horizon, thereby ensuring that any incompatible rows in HOT chains are |
| dead to them. (pg_index.xmin will be the XID of the CREATE INDEX |
| transaction. The reason for using xmin rather than a normal column is |
| that the regular vacuum freezing mechanism will take care of converting |
| xmin to FrozenTransactionId before it can wrap around.) |
| |
| This means in particular that the transaction creating the index will be |
| unable to use the index if the transaction has old snapshots. We |
| alleviate that problem somewhat by not setting indcheckxmin unless the |
| table actually contains HOT chains with RECENTLY_DEAD members. |
| |
| Another unpleasant consequence is that it is now risky to use SnapshotAny |
| in an index scan: if the index was created more recently than the last |
| vacuum, it's possible that some of the visited tuples do not match the |
| index entry they are linked to. This does not seem to be a fatal |
| objection, since there are few users of SnapshotAny and most use seqscans. |
| The only exception at this writing is CLUSTER, which is okay because it |
| does not require perfect ordering of the indexscan readout (and especially |
| so because CLUSTER tends to write recently-dead tuples out of order anyway). |
| |
| |
| CREATE INDEX CONCURRENTLY |
| ------------------------- |
| |
| In the concurrent case we must take a different approach. We create the |
| pg_index entry immediately, before we scan the table. The pg_index entry |
| is marked as "not ready for inserts". Then we commit and wait for any |
| transactions which have the table open to finish. This ensures that no |
| new HOT updates will change the key value for our new index, because all |
| transactions will see the existence of the index and will respect its |
| constraint on which updates can be HOT. Other transactions must include |
| such an index when determining HOT-safety of updates, even though they |
| must ignore it for both insertion and searching purposes. |
| |
| We must do this to avoid making incorrect index entries. For example, |
| suppose we are building an index on column X and we make an index entry for |
| a non-HOT tuple with X=1. Then some other backend, unaware that X is an |
| indexed column, HOT-updates the row to have X=2, and commits. We now have |
| an index entry for X=1 pointing at a HOT chain whose live row has X=2. |
| We could make an index entry with X=2 during the validation pass, but |
| there is no nice way to get rid of the wrong entry with X=1. So we must |
| have the HOT-safety property enforced before we start to build the new |
| index. |
| |
| After waiting for transactions which had the table open, we build the index |
| for all rows that are valid in a fresh snapshot. Any tuples visible in the |
| snapshot will have only valid forward-growing HOT chains. (They might have |
| older HOT updates behind them which are broken, but this is OK for the same |
| reason it's OK in a regular index build.) As above, we point the index |
| entry at the root of the HOT-update chain but we use the key value from the |
| live tuple. |
| |
| We mark the index open for inserts (but still not ready for reads) then |
| we again wait for transactions which have the table open. Then we take |
| a second reference snapshot and validate the index. This searches for |
| tuples missing from the index, and inserts any missing ones. Again, |
| the index entries have to have TIDs equal to HOT-chain root TIDs, but |
| the value to be inserted is the one from the live tuple. |
| |
| Then we wait until every transaction that could have a snapshot older than |
| the second reference snapshot is finished. This ensures that nobody is |
| alive any longer who could need to see any tuples that might be missing |
| from the index, as well as ensuring that no one can see any inconsistent |
| rows in a broken HOT chain (the first condition is stronger than the |
| second). Finally, we can mark the index valid for searches. |
| |
| Note that we do not need to set pg_index.indcheckxmin in this code path, |
| because we have outwaited any transactions that would need to avoid using |
| the index. (indcheckxmin is only needed because non-concurrent CREATE |
| INDEX doesn't want to wait; its stronger lock would create too much risk of |
| deadlock if it did.) |
| |
| |
| DROP INDEX CONCURRENTLY |
| ----------------------- |
| |
| DROP INDEX CONCURRENTLY is sort of the reverse sequence of CREATE INDEX |
| CONCURRENTLY. We first mark the index as not indisvalid, and then wait for |
| any transactions that could be using it in queries to end. (During this |
| time, index updates must still be performed as normal, since such |
| transactions might expect freshly inserted tuples to be findable.) |
| Then, we clear indisready and indislive, and again wait for transactions |
| that could be updating the index to end. Finally we can drop the index |
| normally (though taking only ShareUpdateExclusiveLock on its parent table). |
| |
| The reason we need the pg_index.indislive flag is that after the second |
| wait step begins, we don't want transactions to be touching the index at |
| all; otherwise they might suffer errors if the DROP finally commits while |
| they are reading catalog entries for the index. If we had only indisvalid |
| and indisready, this state would be indistinguishable from the first stage |
| of CREATE INDEX CONCURRENTLY --- but in that state, we *do* want |
| transactions to examine the index, since they must consider it in |
| HOT-safety checks. |
| |
| |
| Limitations and Restrictions |
| ---------------------------- |
| |
| It is worth noting that HOT forever forecloses alternative approaches |
| to vacuuming, specifically the recompute-the-index-keys approach alluded |
| to in Technical Challenges above. It'll be tough to recompute the index |
| keys for a root line pointer you don't have data for anymore ... |
| |
| |
| Glossary |
| -------- |
| |
| Broken HOT Chain |
| |
| A HOT chain in which the key value for an index has changed. |
| |
| This is not allowed to occur normally but if a new index is created |
| it can happen. In that case various strategies are used to ensure |
| that no transaction for which the older tuples are visible can |
| use the index. |
| |
| Cold update |
| |
| A normal, non-HOT update, in which index entries are made for |
| the new version of the tuple. |
| |
| Dead line pointer |
| |
| A stub line pointer, that does not point to anything, but cannot |
| be removed or reused yet because there are index pointers to it. |
| Semantically same as a dead tuple. It has state LP_DEAD. |
| |
| Heap-only tuple |
| |
| A heap tuple with no index pointers, which can only be reached |
| from indexes indirectly through its ancestral root tuple. |
| Marked with HEAP_ONLY_TUPLE flag. |
| |
| HOT-safe |
| |
| A proposed tuple update is said to be HOT-safe if it changes |
| none of the tuple's indexed columns. It will only become an |
| actual HOT update if we can find room on the same page for |
| the new tuple version. |
| |
| HOT update |
| |
| An UPDATE where the new tuple becomes a heap-only tuple, and no |
| new index entries are made. |
| |
| HOT-updated tuple |
| |
| An updated tuple, for which the next tuple in the chain is a |
| heap-only tuple. Marked with HEAP_HOT_UPDATED flag. |
| |
| Indexed column |
| |
| A column used in an index definition. The column might not |
| actually be stored in the index --- it could be used in a |
| functional index's expression, or used in a partial index |
| predicate. HOT treats all these cases alike. |
| |
| Redirecting line pointer |
| |
| A line pointer that points to another line pointer and has no |
| associated tuple. It has the special lp_flags state LP_REDIRECT, |
| and lp_off is the OffsetNumber of the line pointer it links to. |
| This is used when a root tuple becomes dead but we cannot prune |
| the line pointer because there are non-dead heap-only tuples |
| further down the chain. |
| |
| Root tuple |
| |
| The first tuple in a HOT update chain; the one that indexes point to. |
| |
| Update chain |
| |
| A chain of updated tuples, in which each tuple's ctid points to |
| the next tuple in the chain. A HOT update chain is an update chain |
| (or portion of an update chain) that consists of a root tuple and |
| one or more heap-only tuples. A complete update chain can contain |
| both HOT and non-HOT (cold) updated tuples. |