| src/backend/access/nbtree/README
| |
| Btree Indexing |
| ============== |
| |
| This directory contains a correct implementation of Lehman and Yao's |
| high-concurrency B-tree management algorithm (P. Lehman and S. Yao, |
| Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions |
| on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also |
| use a simplified version of the deletion logic described in Lanin and |
| Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, |
| Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). |
| |
| The basic Lehman & Yao Algorithm |
| -------------------------------- |
| |
| Compared to a classic B-tree, L&Y adds a right-link pointer to each page, |
| to the page's right sibling. It also adds a "high key" to each page, which |
| is an upper bound on the keys that are allowed on that page. These two |
| additions make it possible to detect a concurrent page split, which allows |
| the tree to be searched without holding any read locks (except to keep a |
| single page from being modified while reading it). |
| |
| When a search follows a downlink to a child page, it compares the page's |
| high key with the search key. If the search key is greater than the high |
| key, the page must have been split concurrently, and you must follow the
| right-link to find the new page containing the key range you're looking |
| for. This might need to be repeated, if the page has been split more than |
| once. |
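| |
| To make the move-right rule concrete, here is a minimal C sketch. The
| Page type and its fields are simplified stand-ins for the real page
| format, and all locking is ignored:
| |
|     #include <stddef.h>
| |
|     /* Illustrative sketch only; not PostgreSQL's actual structures */
|     typedef struct Page
|     {
|         int         high_key;       /* upper bound on keys for this page */
|         struct Page *right_link;    /* right sibling, NULL if rightmost */
|     } Page;
| |
|     /* Follow right-links laid down by concurrent splits until the
|      * current page's key range covers search_key.  A rightmost page
|      * covers everything, so it needs no high key check. */
|     static Page *
|     move_right(Page *page, int search_key)
|     {
|         while (page->right_link != NULL && search_key > page->high_key)
|             page = page->right_link;
|         return page;
|     }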
| |
| Lehman and Yao talk about alternating "separator" keys and downlinks in |
| internal pages rather than tuples or records. We use the term "pivot" |
| tuple to refer to tuples that don't point to heap tuples and are used
| only for tree navigation. All tuples on non-leaf pages and high keys on |
| leaf pages are pivot tuples. Since pivot tuples are only used to represent |
| which part of the key space belongs on each page, they can have attribute |
| values copied from non-pivot tuples that were deleted and killed by VACUUM |
| some time ago. A pivot tuple may contain a "separator" key and downlink, |
| just a separator key (i.e. the downlink value is implicitly undefined), or |
| just a downlink (i.e. all attributes are truncated away). |
| |
| The requirement that all btree keys be unique is satisfied by treating heap |
| TID as a tiebreaker attribute. Logical duplicates are sorted in heap TID |
| order. This is necessary because Lehman and Yao also require that the key |
| range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are |
| the adjacent keys in the parent page (Ki must be _strictly_ less than v, |
| which is assured by having reliably unique keys). Keys are always unique |
| on their level, with the exception of a leaf page's high key, which can be |
| fully equal to the last item on the page. |
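| |
| The tiebreak rule amounts to the following sketch (invented, simplified
| types; the real comparison is performed by _bt_compare() on tuples with
| an ItemPointerData TID):
| |
|     #include <stdint.h>
| |
|     typedef struct IndexKey
|     {
|         int32_t     user_key;   /* stands in for the indexed attribute(s) */
|         uint64_t    heap_tid;   /* heap block + offset, flattened here */
|     } IndexKey;
| |
|     static int
|     index_key_compare(const IndexKey *a, const IndexKey *b)
|     {
|         if (a->user_key != b->user_key)
|             return (a->user_key < b->user_key) ? -1 : 1;
|         /* logical duplicates: heap TID breaks the tie, so no two
|          * keys on a level ever compare as fully equal */
|         if (a->heap_tid != b->heap_tid)
|             return (a->heap_tid < b->heap_tid) ? -1 : 1;
|         return 0;
|     }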
| |
| The Postgres implementation of suffix truncation must make sure that the |
| Lehman and Yao invariants hold, and represents that absent/truncated |
| attributes in pivot tuples have the sentinel value "minus infinity". The |
| later section on suffix truncation will be helpful if it's unclear how the |
| Lehman & Yao invariants work with a real world example. |
| |
| Differences to the Lehman & Yao algorithm |
| ----------------------------------------- |
| |
| We have made the following changes in order to incorporate the L&Y algorithm |
| into Postgres: |
| |
| Lehman and Yao don't require read locks, but assume that in-memory |
| copies of tree pages are unshared. Postgres shares in-memory buffers |
| among backends. As a result, we do page-level read locking on btree |
| pages in order to guarantee that no record is modified while we are |
| examining it. This reduces concurrency but guarantees correct |
| behavior. |
| |
| We support the notion of an ordered "scan" of an index as well as |
| insertions, deletions, and simple lookups. A scan in the forward |
| direction is no problem: we just use the right-sibling pointers that
| L&Y require anyway. (Thus, once we have descended the tree to the |
| correct start point for the scan, the scan looks only at leaf pages |
| and never at higher tree levels.) To support scans in the backward |
| direction, we also store a "left sibling" link much like the "right |
| sibling". (This adds an extra step to the L&Y split algorithm: while |
| holding the write lock on the page being split, we also lock its former |
| right sibling to update that page's left-link. This is safe since no |
| writer of that page can be interested in acquiring a write lock on our |
| page.) A backwards scan has one additional bit of complexity: after |
| following the left-link we must account for the possibility that the |
| left sibling page got split before we could read it. So, we have to |
| move right until we find a page whose right-link matches the page we |
| came from. (Actually, it's even harder than that; see page deletion |
| discussion below.) |
| |
| Page read locks are held only for as long as a scan is examining a page. |
| To minimize lock/unlock traffic, an index scan always searches a leaf page |
| to identify all the matching items at once, copying their heap tuple IDs |
| into backend-local storage. The heap tuple IDs are then processed while |
| not holding any page lock within the index. We do continue to hold a pin |
| on the leaf page in some circumstances, to protect against concurrent |
| deletions (see below). In this state the scan is effectively stopped |
| "between" pages, either before or after the page it has pinned. This is |
| safe in the presence of concurrent insertions and even page splits, because |
| items are never moved across pre-existing page boundaries --- so the scan |
| cannot miss any items it should have seen, nor accidentally return the same |
| item twice. The scan must remember the page's right-link at the time it |
| was scanned, since that is the page to move right to; if we move right to |
| the current right-link then we'd re-scan any items moved by a page split. |
| We don't similarly remember the left-link, since it's best to use the most |
| up-to-date left-link when trying to move left (see detailed move-left |
| algorithm below). |
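| |
| The batched approach can be pictured with a simplified scan-position
| struct. This is only a sketch, loosely modeled on BTScanPosData; the
| field names and item bound are invented:
| |
|     #include <stdint.h>
| |
|     #define MAX_ITEMS_PER_PAGE 1024     /* illustrative bound only */
| |
|     typedef struct ScanPos
|     {
|         uint32_t    current_page;   /* block whose items were copied */
|         uint32_t    next_page;      /* right-link as of when it was read,
|                                      * NOT the current right-link */
|         int         n_items;        /* matching items found on the page */
|         uint64_t    heap_tids[MAX_ITEMS_PER_PAGE];  /* backend-local copy */
|     } ScanPos;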
| |
| In most cases we release our lock and pin on a page before attempting |
| to acquire pin and lock on the page we are moving to. In a few places |
| it is necessary to lock the next page before releasing the current one. |
| This is safe when moving right or up, but not when moving left or down |
| (else we'd create the possibility of deadlocks). |
| |
| Lehman and Yao fail to discuss what must happen when the root page |
| becomes full and must be split. Our implementation is to split the |
| root in the same way that any other page would be split, then construct |
| a new root page holding pointers to both of the resulting pages (which |
| now become siblings on the next level of the tree). The new root page |
| is then installed by altering the root pointer in the meta-data page (see |
| below). This works because the root is not treated specially in any |
| other way --- in particular, searches will move right using its link |
| pointer if the link is set. Therefore, searches will find the data |
| that's been moved into the right sibling even if they read the meta-data |
| page before it got updated. This is the same reasoning that makes a |
| split of a non-root page safe. The locking considerations are similar too. |
| |
| When an inserter recurses up the tree, splitting internal pages to insert |
| links to pages inserted on the level below, it is possible that it will |
| need to access a page above the level that was the root when it began its |
| descent (or more accurately, the level that was the root when it read the |
| meta-data page). In this case the stack it made while descending does not |
| help for finding the correct page. When this happens, we find the correct |
| place by re-descending the tree until we reach the level one above the |
| level we need to insert a link to, and then moving right as necessary. |
| (Typically this will take only two fetches, the meta-data page and the new |
| root, but in principle there could have been more than one root split |
| since we saw the root. We can identify the correct tree level by means of |
| the level numbers stored in each page. The situation is rare enough that |
| we do not need a more efficient solution.) |
| |
| Lehman and Yao must couple/chain locks as part of moving right when |
| relocating a child page's downlink during an ascent of the tree. This is |
| the only point where Lehman and Yao have to simultaneously hold three |
| locks (a lock on the child, the original parent, and the original parent's |
| right sibling). We don't need to couple internal page locks for pages on |
| the same level, though. We match a child's block number to a downlink |
| from a pivot tuple one level up, whereas Lehman and Yao match on the |
| separator key associated with the downlink that was followed during the |
| initial descent. We can release the lock on the original parent page |
| before acquiring a lock on its right sibling, since there is never any |
| need to deal with the case where the separator key that we must relocate |
| becomes the original parent's high key. Lanin and Shasha don't couple |
| locks here either, though they also don't couple locks between levels |
| during ascents. They are willing to "wait and try again" to avoid races. |
| Their algorithm is optimistic, which means that "an insertion holds no |
| more than one write lock at a time during its ascent". We more or less |
| stick with Lehman and Yao's approach of conservatively coupling parent and |
| child locks when ascending the tree, since it's far simpler. |
| |
| Lehman and Yao assume fixed-size keys, but we must deal with |
| variable-size keys. Therefore there is not a fixed maximum number of |
| keys per page; we just stuff in as many as will fit. When we split a |
| page, we try to equalize the number of bytes, not items, assigned to |
| pages (though suffix truncation is also considered). Note we must include |
| the incoming item in this calculation, otherwise it is possible to find |
| that the incoming item doesn't fit on the split page where it needs to go! |
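| |
| The byte-balancing idea reduces to something like the sketch below. The
| real logic in _bt_findsplitloc() also weighs suffix truncation and
| fillfactor; here item_sizes[] must already include the incoming item,
| per the note above:
| |
|     #include <stdlib.h>
| |
|     /* Return the offset such that items [0, offset) go to the left
|      * page, chosen to minimize the byte imbalance between halves. */
|     static int
|     choose_split_offset(const int *item_sizes, int n_items, int total_bytes)
|     {
|         int     left_bytes = 0;
|         int     best_offset = 1;
|         int     best_delta = total_bytes;
| |
|         for (int off = 1; off < n_items; off++)
|         {
|             left_bytes += item_sizes[off - 1];
|             int delta = abs(left_bytes - (total_bytes - left_bytes));
|             if (delta < best_delta)
|             {
|                 best_delta = delta;
|                 best_offset = off;
|             }
|         }
|         return best_offset;
|     }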
| |
| Deleting index tuples during VACUUM |
| ----------------------------------- |
| |
| Before deleting a leaf item, we get a full cleanup lock on the target |
| page, so that no other backend has a pin on the page when the deletion |
| starts. This is not necessary for correctness in terms of the btree index |
| operations themselves; as explained above, index scans logically stop |
| "between" pages and so can't lose their place. The reason we do it is to |
| provide an interlock between VACUUM and index scans that are not prepared |
| to deal with concurrent TID recycling when visiting the heap. Since only |
| VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since |
| this only ever happens _after_ btbulkdelete returns, having index scans |
| hold on to the pin (used when reading from the leaf page) until _after_ |
| they're done visiting the heap (for TIDs from the pinned leaf page) prevents
| concurrent TID recycling. VACUUM cannot get a conflicting cleanup lock |
| until the index scan is totally finished processing its leaf page. |
| |
| This approach is fairly coarse, so we avoid it whenever possible. In |
| practice most index scans won't hold onto their pin, and so won't block |
| VACUUM. These index scans must deal with TID recycling directly, which is |
| more complicated and not always possible. See later section on making |
| concurrent TID recycling safe. |
| |
| Opportunistic index tuple deletion performs almost the same page-level |
| modifications while only holding an exclusive lock. This is safe because |
| there is no question of TID recycling taking place later on -- only VACUUM |
| can make TIDs recyclable. See also simple deletion and bottom-up |
| deletion, below. |
| |
| Because a pin is not always held, and a page can be split even while |
| someone does hold a pin on it, it is possible that an indexscan will |
| return items that are no longer stored on the page it has a pin on, but |
| rather somewhere to the right of that page. To ensure that VACUUM can't |
| prematurely make TIDs recyclable in this scenario, we require btbulkdelete |
| to obtain a cleanup lock on every leaf page in the index, even pages that |
| don't contain any deletable tuples. Note that this requirement does not |
| say that btbulkdelete must visit the pages in any particular order. |
| |
| VACUUM's linear scan, concurrent page splits |
| -------------------------------------------- |
| |
| VACUUM accesses the index by doing a linear scan to search for deletable |
| TIDs, while considering the possibility of deleting empty pages in |
| passing. This is in physical/block order, not logical/keyspace order. |
| The tricky part of this is avoiding missing any deletable tuples in the |
| presence of concurrent page splits: a page split could easily move some |
| tuples from a page not yet passed over by the sequential scan to a |
| lower-numbered page already passed over. |
| |
| To implement this, we provide a "vacuum cycle ID" mechanism that makes it |
| possible to determine whether a page has been split since the current |
| btbulkdelete cycle started. If btbulkdelete finds a page that has been |
| split since it started, and has a right-link pointing to a lower page |
| number, then it temporarily suspends its sequential scan and visits that |
| page instead. It must continue to follow right-links and vacuum dead |
| tuples until reaching a page that either hasn't been split since |
| btbulkdelete started, or is above the location of the outer sequential |
| scan. Then it can resume the sequential scan. This ensures that all |
| tuples are visited. It may be that some tuples are visited twice, but |
| that has no worse effect than an inaccurate index tuple count (and we |
| can't guarantee an accurate count anyway in the face of concurrent |
| activity). Note that this still works if the has-been-recently-split test |
| has a small probability of false positives, so long as it never gives a |
| false negative. This makes it possible to implement the test with a small |
| counter value stored on each index page. |
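| |
| The resulting test can be sketched as follows, with invented stand-in
| types (in the real code the cycle ID is a small counter stamped on
| pages at split time, as described above):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef struct PageState
|     {
|         uint16_t    cycle_id;       /* stamped when the page was split */
|         uint32_t    right_link;     /* block number of right sibling */
|     } PageState;
| |
|     /* True if the page may have been split since this btbulkdelete
|      * began, moving tuples to a lower-numbered (already-passed) page.
|      * False positives are tolerable; false negatives are not. */
|     static bool
|     needs_backtrack(const PageState *page, uint16_t scan_cycle_id,
|                     uint32_t scan_position)
|     {
|         return page->cycle_id == scan_cycle_id &&
|                page->right_link < scan_position;
|     }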
| |
| Deleting entire pages during VACUUM |
| ----------------------------------- |
| |
| We consider deleting an entire page from the btree only when it's become |
| completely empty of items. (Merging partly-full pages would allow better |
| space reuse, but it seems impractical to move existing data items left or |
| right to make this happen --- a scan moving in the opposite direction |
| might miss the items if so.) Also, we *never* delete the rightmost page |
| on a tree level (this restriction simplifies the traversal algorithms, as |
| explained below). Page deletion always begins from an empty leaf page. An |
| internal page can only be deleted as part of deleting an entire subtree. |
| This is always a "skinny" subtree consisting of a "chain" of internal pages |
| plus a single leaf page. There is one page on each level of the subtree, |
| and each level/page covers the same key space. |
| |
| Deleting a leaf page is a two-stage process. In the first stage, the page |
| is unlinked from its parent, and marked as half-dead. The parent page must |
| be found using the same type of search as used to find the parent during an |
| insertion split. We lock the target and the parent pages, change the |
| target's downlink to point to the right sibling, and remove its old |
| downlink. This causes the target page's key space to effectively belong to |
| its right sibling. (Neither the left nor right sibling pages need to |
| change their "high key" if any; so there is no problem with possibly not |
| having enough space to replace a high key.) At the same time, we mark the |
| target page as half-dead, which causes any subsequent searches to ignore it |
| and move right (or left, in a backwards scan). This leaves the tree in a |
| similar state as during a page split: the page has no downlink pointing to |
| it, but it's still linked to its siblings. |
| |
| (Note: Lanin and Shasha prefer to make the key space move left, but their |
| argument for doing so hinges on not having left-links, which we have |
| anyway. So we simplify the algorithm by moving the key space right. This |
| is only possible because we don't match on a separator key when ascending |
| the tree during a page split, unlike Lehman and Yao/Lanin and Shasha -- it |
| doesn't matter if the downlink is re-found in a pivot tuple whose separator |
| key does not match the one encountered when the inserter initially
| descended the tree.)
| |
| To preserve consistency on the parent level, we cannot merge the key space |
| of a page into its right sibling unless the right sibling is a child of |
| the same parent --- otherwise, the parent's key space assignment changes |
| too, meaning we'd have to make bounding-key updates in its parent, and |
| perhaps all the way up the tree. Since we can't possibly do that |
| atomically, we forbid this case. That means that the rightmost child of a |
| parent node can't be deleted unless it's the only remaining child, in which |
| case we will delete the parent too (see below). |
| |
| In the second stage, the half-dead leaf page is unlinked from its siblings.
| We first lock the left sibling (if any) of the target, the target page |
| itself, and its right sibling (there must be one) in that order. Then we |
| update the side-links in the siblings, and mark the target page deleted. |
| |
| When we're about to delete the last remaining child of a parent page, things |
| are slightly more complicated. In the first stage, we leave the immediate |
| parent of the leaf page alone, and remove the downlink to the parent page |
| instead, from the grandparent. If it's the last child of the grandparent |
| too, we recurse up until we find a parent with more than one child, and |
| remove the downlink of that page. The leaf page is marked as half-dead, and |
| the block number of the page whose downlink was removed is stashed in the |
| half-dead leaf page. This leaves us with a chain of internal pages, with |
| one downlink each, leading to the half-dead leaf page, and no downlink |
| pointing to the topmost page in the chain. |
| |
| While we recurse up to find the topmost parent in the chain, we keep the |
| leaf page locked, but don't need to hold locks on the intermediate pages |
| between the leaf and the topmost parent -- insertions into upper tree levels |
| happen only as a result of splits of child pages, and that can't happen as |
| long as we're keeping the leaf locked. The internal pages in the chain |
| cannot acquire new children afterwards either, because the leaf page is |
| marked as half-dead and won't be split. |
| |
| Removing the downlink to the top of the to-be-deleted subtree/chain |
| effectively transfers the key space to the right sibling for all the |
| intermediate levels too, in one atomic operation. A concurrent search might |
| still visit the intermediate pages, but it will move right when it reaches |
| the half-dead page at the leaf level. In particular, the search will move to |
| the subtree to the right of the half-dead leaf page/to-be-deleted subtree, |
| since the half-dead leaf page's right sibling must be a "cousin" page, not a |
| "true" sibling page (or a second cousin page when the to-be-deleted chain |
| starts at leaf page's grandparent page, and so on). |
| |
| In the second stage, the topmost page in the chain is unlinked from its |
| siblings, and the half-dead leaf page is updated to point to the next page |
| down in the chain. This is repeated until there are no internal pages left |
| in the chain. Finally, the half-dead leaf page itself is unlinked from its |
| siblings. |
| |
| A deleted page cannot be recycled immediately, since there may be other |
| processes waiting to reference it (ie, search processes that just left the |
| parent, or scans moving right or left from one of the siblings). These |
| processes must be able to observe a deleted page for some time after the |
| deletion operation, in order to be able to at least recover from it (they |
| recover by moving right, as with concurrent page splits). Searchers never |
| have to worry about concurrent page recycling. |
| |
| See "Placing deleted pages in the FSM" section below for a description of |
| when and how deleted pages become safe for VACUUM to make recyclable. |
| |
| Page deletion and backwards scans |
| --------------------------------- |
| |
| Moving left in a backward scan is complicated because we must consider |
| the possibility that the left sibling was just split (meaning we must find |
| the rightmost page derived from the left sibling), plus the possibility |
| that the page we were just on has now been deleted and hence isn't in the |
| sibling chain at all anymore. So the move-left algorithm becomes: |
| |
| 0. Remember the page we are on as the "original page". |
| 1. Follow the original page's left-link (we're done if this is zero). |
| 2. If the current page is live and its right-link matches the "original |
| page", we are done. |
| 3. Otherwise, move right one or more times looking for a live page whose |
| right-link matches the "original page". If found, we are done. (In |
| principle we could scan all the way to the right end of the index, but |
| in practice it seems better to give up after a small number of tries. |
| It's unlikely the original page's sibling split more than a few times |
| while we were in flight to it; if we do not find a matching link in a |
| few tries, then most likely the original page is deleted.) |
| 4. Return to the "original page". If it is still live, return to step 1 |
| (we guessed wrong about it being deleted, and should restart with its |
| current left-link). If it is dead, move right until a non-dead page |
| is found (there must be one, since rightmost pages are never deleted), |
| mark that as the new "original page", and return to step 1. |
| |
| This algorithm is correct because the live page found by step 4 will have |
| the same left keyspace boundary as the page we started from. Therefore, |
| when we ultimately exit, it must be on a page whose right keyspace |
| boundary matches the left boundary of where we started --- which is what |
| we need to be sure we don't miss or re-scan any items. |
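| |
| The steps above translate into roughly the following C sketch (invented
| types, no locking; this is approximately what _bt_walk_left() has to
| implement):
| |
|     #include <stdbool.h>
|     #include <stddef.h>
| |
|     typedef struct Pg
|     {
|         struct Pg  *left;
|         struct Pg  *right;
|         bool        live;       /* false once half-dead or deleted */
|     } Pg;
| |
|     #define MAX_RIGHT_TRIES 4   /* give up after a few tries (step 3) */
| |
|     static Pg *
|     move_left(Pg *original)             /* step 0: remember the page */
|     {
|         for (;;)
|         {
|             Pg *p = original->left;     /* step 1: follow left-link */
| |
|             if (p == NULL)
|                 return NULL;            /* leftmost page; we are done */
|             for (int tries = 0; p != NULL && tries < MAX_RIGHT_TRIES;
|                  tries++)
|             {
|                 if (p->live && p->right == original)
|                     return p;           /* steps 2 and 3: match found */
|                 p = p->right;
|             }
|             if (!original->live)        /* step 4 */
|             {
|                 do
|                     original = original->right; /* rightmost never dies */
|                 while (!original->live);
|             }
|             /* else we guessed wrong; retry with the current left-link */
|         }
|     }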
| |
| Page deletion and tree height |
| ----------------------------- |
| |
| Because we never delete the rightmost page of any level (and in particular |
| never delete the root), it's impossible for the height of the tree to |
| decrease. After massive deletions we might have a scenario in which the |
| tree is "skinny", with several single-page levels below the root. |
| Operations will still be correct in this case, but we'd waste cycles |
| descending through the single-page levels. To handle this we use an idea |
| from Lanin and Shasha: we keep track of the "fast root" level, which is |
| the lowest single-page level. The meta-data page keeps a pointer to this |
| level as well as the true root. All ordinary operations initiate their |
| searches at the fast root not the true root. When we split a page that is |
| alone on its level or delete the next-to-last page on a level (both cases |
| are easily detected), we have to make sure that the fast root pointer is |
| adjusted appropriately. In the split case, we do this work as part of the |
| atomic update for the insertion into the parent level; in the delete case |
| as part of the atomic update for the delete (either way, the metapage has |
| to be the last page locked in the update to avoid deadlock risks). This |
| avoids race conditions if two such operations are executing concurrently. |
| |
| Placing deleted pages in the FSM |
| -------------------------------- |
| |
| Recycling a page is decoupled from page deletion. A deleted page can only |
| be put in the FSM to be recycled once there is no possible scan or search |
| that has a reference to it; until then, it must stay in place with its |
| sibling links undisturbed, as a tombstone that allows concurrent searches |
| to detect and then recover from concurrent deletions (which are rather |
| like concurrent page splits to searchers). This design is an |
| implementation of what Lanin and Shasha call "the drain technique". |
| |
| We implement the technique by waiting until all active snapshots and |
| registered snapshots as of the page deletion are gone, which is overly
| strong, but is simple to implement within Postgres. When marked fully |
| dead, a deleted page is labeled with the next-transaction counter value. |
| VACUUM can reclaim the page for re-use when the stored XID is guaranteed |
| to be "visible to everyone". As collateral damage, we wait for snapshots |
| taken until the next transaction to allocate an XID commits. We also wait |
| for running XIDs with no snapshots. |
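| |
| The recycle-safety test itself is simple, as in this sketch with
| invented types (roughly what BTPageIsRecyclable() checks against a
| visible-to-everyone horizon):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef uint64_t FullXid;
| |
|     typedef struct DeletedPage
|     {
|         bool    deleted;
|         FullXid safexid;    /* next-XID counter value at deletion time */
|     } DeletedPage;
| |
|     /* Safe to recycle once no scan started before the deletion can
|      * still be in flight, i.e. safexid is older than every backend's
|      * horizon. */
|     static bool
|     page_is_recyclable(const DeletedPage *page, FullXid everyone_horizon)
|     {
|         return page->deleted && page->safexid < everyone_horizon;
|     }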
| |
| Prior to PostgreSQL 14, VACUUM would only place _old_ deleted pages that |
| it encounters during its linear scan (pages deleted by a previous VACUUM |
| operation) in the FSM. Newly deleted pages were never placed in the FSM, |
| because that was assumed to _always_ be unsafe. That assumption was |
| unnecessarily pessimistic in practice, though -- it often doesn't take |
| very long for newly deleted pages to become safe to place in the FSM. |
| There is no truly principled way to predict when deleted pages will become |
| safe to place in the FSM for recycling -- it might become safe almost |
| immediately (long before the current VACUUM completes), or it might not |
| even be safe by the time the next VACUUM takes place. Recycle safety is |
| purely a question of maintaining the consistency (or at least the apparent |
| consistency) of a physical data structure. The state within the backend |
| running VACUUM is simply not relevant. |
| |
| PostgreSQL 14 added the ability for VACUUM to consider if it's possible to |
| recycle newly deleted pages at the end of the full index scan where the |
| page deletion took place. It is convenient to check if it's safe at that |
| point. This does require that VACUUM keep around a little bookkeeping |
| information about newly deleted pages, but that's very cheap. Using |
| in-memory state for this avoids the need to revisit newly deleted pages a |
| second time later on -- we can just use safexid values from the local |
| bookkeeping state to determine recycle safety in a deferred fashion. |
| |
| The need for additional FSM indirection after a page deletion operation |
| takes place is a natural consequence of the highly permissive rules for |
| index scans with Lehman and Yao's design. In general an index scan |
| doesn't have to hold a lock or even a pin on any page when it descends the |
| tree (nothing that you'd usually think of as an interlock is held "between |
| levels"). At the same time, index scans cannot be allowed to land on a |
| truly unrelated page due to concurrent recycling (not to be confused with |
| concurrent deletion), because that results in wrong answers to queries. |
| Simpler approaches to page deletion that don't need to defer recycling are |
| possible, but none seem compatible with Lehman and Yao's design. |
| |
| Placing an already-deleted page in the FSM to be recycled when needed |
| doesn't actually change the state of the page. The page will be changed |
| whenever it is subsequently taken from the FSM for reuse. The deleted |
| page's contents will be overwritten by the split operation (it will become |
| the new right sibling page). |
| |
| Making concurrent TID recycling safe |
| ------------------------------------ |
| |
| As explained in the earlier section about deleting index tuples during |
| VACUUM, we implement a locking protocol that allows individual index scans |
| to avoid concurrent TID recycling. Index scans opt-out (and so drop their |
| leaf page pin when visiting the heap) whenever it's safe to do so, though. |
| Dropping the pin early is useful because it avoids blocking progress by |
| VACUUM. This is particularly important with index scans used by cursors, |
| since idle cursors sometimes stop for relatively long periods of time. In |
| extreme cases, a client application may hold on to an idle cursor for
| hours or even days. Blocking VACUUM for that long could be disastrous. |
| |
| Index scans that don't hold on to a buffer pin are protected by holding an |
| MVCC snapshot instead. This more limited interlock prevents wrong answers |
| to queries, but it does not prevent concurrent TID recycling itself (only |
| holding onto the leaf page pin while accessing the heap ensures that). |
| |
| Index-only scans can never drop their buffer pin, since they are unable to |
| tolerate having a referenced TID become recyclable. Index-only scans |
| typically just visit the visibility map (not the heap proper), and so will |
| not reliably notice that any stale TID reference (for a TID that pointed |
| to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in |
| the heap by VACUUM. This could easily allow VACUUM to set the whole heap |
| page to all-visible in the visibility map immediately afterwards. An MVCC |
| snapshot is only sufficient to avoid problems during plain index scans |
| because they must access granular visibility information from the heap |
| proper. A plain index scan will even recognize LP_UNUSED items in the |
| heap (items that could be recycled but haven't been just yet) as "not |
| visible" -- even when the heap page is generally considered all-visible. |
| |
| LP_DEAD setting of index tuples by the kill_prior_tuple optimization |
| (described in full in simple deletion, below) is also more complicated for |
| index scans that drop their leaf page pins. We must be careful to avoid |
| LP_DEAD-marking any new index tuple that looks like a known-dead index |
| tuple because it happens to share the same TID, following concurrent TID |
| recycling. It's just about possible that some other session inserted a |
| new, unrelated index tuple, on the same leaf page, which has the same |
| original TID. It would be totally wrong to LP_DEAD-set this new, |
| unrelated index tuple. |
| |
| We handle this kill_prior_tuple race condition by having affected index |
| scans conservatively assume that any change to the leaf page at all |
| implies that it was reached by btbulkdelete in the interim period when no |
| buffer pin was held. This is implemented by not setting any LP_DEAD bits |
| on the leaf page at all when the page's LSN has changed. (That won't work |
| with an unlogged index, so for now we don't ever apply the "don't hold |
| onto pin" optimization there.) |
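| |
| Sketched with invented names, the rule looks like this (the scan
| remembers the leaf page's LSN when reading it, and compares on return
| from the heap):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef uint64_t Lsn;
| |
|     static bool
|     can_set_lp_dead_bits(Lsn lsn_when_read, Lsn lsn_now, bool pin_held)
|     {
|         if (pin_held)
|             return true;    /* pin blocked VACUUM; no TID was recycled */
|         /* without a pin, any change at all might have been btbulkdelete
|          * followed by TID recycling, so insist on an identical LSN */
|         return lsn_now == lsn_when_read;
|     }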
| |
| Fastpath For Index Insertion |
| ---------------------------- |
| |
| We optimize for a common case of insertion of increasing index key |
| values by caching the last page to which this backend inserted the last |
| value, if this page was the rightmost leaf page. For the next insert, we |
| can then quickly check if the cached page is still the rightmost leaf |
| page and also the correct place to hold the current value. We can avoid |
| the cost of walking down the tree in such common cases. |
| |
| The optimization works on the assumption that there can only be one |
| non-ignorable leaf rightmost page, and so not even a visible-to-everyone |
| style interlock is required. We cannot fail to detect that our hint was |
| invalidated, because there can only be one such page in the B-Tree at |
| any time. It's possible that the page will be deleted and recycled |
| without a backend's cached page also being detected as invalidated, but |
| only when we happen to recycle a block that once again gets used as the
| rightmost leaf page. |
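| |
| The hint validation amounts to something like the following sketch
| (simplified types and tests; the real code re-checks everything after
| acquiring a write lock on the cached block):
| |
|     #include <stdbool.h>
|     #include <stddef.h>
| |
|     typedef struct CachedLeaf
|     {
|         bool    is_leaf;
|         bool    is_rightmost;
|         bool    has_room;       /* enough free space for the new tuple */
|         int     first_key;      /* lowest key currently on the page */
|     } CachedLeaf;
| |
|     /* Use the cached block only if it is still the rightmost leaf page,
|      * has room, and the new key sorts after its existing keys;
|      * otherwise descend from the (fast) root as usual. */
|     static bool
|     fastpath_applies(const CachedLeaf *cached, int new_key)
|     {
|         return cached != NULL &&
|                cached->is_leaf &&
|                cached->is_rightmost &&
|                cached->has_room &&
|                new_key > cached->first_key;
|     }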
| |
| Simple deletion |
| --------------- |
| |
| If a process visits a heap tuple and finds that it's dead and removable |
| (ie, dead to all open transactions, not only that process), then we can |
| return to the index and mark the corresponding index entry "known dead", |
| allowing subsequent index scans to skip visiting the heap tuple. The |
| "known dead" marking works by setting the index item's lp_flags state |
| to LP_DEAD. This is currently only done in plain indexscans, not bitmap |
| scans, because only plain scans visit the heap and index "in sync" and so |
| there's not a convenient way to do it for bitmap scans. Note also that |
| LP_DEAD bits are often set when checking a unique index for conflicts on |
| insert (this is simpler because it takes place when we hold an exclusive |
| lock on the leaf page). |
| |
| Once an index tuple has been marked LP_DEAD it can actually be deleted |
| from the index immediately; since index scans only stop "between" pages, |
| no scan can lose its place from such a deletion. We separate the steps |
| because we allow LP_DEAD to be set with only a share lock (it's like a |
| hint bit for a heap tuple), but physically deleting tuples requires an |
| exclusive lock. We also need to generate a snapshotConflictHorizon for |
| each deletion operation's WAL record, which requires additional |
| coordination with the tableam when the deletion actually takes place.
| (snapshotConflictHorizon value may be used to generate a conflict during |
| subsequent REDO of the record by a standby.) |
| |
| Delaying and batching index tuple deletion like this enables a further |
| optimization: opportunistic checking of "extra" nearby index tuples |
| (tuples that are not LP_DEAD-set) when they happen to be very cheap to |
| check in passing (because we already know that the tableam will be |
| visiting their table block to generate a snapshotConflictHorizon). Any |
| index tuples that turn out to be safe to delete will also be deleted. |
| Simple deletion will behave as if the extra tuples that actually turn |
| out to be delete-safe had their LP_DEAD bits set right from the start. |
| |
| Deduplication can also prevent a page split, but index tuple deletion is |
| our preferred approach. Note that posting list tuples can only have |
| their LP_DEAD bit set when every table TID within the posting list is |
| known dead. This isn't much of a problem in practice because LP_DEAD |
| bits are just a starting point for deletion. What really matters is |
| that _some_ deletion operation that targets related nearby-in-table TIDs |
| takes place at some point before the page finally splits. That's all |
| that's required for the deletion process to perform granular removal of |
| groups of dead TIDs from posting list tuples (without the situation ever |
| being allowed to get out of hand). |
| |
| Bottom-Up deletion |
| ------------------ |
| |
| We attempt to delete whatever duplicates happen to be present on the page |
| when the duplicates are suspected to be caused by version churn from |
| successive UPDATEs. This only happens when we receive an executor hint |
| indicating that optimizations like heapam's HOT have not worked out for |
| the index -- the incoming tuple must be a logically unchanged duplicate |
| which is needed for MVCC purposes, suggesting that that might well be the |
| dominant source of new index tuples on the leaf page in question. (Also, |
| bottom-up deletion is triggered within unique indexes in cases with |
| continual INSERT and DELETE related churn, since that is easy to detect |
| without any external hint.) |
| |
| Simple deletion will already have failed to prevent a page split when a |
| bottom-up deletion pass takes place (often because no LP_DEAD bits were |
| ever set on the page). The two mechanisms have closely related |
| implementations. The same WAL records are used for each operation, and |
| the same tableam infrastructure is used to determine what TIDs/tuples are |
| actually safe to delete. The implementations only differ in how they pick |
| TIDs to consider for deletion, and whether or not the tableam will give up |
| before accessing all table blocks (bottom-up deletion lives with the |
| uncertainty of its success by keeping the cost of failure low). Even |
| still, the two mechanisms are clearly distinct at the conceptual level. |
| |
| Bottom-up index deletion is driven entirely by heuristics (whereas simple |
| deletion is guaranteed to delete at least those index tuples that are |
| already LP_DEAD marked -- there must be at least one). We have no |
| certainty that we'll find even one index tuple to delete. That's why we |
| closely cooperate with the tableam to keep the costs it pays in balance |
| with the benefits we receive. The interface that we use for this is |
| described in detail in access/tableam.h. |
| |
| Bottom-up index deletion can be thought of as a backstop mechanism against |
| unnecessary version-driven page splits. It is based in part on an idea |
| from generational garbage collection: the "generational hypothesis". This |
| is the empirical observation that "most objects die young". Within |
| nbtree, new index tuples often quickly appear in the same place, and then |
| quickly become garbage. There can be intense concentrations of garbage in |
| relatively few leaf pages with certain workloads (or there could be in |
| earlier versions of PostgreSQL without bottom-up index deletion, at |
| least). See doc/src/sgml/btree.sgml for a high-level description of the |
| design principles behind bottom-up index deletion in nbtree, including |
| details of how it complements VACUUM. |
| |
| We expect to find a reasonably large number of tuples that are safe to |
| delete within each bottom-up pass. If we don't then we won't need to |
| consider the question of bottom-up deletion for the same leaf page for |
| quite a while (usually because the page splits, which resolves the |
| situation for the time being). We expect to perform regular bottom-up |
| deletion operations against pages that are at constant risk of unnecessary |
| page splits caused only by version churn. When the mechanism works well |
| we'll constantly be "on the verge" of having version-churn-driven page |
| splits, but never actually have even one. |
| |
| Our duplicate heuristics work well despite being fairly simple. |
| Unnecessary page splits only occur when there are truly pathological |
| levels of version churn (in theory a small amount of version churn could |
| make a page split occur earlier than strictly necessary, but that's pretty |
| harmless). We don't have to understand the underlying workload; we only |
| have to understand the general nature of the pathology that we target. |
| Version churn is easy to spot when it is truly pathological. Affected |
| leaf pages are fairly homogeneous. |
| |
| WAL Considerations |
| ------------------ |
| |
| The insertion and deletion algorithms in themselves don't guarantee btree |
| consistency after a crash. To provide robustness, we depend on WAL |
| replay. A single WAL entry is effectively an atomic action --- we can |
| redo it from the log if it fails to complete. |
| |
| Ordinary item insertions (that don't force a page split) are of course |
| single WAL entries, since they only affect one page. The same for |
| leaf-item deletions (if the deletion brings the leaf page to zero items, |
| it is now a candidate to be deleted, but that is a separate action). |
| |
| An insertion that causes a page split is logged as a single WAL entry for |
| the changes occurring on the insertion's level --- including update of the |
| right sibling's left-link --- followed by a second WAL entry for the |
| insertion on the parent level (which might itself be a page split, requiring |
| an additional insertion above that, etc). |
| |
| For a root split, the follow-on WAL entry is a "new root" entry rather than |
| an "insertion" entry, but details are otherwise much the same. |
| |
| Because splitting involves multiple atomic actions, it's possible that the |
| system crashes between splitting a page and inserting the downlink for the |
| new half to the parent. After recovery, the downlink for the new page will |
| be missing. The search algorithm works correctly, as the page will be found |
| by following the right-link from its left sibling, although if a lot of |
| downlinks in the tree are missing, performance will suffer. A more serious |
| consequence is that if the page without a downlink gets split again, the |
| insertion algorithm will fail to find the location in the parent level to |
| insert the downlink. |
| |
| Our approach is to create any missing downlinks on-the-fly, when searching |
| the tree for a new insertion. It could be done during searches, too, but |
| it seems best not to put any extra updates in what would otherwise be a |
| read-only operation (updating is not possible in hot standby mode anyway). |
| It would seem natural to add the missing downlinks in VACUUM, but since |
| inserting a downlink might require splitting a page, it might fail if you |
| run out of disk space. That would be bad during VACUUM - the reason for |
| running VACUUM in the first place might be that you run out of disk space, |
| and now VACUUM won't finish because you're out of disk space. In contrast, |
| an insertion can require enlarging the physical file anyway. There is one |
| minor exception: VACUUM finishes interrupted splits of internal pages when |
| deleting their children. This allows the code for re-finding parent items |
| to be used by both page splits and page deletion. |
| |
| To identify missing downlinks, when a page is split, the left page is |
| flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT). |
| When the downlink is inserted to the parent, the flag is cleared atomically |
| with the insertion. The child page is kept locked until the insertion in |
| the parent is finished and the flag in the child cleared, but can be |
| released immediately after that, before recursing up the tree if the parent |
| also needs to be split. This ensures that incompletely split pages should |
| not be seen under normal circumstances; only if insertion to the parent |
| has failed for some reason. (It's also possible for a reader to observe |
| a page with the incomplete split flag set during recovery; see later |
| section on "Scans during Recovery" for details.) |
| |
| We flag the left page, even though it's the right page that's missing the |
| downlink, because it's more convenient to know already when following the |
| right-link from the left page to the right page that it will need to have |
| its downlink inserted to the parent. |
| |
| When splitting a non-root page that is alone on its level, the required |
| metapage update (of the "fast root" link) is performed and logged as part |
| of the insertion into the parent level. When splitting the root page, the |
| metapage update is handled as part of the "new root" action. |
| |
| Each step in page deletion is logged as a separate WAL entry: marking the |
| leaf as half-dead and removing the downlink is one record, and unlinking a |
| page is a second record. If vacuum is interrupted for some reason, or the |
| system crashes, the tree is consistent for searches and insertions. The |
| next VACUUM will find the half-dead leaf page and continue the deletion. |
| |
| Before 9.4, we used to keep track of incomplete splits and page deletions |
| during recovery and finish them immediately at end of recovery, instead of |
| doing it lazily at the next insertion or vacuum. However, that made the |
| recovery much more complicated, and only fixed the problem when crash |
| recovery was performed. An incomplete split can also occur if an otherwise |
| recoverable error, like out-of-memory or out-of-disk-space, happens while |
| inserting the downlink to the parent. |
| |
| Scans during Recovery |
| --------------------- |
| |
| nbtree indexes support read queries in Hot Standby mode. Every atomic |
| action/WAL record makes isolated changes that leave the tree in a |
| consistent state for readers. Readers lock pages according to the same |
| rules that readers follow on the primary. (Readers may have to move |
| right to recover from a "concurrent" page split or page deletion, just |
| like on the primary.) |
| |
| However, there are a couple of differences in how pages are locked by |
| replay/the startup process as compared to the original write operation |
| on the primary. The exceptions involve page splits and page deletions. |
| The first phase and second phase of a page split are processed |
| independently during replay, since they are independent atomic actions. |
| We do not attempt to recreate the coupling of parent and child page |
| write locks that took place on the primary. This is safe because readers |
| never care about the incomplete split flag anyway. Holding on to an |
| extra write lock on the primary is only necessary so that a second |
| writer cannot observe the incomplete split flag before the first writer |
| finishes the split. If we let concurrent writers on the primary observe |
| an incomplete split flag on the same page, each writer would attempt to |
| complete the unfinished split, corrupting the parent page. (Similarly, |
| replay of page deletion records does not hold a write lock on the target |
| leaf page throughout; only the primary needs to block out concurrent |
| writers that insert on to the page being deleted.) |
| |
| WAL replay holds same-level locks in a way that matches the approach |
| taken during original execution, though. This prevents readers from
| observing same-level inconsistencies. It's probably possible to be more |
| lax about how same-level locks are acquired during recovery (most kinds |
| of readers could still move right to recover if we didn't couple |
| same-level locks), but we prefer to be conservative here. |
| |
| During recovery all index scans start with ignore_killed_tuples = false |
| and we never set kill_prior_tuple. We do this because the oldest xmin |
| on the standby server can be older than the oldest xmin on the primary |
| server, which means tuples can be marked LP_DEAD even when they are |
| still visible on the standby. We don't WAL log tuple LP_DEAD bits, but |
| they can still appear in the standby because of full page writes. So |
| we must always ignore them in standby, and that means it's not worth |
| setting them either. (When LP_DEAD-marked tuples are eventually deleted |
| on the primary, the deletion is WAL-logged. Queries that run on a |
| standby therefore get much of the benefit of any LP_DEAD setting that |
| takes place on the primary.) |
| |
| Note that we talk about scans that are started during recovery. We go to |
| a little trouble to allow a scan to start during recovery and end during |
| normal running after recovery has completed. This is a key capability |
| because it allows running applications to continue while the standby |
| changes state into a normally running server. |
| |
| The interlocking required to avoid returning incorrect results from |
| non-MVCC scans is not required on standby nodes. We still get a full |
| cleanup lock when replaying VACUUM records during recovery, but recovery |
| does not need to lock every leaf page (only those leaf pages that have |
| items to delete) -- that's sufficient to avoid breaking index-only scans |
| during recovery (see section above about making TID recycling safe). That |
| leaves concern only for plain index scans. (XXX: Not actually clear why |
| this is totally unnecessary during recovery.) |
| |
| MVCC snapshot plain index scans are always safe, for the same reasons that |
| they're safe during original execution. HeapTupleSatisfiesToast() doesn't |
| use MVCC semantics, though that's because it doesn't need to - if the main |
| heap row is visible then the toast rows will also be visible. So as long |
| as we follow a toast pointer from a visible (live) tuple the corresponding |
| toast rows will also be visible, so we do not need to recheck MVCC on |
| them. |
| |
| Other Things That Are Handy to Know |
| ----------------------------------- |
| |
| Page zero of every btree is a meta-data page. This page stores the |
| location of the root page --- both the true root and the current effective |
| root ("fast" root). To avoid fetching the metapage for every single index |
| search, we cache a copy of the meta-data information in the index's |
| relcache entry (rd_amcache). This is a bit ticklish since using the cache |
| implies following a root page pointer that could be stale. However, a |
| backend following a cached pointer can sufficiently verify whether it |
| reached the intended page; either by checking the is-root flag when it |
| is going to the true root, or by checking that the page has no siblings |
| when going to the fast root. At worst, this could result in descending |
| some extra tree levels if we have a cached pointer to a fast root that is |
| now above the real fast root. Such cases shouldn't arise often enough to |
| be worth optimizing; and in any case we can expect a relcache flush will |
| discard the cached metapage before long, since a VACUUM that's moved the |
| fast root pointer can be expected to issue a statistics update for the |
| index. |
| |
| The algorithm assumes we can fit at least three items per page |
| (a "high key" and two real data items). Therefore it's unsafe |
| to accept items larger than 1/3rd page size. Larger items would |
| work sometimes, but could cause failures later on depending on |
| what else gets put on their page. |
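| |
| As rough arithmetic, with assumed sizes for illustration (the default
| 8192-byte page, a 24-byte page header, a 4-byte line pointer per item;
| the real ceiling is the BTMaxItemSize macro, which is somewhat lower):
| |
|     #include <stdio.h>
| |
|     int
|     main(void)
|     {
|         int page_size = 8192;       /* default BLCKSZ */
|         int page_header = 24;       /* assumed page header size */
|         int line_pointer = 4;       /* assumed per-item overhead */
|         int usable = page_size - page_header;
|         int max_item = usable / 3 - line_pointer;
| |
|         /* prints roughly 2718; anything bigger risks the 3-item rule */
|         printf("approx max index tuple: %d bytes\n", max_item);
|         return 0;
|     }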
| |
| "ScanKey" data structures are used in two fundamentally different ways |
| in this code, which we describe as "search" scankeys and "insertion" |
| scankeys. A search scankey is the kind passed to btbeginscan() or |
| btrescan() from outside the btree code. The sk_func pointers in a search |
| scankey point to comparison functions that return boolean, such as int4lt. |
| There might be more than one scankey entry for a given index column, or |
| none at all. (We require the keys to appear in index column order, but |
| the order of multiple keys for a given column is unspecified.) An |
| insertion scankey ("BTScanInsert" data structure) uses a similar |
| array-of-ScanKey data structure, but the sk_func pointers point to btree |
| comparison support functions (ie, 3-way comparators that return int4 values |
| interpreted as <0, =0, >0). In an insertion scankey there is at most one |
| entry per index column. There is also other data about the rules used to |
| locate where to begin the scan, such as whether or not the scan is a |
| "nextkey" scan. Insertion scankeys are built within the btree code (eg, by |
| _bt_mkscankey()) and are used to locate the starting point of a scan, as |
| well as for locating the place to insert a new index tuple. (Note: in the |
| case of an insertion scankey built from a search scankey or built from a |
| truncated pivot tuple, there might be fewer keys than index columns, |
| indicating that we have no constraints for the remaining index columns.) |
| After we have located the starting point of a scan, the original search |
| scankey is consulted as each index entry is sequentially scanned to decide |
| whether to return the entry and whether the scan can stop (see |
| _bt_checkkeys()). |
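| |
| The two styles differ as in this sketch (standalone int examples in the
| spirit of int4lt and btint4cmp, not the actual catalog functions):
| |
|     #include <stdbool.h>
| |
|     /* "search" style: boolean operator, like int4lt */
|     static bool
|     int_lt(int a, int b)
|     {
|         return a < b;
|     }
| |
|     /* "insertion" style: 3-way comparison support function, like
|      * btint4cmp; <0, =0, >0 results drive the binary search */
|     static int
|     int_cmp(int a, int b)
|     {
|         return (a < b) ? -1 : (a > b) ? 1 : 0;
|     }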
| |
| Notes about suffix truncation |
| ----------------------------- |
| |
| We truncate away suffix key attributes that are not needed for a page high |
| key during a leaf page split. The remaining attributes must distinguish |
| the last index tuple on the post-split left page as belonging on the left |
| page, and the first index tuple on the post-split right page as belonging |
| on the right page. Tuples logically retain truncated key attributes, |
| though they implicitly have "negative infinity" as their value, and have no |
| storage overhead. Since the high key is subsequently reused as the |
| downlink in the parent page for the new right page, suffix truncation makes |
| pivot tuples short. INCLUDE indexes are guaranteed to have non-key |
| attributes truncated at the time of a leaf page split, but may also have |
| some key attributes truncated away, based on the usual criteria for key |
| attributes. They are not a special case, since non-key attributes are |
| merely payload to B-Tree searches. |
| |
| The goal of suffix truncation of key attributes is to improve index |
| fan-out. The technique was first described by Bayer and Unterauer (R.Bayer |
| and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol |
| 2, No. 1, March 1977, pp 11-26). The Postgres implementation is loosely |
| based on their paper. Note that Postgres only implements what the paper |
| refers to as simple prefix B-Trees. Note also that the paper assumes that |
| the tree has keys that consist of single strings that maintain the "prefix |
| property", much like strings that are stored in a suffix tree (comparisons |
| of earlier bytes must always be more significant than comparisons of later |
| bytes, and, in general, the strings must compare in a way that doesn't |
| break transitive consistency as they're split into pieces). Suffix |
| truncation in Postgres currently only works at the whole-attribute |
| granularity, but it would be straightforward to invent opclass |
| infrastructure that manufactures a smaller attribute value in the case of |
| variable-length types, such as text. An opclass support function could |
| manufacture the shortest possible key value that still correctly separates |
| each half of a leaf page split. |
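| |
| To illustrate, a hypothetical support function for byte-wise-ordered
| strings might return the length of the shortest prefix of the right
| item that still sorts strictly above the left item (a sketch only; no
| such infrastructure exists today):
| |
|     #include <stddef.h>
| |
|     /* Assumes memcmp-style ordering and that left < right. */
|     static size_t
|     shortest_separator_len(const char *left, size_t left_len,
|                            const char *right, size_t right_len)
|     {
|         size_t  max = (left_len < right_len) ? left_len : right_len;
| |
|         for (size_t i = 0; i < max; i++)
|         {
|             if (left[i] != right[i])
|                 return i + 1;   /* first differing byte settles it */
|         }
|         /* left is a prefix of right; one more byte is needed */
|         return max + 1;
|     }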
| |
| There are sophisticated criteria for choosing a leaf page split point. The
| general idea is to make suffix truncation effective without unduly |
| influencing the balance of space for each half of the page split. The |
| choice of leaf split point can be thought of as a choice among points |
| *between* items on the page to be split, at least if you pretend that the |
| incoming tuple was placed on the page already (you have to pretend because |
| there won't actually be enough space for it on the page). Choosing the |
| split point between two index tuples where the first non-equal attribute |
| appears as early as possible results in truncating away as many suffix |
| attributes as possible. Evenly balancing space among each half of the |
| split is usually the first concern, but even small adjustments in the |
| precise split point can allow truncation to be far more effective. |
| |
| Suffix truncation is primarily valuable because it makes pivot tuples |
| smaller, which delays splits of internal pages, but that isn't the only |
| reason why it's effective. Even truncation that doesn't make pivot tuples |
| smaller due to alignment still prevents pivot tuples from being more |
| restrictive than truly necessary in how they describe which values belong |
| on which pages. |
| |
| While it's not possible to correctly perform suffix truncation during |
| internal page splits, it's still useful to be discriminating when splitting |
| an internal page. The split point that implies a downlink be inserted in |
| the parent that's the smallest one available within an acceptable range of |
| the fillfactor-wise optimal split point is chosen. This idea also comes |
| from the Prefix B-Tree paper. This process has much in common with what |
| happens at the leaf level to make suffix truncation effective. The overall |
| effect is that suffix truncation tends to produce smaller, more |
| discriminating pivot tuples, especially early in the lifetime of the index, |
| while biasing internal page splits makes the earlier, smaller pivot tuples |
| end up in the root page, delaying root page splits. |
| |
| Logical duplicates are given special consideration. The logic for |
| selecting a split point goes to great lengths to avoid having duplicates |
| span more than one page, and almost always manages to pick a split point |
| between two user-key-distinct tuples, accepting a completely lopsided split |
| if it must. When a page that's already full of duplicates must be split, |
| the fallback strategy assumes that duplicates are mostly inserted in |
| ascending heap TID order. The page is split in a way that leaves the left |
| half of the page mostly full, and the right half of the page mostly empty. |
| The overall effect is that leaf page splits gracefully adapt to inserts of |
| large groups of duplicates, maximizing space utilization. Note also that |
| "trapping" large groups of duplicates on the same leaf page like this makes |
| deduplication more efficient. Deduplication can be performed infrequently, |
| without merging together existing posting list tuples too often. |
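| 
| A simplified picture of the fallback (the real single-value mode is in
| nbtsplitloc.c; the 96% figure corresponds to what nbtree.h calls
| BTREE_SINGLEVAL_FILLFACTOR):
| 
|     /*
|      * Illustrative helper, not actual nbtree code: when every tuple on
|      * the page shares the same user key, put ~96% of the items on the
|      * left half, betting that future duplicates arrive in ascending
|      * heap TID order and will therefore land on the new right sibling.
|      */
|     static int
|     single_value_split_offset(int nitems)
|     {
|         return (int) (nitems * 0.96);   /* split near the page's end */
|     }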
| |
| Notes about deduplication |
| ------------------------- |
| |
| We deduplicate non-pivot tuples in non-unique indexes to reduce storage |
| overhead, and to avoid (or at least delay) page splits. Note that the |
| goals for deduplication in unique indexes are rather different; see the
| "Deduplication in unique indexes" section for details. Deduplication
| alters the physical representation of
| tuples without changing the logical contents of the index, and without |
| adding overhead to read queries. Non-pivot tuples are merged together |
| into a single physical tuple with a posting list (a simple array of heap |
| TIDs with the standard item pointer format). Deduplication is always |
| applied lazily, at the point where it would otherwise be necessary to |
| perform a page split. It occurs only after any LP_DEAD items have been
| removed, as our last line of defense against splitting a leaf page
| (bottom-up index deletion may be attempted first, as our second-to-last
| line of defense). We can set the LP_DEAD bit with posting list tuples,
| though only when all TIDs are known dead.
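| 
| The physical merge itself is conceptually simple (see _bt_dedup_pass()
| and _bt_form_posting() in nbtdedup.c for the real thing; this is only a
| sketch with a stand-in TID type):
| 
|     #include <string.h>
| 
|     typedef unsigned long Tid;  /* stand-in for ItemPointerData */
| 
|     /*
|      * Append the (already sorted) heap TIDs of another duplicate tuple
|      * to the posting list being built; the caller ensures the array has
|      * room.  Tuples are processed in TID order, so no re-sort is needed.
|      * Returns the new TID count.
|      */
|     static int
|     append_to_posting(Tid *posting, int nposting,
|                       const Tid *newtids, int nnew)
|     {
|         memcpy(&posting[nposting], newtids, nnew * sizeof(Tid));
|         return nposting + nnew;
|     }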
| |
| Our lazy approach to deduplication allows the page space accounting used |
| during page splits to have absolutely minimal special case logic for |
| posting lists. Posting lists can be thought of as extra payload that |
| suffix truncation will reliably truncate away as needed during page |
| splits, just like non-key columns from an INCLUDE index tuple. |
| Incoming/new tuples can generally be treated as non-overlapping plain |
| items (though see section on posting list splits for information about how |
| overlapping new/incoming items are really handled). |
| |
| The representation of posting lists is almost identical to the posting |
| lists used by GIN, so it would be straightforward to apply GIN's varbyte |
| encoding compression scheme to individual posting lists. Posting list |
| compression would break the assumptions made by posting list splits about |
| page space accounting (see later section), so it's not clear how |
| compression could be integrated with nbtree. Besides, posting list |
| compression does not offer a compelling trade-off for nbtree, since in |
| general nbtree is optimized for consistent performance with many |
| concurrent readers and writers. Compression would also make the deletion |
| of a subset of TIDs from a posting list slow and complicated, which would |
| be a big problem for workloads that depend heavily on bottom-up index |
| deletion. |
| |
| A major goal of our lazy approach to deduplication is to limit the |
| performance impact of deduplication with random updates. Even concurrent |
| append-only inserts of the same key value will tend to have inserts of |
| individual index tuples in an order that doesn't quite match heap TID |
| order. Delaying deduplication minimizes page level fragmentation. |
| |
| Deduplication in unique indexes |
| ------------------------------- |
| |
| Very often, the number of distinct values that can ever be placed on |
| almost any given leaf page in a unique index is fixed and permanent. For |
| example, a primary key on an identity column will usually only have leaf |
| page splits caused by the insertion of new logical rows within the |
| rightmost leaf page. If there is a split of a non-rightmost leaf page, |
| then the split must have been triggered by inserts associated with UPDATEs |
| of existing logical rows. Splitting a leaf page purely to store multiple |
| versions is a false economy. In effect, we're permanently degrading the |
| index structure just to absorb a temporary burst of duplicates. |
| |
| Deduplication in unique indexes helps to prevent these pathological page |
| splits. Storing duplicates in a space efficient manner is not the goal, |
| since in the long run there won't be any duplicates anyway. Rather, we're |
| buying time for standard garbage collection mechanisms to run before a |
| page split is needed. |
| |
| Unique index leaf pages only get a deduplication pass when an insertion |
| (that might have to split the page) observed an existing duplicate on the |
| page in passing. This is based on the assumption that deduplication will |
| only work out when _all_ new insertions are duplicates from UPDATEs. This |
| may mean that we miss an opportunity to delay a page split, but that's |
| okay because our ultimate goal is to delay leaf page splits _indefinitely_ |
| (i.e. to prevent them altogether). There is little point in trying to |
| delay a split that is probably inevitable anyway. This allows us to avoid |
| the overhead of attempting to deduplicate with unique indexes that always |
| have few or no duplicates. |
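| 
| The triggering condition amounts to a cheap check made during the
| insertion's descent (a sketch with made-up names; the real control flow
| is in nbtinsert.c):
| 
|     #include <stdbool.h>
| 
|     /*
|      * Hypothetical helper: is a deduplication pass worth attempting
|      * before splitting this leaf page?
|      */
|     static bool
|     should_attempt_dedup(bool is_unique, bool saw_duplicate)
|     {
|         if (!is_unique)
|             return true;        /* non-unique index: always worth a try */
|         return saw_duplicate;   /* unique index: only after a duplicate */
|     }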
| |
| Note: Avoiding "unnecessary" page splits driven by version churn is also |
| the goal of bottom-up index deletion, which was added to PostgreSQL 14. |
| Bottom-up index deletion is now the preferred way to deal with this |
| problem (with all kinds of indexes, though especially with unique |
| indexes). Still, deduplication can sometimes augment bottom-up index |
| deletion. When deletion cannot free tuples (due to an old snapshot |
| holding up cleanup), falling back on deduplication provides additional |
| capacity. Delaying the page split by deduplicating can allow a future |
| bottom-up deletion pass of the same page to succeed. |
| |
| Posting list splits |
| ------------------- |
| |
| When the incoming tuple happens to overlap with an existing posting list, |
| a posting list split is performed. Like a page split, a posting list |
| split resolves a situation where a new/incoming item "won't fit", while |
| inserting the incoming item in passing (i.e. as part of the same atomic |
| action). It's possible (though not particularly likely) that an insert of |
| a new item on to an almost-full page will overlap with a posting list, |
| resulting in both a posting list split and a page split. Even then, the |
| atomic action that splits the posting list also inserts the new item |
| (since page splits always insert the new item in passing). Including the |
| posting list split in the same atomic action as the insert avoids problems |
| caused by concurrent inserts into the same posting list -- the exact |
| details of how we change the posting list depend upon the new item, and |
| vice-versa. A single atomic action also minimizes the volume of extra |
| WAL required for a posting list split, since we don't have to explicitly |
| WAL-log the original posting list tuple. |
| |
| Despite piggy-backing on the same atomic action that inserts a new tuple, |
| posting list splits can be thought of as an extra action, separate from
| the insert itself (or from the page split itself). Posting list splits
| conceptually "rewrite" an insert that overlaps with an existing posting |
| list into an insert that adds its final new item just to the right of the |
| posting list instead. The size of the posting list won't change, and so |
| page space accounting code does not need to care about posting list splits |
| at all. This is an important upside of our design; the page split point |
| choice logic is very subtle even without it needing to deal with posting |
| list splits. |
| |
| Only a few isolated extra steps are required to preserve the illusion that |
| the new item never overlapped with an existing posting list in the first |
| place: the incoming tuple has its heap TID swapped with the
| rightmost/max heap TID from the existing/originally overlapping posting
| list. Similarly, the original incoming item's TID is relocated to the
| appropriate offset in the posting list (we usually shift TIDs out of the |
| way to make a hole for it). Finally, the posting-split-with-page-split |
| case must generate a new high key based on an imaginary version of the |
| original page that has both the final new item and the after-list-split |
| posting tuple (page splits usually just operate against an imaginary |
| version that contains the new item/item that won't fit). |
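| 
| The TID shuffle can be sketched as follows (see _bt_swap_posting() in
| nbtdedup.c for the real implementation; the types here are stand-ins):
| 
|     #include <string.h>
| 
|     typedef unsigned long Tid;  /* stand-in for ItemPointerData */
| 
|     /*
|      * Insert newtid into the posting list at postingoff, shifting later
|      * TIDs right.  The displaced rightmost/max TID is returned; the
|      * caller gives it to the final new item, which is then inserted
|      * just to the right of the posting list as usual.
|      */
|     static Tid
|     posting_list_split(Tid *posting, int nposting, int postingoff,
|                        Tid newtid)
|     {
|         Tid     maxtid = posting[nposting - 1];
| 
|         memmove(&posting[postingoff + 1], &posting[postingoff],
|                 (nposting - 1 - postingoff) * sizeof(Tid));
|         posting[postingoff] = newtid;
|         return maxtid;
|     }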
| |
| This approach avoids inventing an "eager" atomic posting split operation |
| that splits the posting list without simultaneously finishing the insert |
| of the incoming item. This alternative design might seem cleaner, but it |
| creates subtle problems for page space accounting. In general, there |
| might not be enough free space on the page to split a posting list such |
| that the incoming/new item no longer overlaps with either posting list |
| half --- the operation could fail before the actual retail insert of the |
| new item even begins. We'd end up having to handle posting list splits |
| that need a page split anyway. Besides, supporting variable "split points" |
| while splitting posting lists won't actually improve overall space |
| utilization. |
| |
| Notes About Data Representation |
| ------------------------------- |
| |
| The right-sibling link required by L&Y is kept in the page "opaque |
| data" area, as is the left-sibling link, the page level, and some flags. |
| The page level counts upwards from zero at the leaf level, to the tree |
| depth minus 1 at the root. (Counting up from the leaves ensures that we |
| don't need to renumber any existing pages when splitting the root.) |
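| 
| Concretely, the opaque data looks approximately like this (paraphrased
| from src/include/access/nbtree.h; consult the header for the
| authoritative definition):
| 
|     typedef struct BTPageOpaqueData
|     {
|         BlockNumber btpo_prev;    /* left sibling, or P_NONE if leftmost */
|         BlockNumber btpo_next;    /* right sibling, or P_NONE if rightmost */
|         uint32      btpo_level;   /* tree level --- zero for leaf pages */
|         uint16      btpo_flags;   /* flag bits */
|         BTCycleId   btpo_cycleid; /* vacuum cycle ID of latest split */
|     } BTPageOpaqueData;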
| |
| The Postgres disk block data format (an array of items) doesn't fit |
| Lehman and Yao's alternating-keys-and-pointers notion of a disk page, |
| so we have to play some games. (The alternating-keys-and-pointers |
| notion is important for internal page splits, which conceptually split |
| at the middle of an existing pivot tuple -- the tuple's "separator" key |
| goes on the left side of the split as the left side's new high key, |
| while the tuple's pointer/downlink goes on the right side as the |
| first/minus infinity downlink.) |
| |
| On a page that is not rightmost in its tree level, the "high key" is |
| kept in the page's first item, and real data items start at item 2. |
| The link portion of the "high key" item goes unused. A page that is |
| rightmost has no "high key" (it's implicitly positive infinity), so |
| data items start with the first item. Putting the high key at the |
| left, rather than the right, may seem odd, but it avoids moving the |
| high key as we add data items. |
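| 
| In code form, this layout rule is captured by a few macros (again
| paraphrased from src/include/access/nbtree.h):
| 
|     #define P_HIKEY                ((OffsetNumber) 1)
|     #define P_FIRSTKEY             ((OffsetNumber) 2)
|     #define P_FIRSTDATAKEY(opaque) \
|         (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)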
| |
| On a leaf page, the data items are simply links to (TIDs of) tuples |
| in the relation being indexed, with the associated key values. |
| |
| On a non-leaf page, the data items are down-links to child pages with |
| bounding keys. The key in each data item is a strict lower bound for |
| keys on that child page, so logically the key is to the left of that |
| downlink. The high key (if present) is the upper bound for the last |
| downlink. The first data item on each such page has no lower bound |
| --- or lower bound of minus infinity, if you prefer. The comparison |
| routines must treat it accordingly. The actual key stored in the |
| item is irrelevant, and need not be stored at all. This arrangement |
| corresponds to the fact that an L&Y non-leaf page has one more pointer |
| than key. Suffix truncation's negative infinity attributes behave in |
| the same way. |
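| 
| The special treatment of the first data item amounts to a short-circuit
| in the comparison routine, roughly as follows (a sketch of the rule
| applied by _bt_compare() in nbtsearch.c):
| 
|     #include <stdbool.h>
| 
|     /*
|      * Sketch: the first data item on an internal page is minus
|      * infinity, so any scan key compares greater than it, regardless of
|      * what bytes the item physically contains.
|      */
|     static int
|     compare_at_offset(bool is_leaf, int offnum, int firstdatakey,
|                       int real_keycmp_result)
|     {
|         if (!is_leaf && offnum == firstdatakey)
|             return 1;                   /* scankey > minus infinity */
|         return real_keycmp_result;      /* otherwise, real comparison */
|     }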