| src/backend/access/nbtree/README
| |
| Btree Indexing |
| ============== |
| |
| This directory contains a correct implementation of Lehman and Yao's |
| high-concurrency B-tree management algorithm (P. Lehman and S. Yao, |
| Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions |
| on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also |
| use a simplified version of the deletion logic described in Lanin and |
| Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, |
| Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). |
| |
| The basic Lehman & Yao Algorithm |
| -------------------------------- |
| |
| Compared to a classic B-tree, L&Y adds a right-link pointer to each page, |
| to the page's right sibling. It also adds a "high key" to each page, which |
| is an upper bound on the keys that are allowed on that page. These two |
| additions make it possible to detect a concurrent page split, which allows |
| the tree to be searched without holding any read locks (except to keep a |
| single page from being modified while reading it). |
| |
| When a search follows a downlink to a child page, it compares the page's |
| high key with the search key. If the search key is greater than the high |
| key, the page must have been split concurrently, and you must follow the
| right-link to find the new page containing the key range you're looking |
| for. This might need to be repeated, if the page has been split more than |
| once. |
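| |
| To make the move-right rule concrete, here is a minimal C sketch. The
| Page type and its fields are simplified stand-ins for the real page
| format, and all locking is ignored:
| |
|     #include <stddef.h>
| |
|     /* Illustrative sketch only; not PostgreSQL's actual structures */
|     typedef struct Page
|     {
|         int         high_key;       /* upper bound on keys for this page */
|         struct Page *right_link;    /* right sibling, NULL if rightmost */
|     } Page;
| |
|     /* Follow right-links laid down by concurrent splits until the
|      * current page's key range covers search_key.  A rightmost page
|      * covers everything, so it needs no high key check. */
|     static Page *
|     move_right(Page *page, int search_key)
|     {
|         while (page->right_link != NULL && search_key > page->high_key)
|             page = page->right_link;
|         return page;
|     }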
| |
| Lehman and Yao talk about alternating "separator" keys and downlinks in |
| internal pages rather than tuples or records. We use the term "pivot" |
| tuple to refer to tuples that don't point to heap tuples and are used
| only for tree navigation. All tuples on non-leaf pages and high keys on |
| leaf pages are pivot tuples. Since pivot tuples are only used to represent |
| which part of the key space belongs on each page, they can have attribute |
| values copied from non-pivot tuples that were deleted and killed by VACUUM |
| some time ago. A pivot tuple may contain a "separator" key and downlink, |
| just a separator key (i.e. the downlink value is implicitly undefined), or |
| just a downlink (i.e. all attributes are truncated away). |
| |
| The requirement that all btree keys be unique is satisfied by treating heap |
| TID as a tiebreaker attribute. Logical duplicates are sorted in heap TID |
| order. This is necessary because Lehman and Yao also require that the key |
| range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are |
| the adjacent keys in the parent page (Ki must be _strictly_ less than v, |
| which is assured by having reliably unique keys). Keys are always unique |
| on their level, with the exception of a leaf page's high key, which can be |
| fully equal to the last item on the page. |
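| |
| The tiebreak rule amounts to the following sketch (invented, simplified
| types; the real comparison is performed by _bt_compare() on tuples with
| an ItemPointerData TID):
| |
|     #include <stdint.h>
| |
|     typedef struct IndexKey
|     {
|         int32_t     user_key;   /* stands in for the indexed attribute(s) */
|         uint64_t    heap_tid;   /* heap block + offset, flattened here */
|     } IndexKey;
| |
|     static int
|     index_key_compare(const IndexKey *a, const IndexKey *b)
|     {
|         if (a->user_key != b->user_key)
|             return (a->user_key < b->user_key) ? -1 : 1;
|         /* logical duplicates: heap TID breaks the tie, so no two
|          * keys on a level ever compare as fully equal */
|         if (a->heap_tid != b->heap_tid)
|             return (a->heap_tid < b->heap_tid) ? -1 : 1;
|         return 0;
|     }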
| |
| The Postgres implementation of suffix truncation must make sure that the |
| Lehman and Yao invariants hold, and represents that absent/truncated |
| attributes in pivot tuples have the sentinel value "minus infinity". The |
| later section on suffix truncation will be helpful if it's unclear how the |
| Lehman & Yao invariants work with a real world example. |
| |
| Differences to the Lehman & Yao algorithm |
| ----------------------------------------- |
| |
| We have made the following changes in order to incorporate the L&Y algorithm |
| into Postgres: |
| |
| Lehman and Yao don't require read locks, but assume that in-memory |
| copies of tree pages are unshared. Postgres shares in-memory buffers |
| among backends. As a result, we do page-level read locking on btree |
| pages in order to guarantee that no record is modified while we are |
| examining it. This reduces concurrency but guarantees correct |
| behavior. |
| |
| We support the notion of an ordered "scan" of an index as well as |
| insertions, deletions, and simple lookups. A scan in the forward |
| direction is no problem: we just use the right-sibling pointers that
| L&Y require anyway. (Thus, once we have descended the tree to the |
| correct start point for the scan, the scan looks only at leaf pages |
| and never at higher tree levels.) To support scans in the backward |
| direction, we also store a "left sibling" link much like the "right |
| sibling". (This adds an extra step to the L&Y split algorithm: while |
| holding the write lock on the page being split, we also lock its former |
| right sibling to update that page's left-link. This is safe since no |
| writer of that page can be interested in acquiring a write lock on our |
| page.) A backwards scan has one additional bit of complexity: after |
| following the left-link we must account for the possibility that the |
| left sibling page got split before we could read it. So, we have to |
| move right until we find a page whose right-link matches the page we |
| came from. (Actually, it's even harder than that; see page deletion |
| discussion below.) |
| |
| Page read locks are held only for as long as a scan is examining a page. |
| To minimize lock/unlock traffic, an index scan always searches a leaf page |
| to identify all the matching items at once, copying their heap tuple IDs |
| into backend-local storage. The heap tuple IDs are then processed while |
| not holding any page lock within the index. We do continue to hold a pin |
| on the leaf page in some circumstances, to protect against concurrent |
| deletions (see below). In this state the scan is effectively stopped |
| "between" pages, either before or after the page it has pinned. This is |
| safe in the presence of concurrent insertions and even page splits, because |
| items are never moved across pre-existing page boundaries --- so the scan |
| cannot miss any items it should have seen, nor accidentally return the same |
| item twice. The scan must remember the page's right-link at the time it |
| was scanned, since that is the page to move right to; if we move right to |
| the current right-link then we'd re-scan any items moved by a page split. |
| We don't similarly remember the left-link, since it's best to use the most |
| up-to-date left-link when trying to move left (see detailed move-left |
| algorithm below). |
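| |
| The batched approach can be pictured with a simplified scan-position
| struct. This is only a sketch, loosely modeled on BTScanPosData; the
| field names and item bound are invented:
| |
|     #include <stdint.h>
| |
|     #define MAX_ITEMS_PER_PAGE 1024     /* illustrative bound only */
| |
|     typedef struct ScanPos
|     {
|         uint32_t    current_page;   /* block whose items were copied */
|         uint32_t    next_page;      /* right-link as of when it was read,
|                                      * NOT the current right-link */
|         int         n_items;        /* matching items found on the page */
|         uint64_t    heap_tids[MAX_ITEMS_PER_PAGE];  /* backend-local copy */
|     } ScanPos;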
| |
| In most cases we release our lock and pin on a page before attempting |
| to acquire pin and lock on the page we are moving to. In a few places |
| it is necessary to lock the next page before releasing the current one. |
| This is safe when moving right or up, but not when moving left or down |
| (else we'd create the possibility of deadlocks). |
| |
| Lehman and Yao fail to discuss what must happen when the root page |
| becomes full and must be split. Our implementation is to split the |
| root in the same way that any other page would be split, then construct |
| a new root page holding pointers to both of the resulting pages (which |
| now become siblings on the next level of the tree). The new root page |
| is then installed by altering the root pointer in the meta-data page (see |
| below). This works because the root is not treated specially in any |
| other way --- in particular, searches will move right using its link |
| pointer if the link is set. Therefore, searches will find the data |
| that's been moved into the right sibling even if they read the meta-data |
| page before it got updated. This is the same reasoning that makes a |
| split of a non-root page safe. The locking considerations are similar too. |
| |
| When an inserter recurses up the tree, splitting internal pages to insert |
| links to pages inserted on the level below, it is possible that it will |
| need to access a page above the level that was the root when it began its |
| descent (or more accurately, the level that was the root when it read the |
| meta-data page). In this case the stack it made while descending does not |
| help for finding the correct page. When this happens, we find the correct |
| place by re-descending the tree until we reach the level one above the |
| level we need to insert a link to, and then moving right as necessary. |
| (Typically this will take only two fetches, the meta-data page and the new |
| root, but in principle there could have been more than one root split |
| since we saw the root. We can identify the correct tree level by means of |
| the level numbers stored in each page. The situation is rare enough that |
| we do not need a more efficient solution.) |
| |
| Lehman and Yao must couple/chain locks as part of moving right when |
| relocating a child page's downlink during an ascent of the tree. This is |
| the only point where Lehman and Yao have to simultaneously hold three |
| locks (a lock on the child, the original parent, and the original parent's |
| right sibling). We don't need to couple internal page locks for pages on |
| the same level, though. We match a child's block number to a downlink |
| from a pivot tuple one level up, whereas Lehman and Yao match on the |
| separator key associated with the downlink that was followed during the |
| initial descent. We can release the lock on the original parent page |
| before acquiring a lock on its right sibling, since there is never any |
| need to deal with the case where the separator key that we must relocate |
| becomes the original parent's high key. Lanin and Shasha don't couple |
| locks here either, though they also don't couple locks between levels |
| during ascents. They are willing to "wait and try again" to avoid races. |
| Their algorithm is optimistic, which means that "an insertion holds no |
| more than one write lock at a time during its ascent". We more or less |
| stick with Lehman and Yao's approach of conservatively coupling parent and |
| child locks when ascending the tree, since it's far simpler. |
| |
| Lehman and Yao assume fixed-size keys, but we must deal with |
| variable-size keys. Therefore there is not a fixed maximum number of |
| keys per page; we just stuff in as many as will fit. When we split a |
| page, we try to equalize the number of bytes, not items, assigned to |
| pages (though suffix truncation is also considered). Note we must include |
| the incoming item in this calculation, otherwise it is possible to find |
| that the incoming item doesn't fit on the split page where it needs to go! |
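| |
| The byte-balancing idea reduces to something like the sketch below. The
| real logic in _bt_findsplitloc() also weighs suffix truncation and
| fillfactor; here item_sizes[] must already include the incoming item,
| per the note above:
| |
|     #include <stdlib.h>
| |
|     /* Return the offset such that items [0, offset) go to the left
|      * page, chosen to minimize the byte imbalance between halves. */
|     static int
|     choose_split_offset(const int *item_sizes, int n_items, int total_bytes)
|     {
|         int     left_bytes = 0;
|         int     best_offset = 1;
|         int     best_delta = total_bytes;
| |
|         for (int off = 1; off < n_items; off++)
|         {
|             left_bytes += item_sizes[off - 1];
|             int delta = abs(left_bytes - (total_bytes - left_bytes));
|             if (delta < best_delta)
|             {
|                 best_delta = delta;
|                 best_offset = off;
|             }
|         }
|         return best_offset;
|     }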
| |
| Deleting index tuples during VACUUM |
| ----------------------------------- |
| |
| Before deleting a leaf item, we get a full cleanup lock on the target |
| page, so that no other backend has a pin on the page when the deletion |
| starts. This is not necessary for correctness in terms of the btree index |
| operations themselves; as explained above, index scans logically stop |
| "between" pages and so can't lose their place. The reason we do it is to |
| provide an interlock between VACUUM and index scans that are not prepared |
| to deal with concurrent TID recycling when visiting the heap. Since only |
| VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since |
| this only ever happens _after_ btbulkdelete returns, having index scans |
| hold on to the pin (used when reading from the leaf page) until _after_ |
| they're done visiting the heap (for TIDs from the pinned leaf page) prevents
| concurrent TID recycling. VACUUM cannot get a conflicting cleanup lock |
| until the index scan is totally finished processing its leaf page. |
| |
| This approach is fairly coarse, so we avoid it whenever possible. In |
| practice most index scans won't hold onto their pin, and so won't block |
| VACUUM. These index scans must deal with TID recycling directly, which is |
| more complicated and not always possible. See later section on making |
| concurrent TID recycling safe. |
| |
| Opportunistic index tuple deletion performs almost the same page-level |
| modifications while only holding an exclusive lock. This is safe because |
| there is no question of TID recycling taking place later on -- only VACUUM |
| can make TIDs recyclable. See also simple deletion and bottom-up |
| deletion, below. |
| |
| Because a pin is not always held, and a page can be split even while |
| someone does hold a pin on it, it is possible that an indexscan will |
| return items that are no longer stored on the page it has a pin on, but |
| rather somewhere to the right of that page. To ensure that VACUUM can't |
| prematurely make TIDs recyclable in this scenario, we require btbulkdelete |
| to obtain a cleanup lock on every leaf page in the index, even pages that |
| don't contain any deletable tuples. Note that this requirement does not |
| say that btbulkdelete must visit the pages in any particular order. |
| |
| VACUUM's linear scan, concurrent page splits |
| -------------------------------------------- |
| |
| VACUUM accesses the index by doing a linear scan to search for deletable |
| TIDs, while considering the possibility of deleting empty pages in |
| passing. This is in physical/block order, not logical/keyspace order. |
| The tricky part of this is avoiding missing any deletable tuples in the |
| presence of concurrent page splits: a page split could easily move some |
| tuples from a page not yet passed over by the sequential scan to a |
| lower-numbered page already passed over. |
| |
| To implement this, we provide a "vacuum cycle ID" mechanism that makes it |
| possible to determine whether a page has been split since the current |
| btbulkdelete cycle started. If btbulkdelete finds a page that has been |
| split since it started, and has a right-link pointing to a lower page |
| number, then it temporarily suspends its sequential scan and visits that |
| page instead. It must continue to follow right-links and vacuum dead |
| tuples until reaching a page that either hasn't been split since |
| btbulkdelete started, or is above the location of the outer sequential |
| scan. Then it can resume the sequential scan. This ensures that all |
| tuples are visited. It may be that some tuples are visited twice, but |
| that has no worse effect than an inaccurate index tuple count (and we |
| can't guarantee an accurate count anyway in the face of concurrent |
| activity). Note that this still works if the has-been-recently-split test |
| has a small probability of false positives, so long as it never gives a |
| false negative. This makes it possible to implement the test with a small |
| counter value stored on each index page. |
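| |
| The resulting test can be sketched as follows, with invented stand-in
| types (in the real code the cycle ID is a small counter stamped on
| pages at split time, as described above):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef struct PageState
|     {
|         uint16_t    cycle_id;       /* stamped when the page was split */
|         uint32_t    right_link;     /* block number of right sibling */
|     } PageState;
| |
|     /* True if the page may have been split since this btbulkdelete
|      * began, moving tuples to a lower-numbered (already-passed) page.
|      * False positives are tolerable; false negatives are not. */
|     static bool
|     needs_backtrack(const PageState *page, uint16_t scan_cycle_id,
|                     uint32_t scan_position)
|     {
|         return page->cycle_id == scan_cycle_id &&
|                page->right_link < scan_position;
|     }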
| |
| Deleting entire pages during VACUUM |
| ----------------------------------- |
| |
| We consider deleting an entire page from the btree only when it's become |
| completely empty of items. (Merging partly-full pages would allow better |
| space reuse, but it seems impractical to move existing data items left or |
| right to make this happen --- a scan moving in the opposite direction |
| might miss the items if so.) Also, we *never* delete the rightmost page |
| on a tree level (this restriction simplifies the traversal algorithms, as |
| explained below). Page deletion always begins from an empty leaf page. An |
| internal page can only be deleted as part of deleting an entire subtree. |
| This is always a "skinny" subtree consisting of a "chain" of internal pages |
| plus a single leaf page. There is one page on each level of the subtree, |
| and each level/page covers the same key space. |
| |
| Deleting a leaf page is a two-stage process. In the first stage, the page |
| is unlinked from its parent, and marked as half-dead. The parent page must |
| be found using the same type of search as used to find the parent during an |
| insertion split. We lock the target and the parent pages, change the |
| target's downlink to point to the right sibling, and remove its old |
| downlink. This causes the target page's key space to effectively belong to |
| its right sibling. (Neither the left nor right sibling pages need to |
| change their "high key" if any; so there is no problem with possibly not |
| having enough space to replace a high key.) At the same time, we mark the |
| target page as half-dead, which causes any subsequent searches to ignore it |
| and move right (or left, in a backwards scan). This leaves the tree in a |
| similar state as during a page split: the page has no downlink pointing to |
| it, but it's still linked to its siblings. |
| |
| (Note: Lanin and Shasha prefer to make the key space move left, but their |
| argument for doing so hinges on not having left-links, which we have |
| anyway. So we simplify the algorithm by moving the key space right. This |
| is only possible because we don't match on a separator key when ascending |
| the tree during a page split, unlike Lehman and Yao/Lanin and Shasha -- it |
| doesn't matter if the downlink is re-found in a pivot tuple whose separator |
| key does not match the one encountered when the inserter initially
| descended the tree.)
| |
| To preserve consistency on the parent level, we cannot merge the key space |
| of a page into its right sibling unless the right sibling is a child of |
| the same parent --- otherwise, the parent's key space assignment changes |
| too, meaning we'd have to make bounding-key updates in its parent, and |
| perhaps all the way up the tree. Since we can't possibly do that |
| atomically, we forbid this case. That means that the rightmost child of a |
| parent node can't be deleted unless it's the only remaining child, in which |
| case we will delete the parent too (see below). |
| |
| In the second stage, the half-dead leaf page is unlinked from its siblings.
| We first lock the left sibling (if any) of the target, the target page |
| itself, and its right sibling (there must be one) in that order. Then we |
| update the side-links in the siblings, and mark the target page deleted. |
| |
| When we're about to delete the last remaining child of a parent page, things |
| are slightly more complicated. In the first stage, we leave the immediate |
| parent of the leaf page alone, and remove the downlink to the parent page |
| instead, from the grandparent. If it's the last child of the grandparent |
| too, we recurse up until we find a parent with more than one child, and |
| remove the downlink of that page. The leaf page is marked as half-dead, and |
| the block number of the page whose downlink was removed is stashed in the |
| half-dead leaf page. This leaves us with a chain of internal pages, with |
| one downlink each, leading to the half-dead leaf page, and no downlink |
| pointing to the topmost page in the chain. |
| |
| While we recurse up to find the topmost parent in the chain, we keep the |
| leaf page locked, but don't need to hold locks on the intermediate pages |
| between the leaf and the topmost parent -- insertions into upper tree levels |
| happen only as a result of splits of child pages, and that can't happen as |
| long as we're keeping the leaf locked. The internal pages in the chain |
| cannot acquire new children afterwards either, because the leaf page is |
| marked as half-dead and won't be split. |
| |
| Removing the downlink to the top of the to-be-deleted subtree/chain |
| effectively transfers the key space to the right sibling for all the |
| intermediate levels too, in one atomic operation. A concurrent search might |
| still visit the intermediate pages, but it will move right when it reaches |
| the half-dead page at the leaf level. In particular, the search will move to |
| the subtree to the right of the half-dead leaf page/to-be-deleted subtree, |
| since the half-dead leaf page's right sibling must be a "cousin" page, not a |
| "true" sibling page (or a second cousin page when the to-be-deleted chain |
| starts at leaf page's grandparent page, and so on). |
| |
| In the second stage, the topmost page in the chain is unlinked from its |
| siblings, and the half-dead leaf page is updated to point to the next page |
| down in the chain. This is repeated until there are no internal pages left |
| in the chain. Finally, the half-dead leaf page itself is unlinked from its |
| siblings. |
| |
| A deleted page cannot be recycled immediately, since there may be other |
| processes waiting to reference it (ie, search processes that just left the |
| parent, or scans moving right or left from one of the siblings). These |
| processes must be able to observe a deleted page for some time after the |
| deletion operation, in order to be able to at least recover from it (they |
| recover by moving right, as with concurrent page splits). Searchers never |
| have to worry about concurrent page recycling. |
| |
| See "Placing deleted pages in the FSM" section below for a description of |
| when and how deleted pages become safe for VACUUM to make recyclable. |
| |
| Page deletion and backwards scans |
| --------------------------------- |
| |
| Moving left in a backward scan is complicated because we must consider |
| the possibility that the left sibling was just split (meaning we must find |
| the rightmost page derived from the left sibling), plus the possibility |
| that the page we were just on has now been deleted and hence isn't in the |
| sibling chain at all anymore. So the move-left algorithm becomes: |
| |
| 0. Remember the page we are on as the "original page". |
| 1. Follow the original page's left-link (we're done if this is zero). |
| 2. If the current page is live and its right-link matches the "original |
| page", we are done. |
| 3. Otherwise, move right one or more times looking for a live page whose |
| right-link matches the "original page". If found, we are done. (In |
| principle we could scan all the way to the right end of the index, but |
| in practice it seems better to give up after a small number of tries. |
| It's unlikely the original page's sibling split more than a few times |
| while we were in flight to it; if we do not find a matching link in a |
| few tries, then most likely the original page is deleted.) |
| 4. Return to the "original page". If it is still live, return to step 1 |
| (we guessed wrong about it being deleted, and should restart with its |
| current left-link). If it is dead, move right until a non-dead page |
| is found (there must be one, since rightmost pages are never deleted), |
| mark that as the new "original page", and return to step 1. |
| |
| This algorithm is correct because the live page found by step 4 will have |
| the same left keyspace boundary as the page we started from. Therefore, |
| when we ultimately exit, it must be on a page whose right keyspace |
| boundary matches the left boundary of where we started --- which is what |
| we need to be sure we don't miss or re-scan any items. |
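| |
| The steps above translate into roughly the following C sketch (invented
| types, no locking; this is approximately what _bt_walk_left() has to
| implement):
| |
|     #include <stdbool.h>
|     #include <stddef.h>
| |
|     typedef struct Pg
|     {
|         struct Pg  *left;
|         struct Pg  *right;
|         bool        live;       /* false once half-dead or deleted */
|     } Pg;
| |
|     #define MAX_RIGHT_TRIES 4   /* give up after a few tries (step 3) */
| |
|     static Pg *
|     move_left(Pg *original)             /* step 0: remember the page */
|     {
|         for (;;)
|         {
|             Pg *p = original->left;     /* step 1: follow left-link */
| |
|             if (p == NULL)
|                 return NULL;            /* leftmost page; we are done */
|             for (int tries = 0; p != NULL && tries < MAX_RIGHT_TRIES;
|                  tries++)
|             {
|                 if (p->live && p->right == original)
|                     return p;           /* steps 2 and 3: match found */
|                 p = p->right;
|             }
|             if (!original->live)        /* step 4 */
|             {
|                 do
|                     original = original->right; /* rightmost never dies */
|                 while (!original->live);
|             }
|             /* else we guessed wrong; retry with the current left-link */
|         }
|     }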
| |
| Page deletion and tree height |
| ----------------------------- |
| |
| Because we never delete the rightmost page of any level (and in particular |
| never delete the root), it's impossible for the height of the tree to |
| decrease. After massive deletions we might have a scenario in which the |
| tree is "skinny", with several single-page levels below the root. |
| Operations will still be correct in this case, but we'd waste cycles |
| descending through the single-page levels. To handle this we use an idea |
| from Lanin and Shasha: we keep track of the "fast root" level, which is |
| the lowest single-page level. The meta-data page keeps a pointer to this |
| level as well as the true root. All ordinary operations initiate their |
| searches at the fast root not the true root. When we split a page that is |
| alone on its level or delete the next-to-last page on a level (both cases |
| are easily detected), we have to make sure that the fast root pointer is |
| adjusted appropriately. In the split case, we do this work as part of the |
| atomic update for the insertion into the parent level; in the delete case |
| as part of the atomic update for the delete (either way, the metapage has |
| to be the last page locked in the update to avoid deadlock risks). This |
| avoids race conditions if two such operations are executing concurrently. |
| |
| Placing deleted pages in the FSM |
| -------------------------------- |
| |
| Recycling a page is decoupled from page deletion. A deleted page can only |
| be put in the FSM to be recycled once there is no possible scan or search |
| that has a reference to it; until then, it must stay in place with its |
| sibling links undisturbed, as a tombstone that allows concurrent searches |
| to detect and then recover from concurrent deletions (which are rather |
| like concurrent page splits to searchers). This design is an |
| implementation of what Lanin and Shasha call "the drain technique". |
| |
| We implement the technique by waiting until all active snapshots and |
| registered snapshots as of the page deletion are gone, which is overly
| strong, but is simple to implement within Postgres. When marked fully |
| dead, a deleted page is labeled with the next-transaction counter value. |
| VACUUM can reclaim the page for re-use when the stored XID is guaranteed |
| to be "visible to everyone". As collateral damage, we wait for snapshots |
| taken until the next transaction to allocate an XID commits. We also wait |
| for running XIDs with no snapshots. |
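| |
| The recycle-safety test itself is simple, as in this sketch with
| invented types (roughly what BTPageIsRecyclable() checks against a
| visible-to-everyone horizon):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef uint64_t FullXid;
| |
|     typedef struct DeletedPage
|     {
|         bool    deleted;
|         FullXid safexid;    /* next-XID counter value at deletion time */
|     } DeletedPage;
| |
|     /* Safe to recycle once no scan started before the deletion can
|      * still be in flight, i.e. safexid is older than every backend's
|      * horizon. */
|     static bool
|     page_is_recyclable(const DeletedPage *page, FullXid everyone_horizon)
|     {
|         return page->deleted && page->safexid < everyone_horizon;
|     }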
| |
| Prior to PostgreSQL 14, VACUUM would only place _old_ deleted pages that |
| it encounters during its linear scan (pages deleted by a previous VACUUM |
| operation) in the FSM. Newly deleted pages were never placed in the FSM, |
| because that was assumed to _always_ be unsafe. That assumption was |
| unnecessarily pessimistic in practice, though -- it often doesn't take |
| very long for newly deleted pages to become safe to place in the FSM. |
| There is no truly principled way to predict when deleted pages will become |
| safe to place in the FSM for recycling -- it might become safe almost |
| immediately (long before the current VACUUM completes), or it might not |
| even be safe by the time the next VACUUM takes place. Recycle safety is |
| purely a question of maintaining the consistency (or at least the apparent |
| consistency) of a physical data structure. The state within the backend |
| running VACUUM is simply not relevant. |
| |
| PostgreSQL 14 added the ability for VACUUM to consider if it's possible to |
| recycle newly deleted pages at the end of the full index scan where the |
| page deletion took place. It is convenient to check if it's safe at that |
| point. This does require that VACUUM keep around a little bookkeeping |
| information about newly deleted pages, but that's very cheap. Using |
| in-memory state for this avoids the need to revisit newly deleted pages a |
| second time later on -- we can just use safexid values from the local |
| bookkeeping state to determine recycle safety in a deferred fashion. |
| |
| The need for additional FSM indirection after a page deletion operation |
| takes place is a natural consequence of the highly permissive rules for |
| index scans with Lehman and Yao's design. In general an index scan |
| doesn't have to hold a lock or even a pin on any page when it descends the |
| tree (nothing that you'd usually think of as an interlock is held "between |
| levels"). At the same time, index scans cannot be allowed to land on a |
| truly unrelated page due to concurrent recycling (not to be confused with |
| concurrent deletion), because that results in wrong answers to queries. |
| Simpler approaches to page deletion that don't need to defer recycling are |
| possible, but none seem compatible with Lehman and Yao's design. |
| |
| Placing an already-deleted page in the FSM to be recycled when needed |
| doesn't actually change the state of the page. The page will be changed |
| whenever it is subsequently taken from the FSM for reuse. The deleted |
| page's contents will be overwritten by the split operation (it will become |
| the new right sibling page). |
| |
| Making concurrent TID recycling safe |
| ------------------------------------ |
| |
| As explained in the earlier section about deleting index tuples during |
| VACUUM, we implement a locking protocol that allows individual index scans |
| to avoid concurrent TID recycling. Index scans opt-out (and so drop their |
| leaf page pin when visiting the heap) whenever it's safe to do so, though. |
| Dropping the pin early is useful because it avoids blocking progress by |
| VACUUM. This is particularly important with index scans used by cursors, |
| since idle cursors sometimes stop for relatively long periods of time. In |
| extreme cases, a client application may hold on to an idle cursor for
| hours or even days. Blocking VACUUM for that long could be disastrous. |
| |
| Index scans that don't hold on to a buffer pin are protected by holding an |
| MVCC snapshot instead. This more limited interlock prevents wrong answers |
| to queries, but it does not prevent concurrent TID recycling itself (only |
| holding onto the leaf page pin while accessing the heap ensures that). |
| |
| Index-only scans can never drop their buffer pin, since they are unable to |
| tolerate having a referenced TID become recyclable. Index-only scans |
| typically just visit the visibility map (not the heap proper), and so will |
| not reliably notice that any stale TID reference (for a TID that pointed |
| to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in |
| the heap by VACUUM. This could easily allow VACUUM to set the whole heap |
| page to all-visible in the visibility map immediately afterwards. An MVCC |
| snapshot is only sufficient to avoid problems during plain index scans |
| because they must access granular visibility information from the heap |
| proper. A plain index scan will even recognize LP_UNUSED items in the |
| heap (items that could be recycled but haven't been just yet) as "not |
| visible" -- even when the heap page is generally considered all-visible. |
| |
| LP_DEAD setting of index tuples by the kill_prior_tuple optimization |
| (described in full in simple deletion, below) is also more complicated for |
| index scans that drop their leaf page pins. We must be careful to avoid |
| LP_DEAD-marking any new index tuple that looks like a known-dead index |
| tuple because it happens to share the same TID, following concurrent TID |
| recycling. It's just about possible that some other session inserted a |
| new, unrelated index tuple, on the same leaf page, which has the same |
| original TID. It would be totally wrong to LP_DEAD-set this new, |
| unrelated index tuple. |
| |
| We handle this kill_prior_tuple race condition by having affected index |
| scans conservatively assume that any change to the leaf page at all |
| implies that it was reached by btbulkdelete in the interim period when no |
| buffer pin was held. This is implemented by not setting any LP_DEAD bits |
| on the leaf page at all when the page's LSN has changed. (That won't work |
| with an unlogged index, so for now we don't ever apply the "don't hold |
| onto pin" optimization there.) |
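| |
| Sketched with invented names, the rule looks like this (the scan
| remembers the leaf page's LSN when reading it, and compares on return
| from the heap):
| |
|     #include <stdbool.h>
|     #include <stdint.h>
| |
|     typedef uint64_t Lsn;
| |
|     static bool
|     can_set_lp_dead_bits(Lsn lsn_when_read, Lsn lsn_now, bool pin_held)
|     {
|         if (pin_held)
|             return true;    /* pin blocked VACUUM; no TID was recycled */
|         /* without a pin, any change at all might have been btbulkdelete
|          * followed by TID recycling, so insist on an identical LSN */
|         return lsn_now == lsn_when_read;
|     }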
| |
| Fastpath For Index Insertion |
| ---------------------------- |
| |
| We optimize for a common case of insertion of increasing index key |
| values by caching the last page to which this backend inserted the last |
| value, if this page was the rightmost leaf page. For the next insert, we |
| can then quickly check if the cached page is still the rightmost leaf |
| page and also the correct place to hold the current value. We can avoid |
| the cost of walking down the tree in such common cases. |
| |
| The optimization works on the assumption that there can only be one |
| non-ignorable leaf rightmost page, and so not even a visible-to-everyone |
| style interlock is required. We cannot fail to detect that our hint was |
| invalidated, because there can only be one such page in the B-Tree at |
| any time. It's possible that the page will be deleted and recycled |
| without a backend's cached page also being detected as invalidated, but |
| only when we happen to recycle a block that once again gets used as the
| rightmost leaf page. |
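| |
| The hint validation amounts to something like the following sketch
| (simplified types and tests; the real code re-checks everything after
| acquiring a write lock on the cached block):
| |
|     #include <stdbool.h>
|     #include <stddef.h>
| |
|     typedef struct CachedLeaf
|     {
|         bool    is_leaf;
|         bool    is_rightmost;
|         bool    has_room;       /* enough free space for the new tuple */
|         int     first_key;      /* lowest key currently on the page */
|     } CachedLeaf;
| |
|     /* Use the cached block only if it is still the rightmost leaf page,
|      * has room, and the new key sorts after its existing keys;
|      * otherwise descend from the (fast) root as usual. */
|     static bool
|     fastpath_applies(const CachedLeaf *cached, int new_key)
|     {
|         return cached != NULL &&
|                cached->is_leaf &&
|                cached->is_rightmost &&
|                cached->has_room &&
|                new_key > cached->first_key;
|     }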
| |
| Simple deletion |
| --------------- |
| |
| If a process visits a heap tuple and finds that it's dead and removable |
| (ie, dead to all open transactions, not only that process), then we can |
| return to the index and mark the corresponding index entry "known dead", |
| allowing subsequent index scans to skip visiting the heap tuple. The |
| "known dead" marking works by setting the index item's lp_flags state |
| to LP_DEAD. This is currently only done in plain indexscans, not bitmap |
| scans, because only plain scans visit the heap and index "in sync" and so |
| there's not a convenient way to do it for bitmap scans. Note also that |
| LP_DEAD bits are often set when checking a unique index for conflicts on |
| insert (this is simpler because it takes place when we hold an exclusive |
| lock on the leaf page). |
| |
| Once an index tuple has been marked LP_DEAD it can actually be deleted |
| from the index immediately; since index scans only stop "between" pages, |
| no scan can lose its place from such a deletion. We separate the steps |
| because we allow LP_DEAD to be set with only a share lock (it's like a |
| hint bit for a heap tuple), but physically deleting tuples requires an |
| exclusive lock. We also need to generate a snapshotConflictHorizon for |
| each deletion operation's WAL record, which requires additional |
| coordination with the tableam when the deletion actually takes place.
| (snapshotConflictHorizon value may be used to generate a conflict during |
| subsequent REDO of the record by a standby.) |
| |
| Delaying and batching index tuple deletion like this enables a further |
| optimization: opportunistic checking of "extra" nearby index tuples |
| (tuples that are not LP_DEAD-set) when they happen to be very cheap to |
| check in passing (because we already know that the tableam will be |
| visiting their table block to generate a snapshotConflictHorizon). Any |
| index tuples that turn out to be safe to delete will also be deleted. |
| Simple deletion will behave as if the extra tuples that actually turn |
| out to be delete-safe had their LP_DEAD bits set right from the start. |
| |
| Deduplication can also prevent a page split, but index tuple deletion is |
| our preferred approach. Note that posting list tuples can only have |
| their LP_DEAD bit set when every table TID within the posting list is |
| known dead. This isn't much of a problem in practice because LP_DEAD |
| bits are just a starting point for deletion. What really matters is |
| that _some_ deletion operation that targets related nearby-in-table TIDs |
| takes place at some point before the page finally splits. That's all |
| that's required for the deletion process to perform granular removal of |
| groups of dead TIDs from posting list tuples (without the situation ever |
| being allowed to get out of hand). |
| |
| Bottom-Up deletion |
| ------------------ |
| |
| We attempt to delete whatever duplicates happen to be present on the page |
| when the duplicates are suspected to be caused by version churn from |
| successive UPDATEs. This only happens when we receive an executor hint |
| indicating that optimizations like heapam's HOT have not worked out for |
| the index -- the incoming tuple must be a logically unchanged duplicate |
| which is needed for MVCC purposes, suggesting that that might well be the |
| dominant source of new index tuples on the leaf page in question. (Also, |
| bottom-up deletion is triggered within unique indexes in cases with |
| continual INSERT and DELETE related churn, since that is easy to detect |
| without any external hint.) |
| |
| Simple deletion will already have failed to prevent a page split when a |
| bottom-up deletion pass takes place (often because no LP_DEAD bits were |
| ever set on the page). The two mechanisms have closely related |
| implementations. The same WAL records are used for each operation, and |
| the same tableam infrastructure is used to determine what TIDs/tuples are |
| actually safe to delete. The implementations only differ in how they pick |
| TIDs to consider for deletion, and whether or not the tableam will give up |
| before accessing all table blocks (bottom-up deletion lives with the |
| uncertainty of its success by keeping the cost of failure low). Even |
| still, the two mechanisms are clearly distinct at the conceptual level. |
| |
| Bottom-up index deletion is driven entirely by heuristics (whereas simple |
| deletion is guaranteed to delete at least those index tuples that are |
| already LP_DEAD marked -- there must be at least one). We have no |
| certainty that we'll find even one index tuple to delete. That's why we |
| closely cooperate with the tableam to keep the costs it pays in balance |
| with the benefits we receive. The interface that we use for this is |
| described in detail in access/tableam.h. |
| |
| Bottom-up index deletion can be thought of as a backstop mechanism against |
| unnecessary version-driven page splits. It is based in part on an idea |
| from generational garbage collection: the "generational hypothesis". This |
| is the empirical observation that "most objects die young". Within |
| nbtree, new index tuples often quickly appear in the same place, and then |
| quickly become garbage. There can be intense concentrations of garbage in |
| relatively few leaf pages with certain workloads (or there could be in |
| earlier versions of PostgreSQL without bottom-up index deletion, at |
| least). See doc/src/sgml/btree.sgml for a high-level description of the |
| design principles behind bottom-up index deletion in nbtree, including |
| details of how it complements VACUUM. |
| |
| We expect to find a reasonably large number of tuples that are safe to |
| delete within each bottom-up pass. If we don't then we won't need to |
| consider the question of bottom-up deletion for the same leaf page for |
| quite a while (usually because the page splits, which resolves the |
| situation for the time being). We expect to perform regular bottom-up |
| deletion operations against pages that are at constant risk of unnecessary |
| page splits caused only by version churn. When the mechanism works well |
| we'll constantly be "on the verge" of having version-churn-driven page |
| splits, but never actually have even one. |
| |
| Our duplicate heuristics work well despite being fairly simple. |
| Unnecessary page splits only occur when there are truly pathological |
| levels of version churn (in theory a small amount of version churn could |
| make a page split occur earlier than strictly necessary, but that's pretty |
| harmless). We don't have to understand the underlying workload; we only |
| have to understand the general nature of the pathology that we target. |
| Version churn is easy to spot when it is truly pathological. Affected |
| leaf pages are fairly homogeneous. |
| |
| WAL Considerations |
| ------------------ |
| |
| The insertion and deletion algorithms in themselves don't guarantee btree |
| consistency after a crash. To provide robustness, we depend on WAL |
| replay. A single WAL entry is effectively an atomic action --- we can |
| redo it from the log if it fails to complete. |
| |
| Ordinary item insertions (that don't force a page split) are of course |
| single WAL entries, since they only affect one page. The same for |
| leaf-item deletions (if the deletion brings the leaf page to zero items, |
| it is now a candidate to be deleted, but that is a separate action). |
| |
| An insertion that causes a page split is logged as a single WAL entry for |
| the changes occurring on the insertion's level --- including update of the |
| right sibling's left-link --- followed by a second WAL entry for the |
| insertion on the parent level (which might itself be a page split, requiring |
| an additional insertion above that, etc). |
| |
| For a root split, the follow-on WAL entry is a "new root" entry rather than |
| an "insertion" entry, but details are otherwise much the same. |
| |
| Because splitting involves multiple atomic actions, it's possible that the |
| system crashes between splitting a page and inserting the downlink for the |
| new half to the parent. After recovery, the downlink for the new page will |
| be missing. The search algorithm works correctly, as the page will be found |
| by following the right-link from its left sibling, although if a lot of |
| downlinks in the tree are missing, performance will suffer. A more serious |
| consequence is that if the page without a downlink gets split again, the |
| insertion algorithm will fail to find the location in the parent level to |
| insert the downlink. |
| |
| Our approach is to create any missing downlinks on-the-fly, when searching |
| the tree for a new insertion. It could be done during searches, too, but |
| it seems best not to put any extra updates in what would otherwise be a |
| read-only operation (updating is not possible in hot standby mode anyway). |
| It would seem natural to add the missing downlinks in VACUUM, but since |
| inserting a downlink might require splitting a page, it might fail if you |
| run out of disk space. That would be bad during VACUUM - the reason for |
| running VACUUM in the first place might be that you run out of disk space, |
| and now VACUUM won't finish because you're out of disk space. In contrast, |
| an insertion can require enlarging the physical file anyway. There is one |
| minor exception: VACUUM finishes interrupted splits of internal pages when |
| deleting their children. This allows the code for re-finding parent items |
| to be used by both page splits and page deletion. |
| |
| To identify missing downlinks, when a page is split, the left page is |
| flagged to indicate that the split is not yet complete (INCOMPLETE_SPLIT). |
| When the downlink is inserted to the parent, the flag is cleared atomically |
| with the insertion. The child page is kept locked until the insertion in |
| the parent is finished and the flag in the child cleared, but can be |
| released immediately after that, before recursing up the tree if the parent |
| also needs to be split. This ensures that incompletely split pages should |
| not be seen under normal circumstances; only if insertion to the parent |
| has failed for some reason. (It's also possible for a reader to observe |
| a page with the incomplete split flag set during recovery; see later |
| section on "Scans during Recovery" for details.) |
| |
| We flag the left page, even though it's the right page that's missing the |
| downlink, because it's more convenient to know already when following the |
| right-link from the left page to the right page that it will need to have |
| its downlink inserted to the parent. |
| |
| When splitting a non-root page that is alone on its level, the required |
| metapage update (of the "fast root" link) is performed and logged as part |
| of the insertion into the parent level. When splitting the root page, the |
| metapage update is handled as part of the "new root" action. |
| |
| Each step in page deletion is logged as a separate WAL entry: marking the |
| leaf as half-dead and removing the downlink is one record, and unlinking a |
| page is a second record. If vacuum is interrupted for some reason, or the |
| system crashes, the tree is consistent for searches and insertions. The |
| next VACUUM will find the half-dead leaf page and continue the deletion. |
| |
| Before 9.4, we used to keep track of incomplete splits and page deletions |
| during recovery and finish them immediately at end of recovery, instead of |
| doing it lazily at the next insertion or vacuum. However, that made the |
| recovery much more complicated, and only fixed the problem when crash |
| recovery was performed. An incomplete split can also occur if an otherwise |
| recoverable error, like out-of-memory or out-of-disk-space, happens while |
| inserting the downlink to the parent. |
| |
| Scans during Recovery |
| --------------------- |
| |
| nbtree indexes support read queries in Hot Standby mode. Every atomic |
| action/WAL record makes isolated changes that leave the tree in a |
| consistent state for readers. Readers lock pages according to the same |
| rules that readers follow on the primary. (Readers may have to move |
| right to recover from a "concurrent" page split or page deletion, just |
| like on the primary.) |
| |
| However, there are a couple of differences in how pages are locked by |
| replay/the startup process as compared to the original write operation |
| on the primary. The exceptions involve page splits and page deletions. |
| The first phase and second phase of a page split are processed |
| independently during replay, since they are independent atomic actions. |
| We do not attempt to recreate the coupling of parent and child page |
| write locks that took place on the primary. This is safe because readers |
| never care about the incomplete split flag anyway. Holding on to an |
| extra write lock on the primary is only necessary so that a second |
| writer cannot observe the incomplete split flag before the first writer |
| finishes the split. If we let concurrent writers on the primary observe |
| an incomplete split flag on the same page, each writer would attempt to |
| complete the unfinished split, corrupting the parent page. (Similarly, |
| replay of page deletion records does not hold a write lock on the target |
| leaf page throughout; only the primary needs to block out concurrent |
| writers that insert on to the page being deleted.) |
| |
| WAL replay holds same-level locks in a way that matches the approach |
| taken during original execution, though. This prevents readers from
| observing same-level inconsistencies. It's probably possible to be more |
| lax about how same-level locks are acquired during recovery (most kinds |
| of readers could still move right to recover if we didn't couple |
| same-level locks), but we prefer to be conservative here. |
| |
| During recovery all index scans start with ignore_killed_tuples = false |
| and we never set kill_prior_tuple. We do this because the oldest xmin |
| on the standby server can be older than the oldest xmin on the primary |
| server, which means tuples can be marked LP_DEAD even when they are |
| still visible on the standby. We don't WAL log tuple LP_DEAD bits, but |
| they can still appear in the standby because of full page writes. So |
| we must always ignore them in standby, and that means it's not worth |
| setting them either. (When LP_DEAD-marked tuples are eventually deleted |
| on the primary, the deletion is WAL-logged. Queries that run on a |
| standby therefore get much of the benefit of any LP_DEAD setting that |
| takes place on the primary.) |
| |
| Note that we talk about scans that are started during recovery. We go to |
| a little trouble to allow a scan to start during recovery and end during |
| normal running after recovery has completed. This is a key capability |
| because it allows running applications to continue while the standby |
| changes state into a normally running server. |
| |
| The interlocking required to avoid returning incorrect results from |
| non-MVCC scans is not required on standby nodes. We still get a full |
| cleanup lock when replaying VACUUM records during recovery, but recovery |
| does not need to lock every leaf page (only those leaf pages that have |
| items to delete) -- that's sufficient to avoid breaking index-only scans |
| during recovery (see section above about making TID recycling safe). That |
| leaves concern only for plain index scans. (XXX: Not actually clear why |
| this is totally unnecessary during recovery.) |
| |
| MVCC snapshot plain index scans are always safe, for the same reasons that |
| they're safe during original execution. HeapTupleSatisfiesToast() doesn't |
| use MVCC semantics, though that's because it doesn't need to - if the main |
| heap row is visible then the toast rows will also be visible. So as long |
| as we follow a toast pointer from a visible (live) tuple the corresponding |
| toast rows will also be visible, so we do not need to recheck MVCC on |
| them. |
| |
| Other Things That Are Handy to Know |
| ----------------------------------- |
| |
| Page zero of every btree is a meta-data page. This page stores the |
| location of the root page --- both the true root and the current effective |
| root ("fast" root). To avoid fetching the metapage for every single index |
| search, we cache a copy of the meta-data information in the index's |
| relcache entry (rd_amcache). This is a bit ticklish since using the cache |
| implies following a root page pointer that could be stale. However, a |
| backend following a cached pointer can sufficiently verify whether it |
| reached the intended page; either by checking the is-root flag when it |
| is going to the true root, or by checking that the page has no siblings |
| when going to the fast root. At worst, this could result in descending |
| some extra tree levels if we have a cached pointer to a fast root that is |
| now above the real fast root. Such cases shouldn't arise often enough to |
| be worth optimizing; and in any case we can expect a relcache flush will |
| discard the cached metapage before long, since a VACUUM that's moved the |
| fast root pointer can be expected to issue a statistics update for the |
| index. |
| |
| The algorithm assumes we can fit at least three items per page |
| (a "high key" and two real data items). Therefore it's unsafe |
| to accept items larger than 1/3rd page size. Larger items would |
| work sometimes, but could cause failures later on depending on |
| what else gets put on their page. |
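| |
| As rough arithmetic, with assumed sizes for illustration (the default
| 8192-byte page, a 24-byte page header, a 4-byte line pointer per item;
| the real ceiling is the BTMaxItemSize macro, which is somewhat lower):
| |
|     #include <stdio.h>
| |
|     int
|     main(void)
|     {
|         int page_size = 8192;       /* default BLCKSZ */
|         int page_header = 24;       /* assumed page header size */
|         int line_pointer = 4;       /* assumed per-item overhead */
|         int usable = page_size - page_header;
|         int max_item = usable / 3 - line_pointer;
| |
|         /* prints roughly 2718; anything bigger risks the 3-item rule */
|         printf("approx max index tuple: %d bytes\n", max_item);
|         return 0;
|     }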
| |
| "ScanKey" data structures are used in two fundamentally different ways |
| in this code, which we describe as "search" scankeys and "insertion" |
| scankeys. A search scankey is the kind passed to btbeginscan() or |
| btrescan() from outside the btree code. The sk_func pointers in a search |
| scankey point to comparison functions that return boolean, such as int4lt. |
| There might be more than one scankey entry for a given index column, or |
| none at all. (We require the keys to appear in index column order, but |
| the order of multiple keys for a given column is unspecified.) An |
| insertion scankey ("BTScanInsert" data structure) uses a similar |
| array-of-ScanKey data structure, but the sk_func pointers point to btree |
| comparison support functions (ie, 3-way comparators that return int4 values |
| interpreted as <0, =0, >0). In an insertion scankey there is at most one |
| entry per index column. There is also other data about the rules used to |
| locate where to begin the scan, such as whether or not the scan is a |
| "nextkey" scan. Insertion scankeys are built within the btree code (eg, by |
| _bt_mkscankey()) and are used to locate the starting point of a scan, as |
| well as for locating the place to insert a new index tuple. (Note: in the |
| case of an insertion scankey built from a search scankey or built from a |
| truncated pivot tuple, there might be fewer keys than index columns, |
| indicating that we have no constraints for the remaining index columns.) |
| After we have located the starting point of a scan, the original search |
| scankey is consulted as each index entry is sequentially scanned to decide |
| whether to return the entry and whether the scan can stop (see |
| _bt_checkkeys()). |
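| |
| The two styles differ as in this sketch (standalone int examples in the
| spirit of int4lt and btint4cmp, not the actual catalog functions):
| |
|     #include <stdbool.h>
| |
|     /* "search" style: boolean operator, like int4lt */
|     static bool
|     int_lt(int a, int b)
|     {
|         return a < b;
|     }
| |
|     /* "insertion" style: 3-way comparison support function, like
|      * btint4cmp; <0, =0, >0 results drive the binary search */
|     static int
|     int_cmp(int a, int b)
|     {
|         return (a < b) ? -1 : (a > b) ? 1 : 0;
|     }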
| |
| Notes about suffix truncation |
| ----------------------------- |
| |
| We truncate away suffix key attributes that are not needed for a page high |
| key during a leaf page split. The remaining attributes must distinguish |
| the last index tuple on the post-split left page as belonging on the left |
| page, and the first index tuple on the post-split right page as belonging |
| on the right page. Tuples logically retain truncated key attributes, |
| though they implicitly have "negative infinity" as their value, and have no |
| storage overhead. Since the high key is subsequently reused as the |
| downlink in the parent page for the new right page, suffix truncation makes |
| pivot tuples short. INCLUDE indexes are guaranteed to have non-key |
| attributes truncated at the time of a leaf page split, but may also have |
| some key attributes truncated away, based on the usual criteria for key |
| attributes. They are not a special case, since non-key attributes are |
| merely payload to B-Tree searches. |
| |
| The goal of suffix truncation of key attributes is to improve index |
| fan-out. The technique was first described by Bayer and Unterauer (R.Bayer |
| and K.Unterauer, Prefix B-Trees, ACM Transactions on Database Systems, Vol |
| 2, No. 1, March 1977, pp 11-26). The Postgres implementation is loosely |
| based on their paper. Note that Postgres only implements what the paper |
| refers to as simple prefix B-Trees. Note also that the paper assumes that |
| the tree has keys that consist of single strings that maintain the "prefix |
| property", much like strings that are stored in a suffix tree (comparisons |
| of earlier bytes must always be more significant than comparisons of later |
| bytes, and, in general, the strings must compare in a way that doesn't |
| break transitive consistency as they're split into pieces). Suffix |
| truncation in Postgres currently only works at the whole-attribute |
| granularity, but it would be straightforward to invent opclass |
| infrastructure that manufactures a smaller attribute value in the case of |
| variable-length types, such as text. An opclass support function could |
| manufacture the shortest possible key value that still correctly separates |
| each half of a leaf page split. |
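| |
| To illustrate, a hypothetical support function for byte-wise-ordered
| strings might return the length of the shortest prefix of the right
| item that still sorts strictly above the left item (a sketch only; no
| such infrastructure exists today):
| |
|     #include <stddef.h>
| |
|     /* Assumes memcmp-style ordering and that left < right. */
|     static size_t
|     shortest_separator_len(const char *left, size_t left_len,
|                            const char *right, size_t right_len)
|     {
|         size_t  max = (left_len < right_len) ? left_len : right_len;
| |
|         for (size_t i = 0; i < max; i++)
|         {
|             if (left[i] != right[i])
|                 return i + 1;   /* first differing byte settles it */
|         }
|         /* left is a prefix of right; one more byte is needed */
|         return max + 1;
|     }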
| |
| There are sophisticated criteria for choosing a leaf page split point. The
| general idea is to make suffix truncation effective without unduly |
| influencing the balance of space for each half of the page split. The |
| choice of leaf split point can be thought of as a choice among points |
| *between* items on the page to be split, at least if you pretend that the |
| incoming tuple was placed on the page already (you have to pretend because |
| there won't actually be enough space for it on the page). Choosing the |
| split point between two index tuples where the first non-equal attribute |
| appears as early as possible results in truncating away as many suffix |
| attributes as possible. Evenly balancing space among each half of the |
| split is usually the first concern, but even small adjustments in the |
| precise split point can allow truncation to be far more effective. |
| |
| Suffix truncation is primarily valuable because it makes pivot tuples |
| smaller, which delays splits of internal pages, but that isn't the only |
| reason why it's effective. Even truncation that doesn't make pivot tuples |
| smaller due to alignment still prevents pivot tuples from being more |
| restrictive than truly necessary in how they describe which values belong |
| on which pages. |
| |
| While it's not possible to correctly perform suffix truncation during |
| internal page splits, it's still useful to be discriminating when splitting |
| an internal page. The split point that implies a downlink be inserted in |
| the parent that's the smallest one available within an acceptable range of |
| the fillfactor-wise optimal split point is chosen. This idea also comes |
| from the Prefix B-Tree paper. This process has much in common with what |
| happens at the leaf level to make suffix truncation effective. The overall |
| effect is that suffix truncation tends to produce smaller, more |
| discriminating pivot tuples, especially early in the lifetime of the index, |
| while biasing internal page splits makes the earlier, smaller pivot tuples |
| end up in the root page, delaying root page splits. |
| |
| Logical duplicates are given special consideration. The logic for |
| selecting a split point goes to great lengths to avoid having duplicates |
| span more than one page, and almost always manages to pick a split point |
| between two user-key-distinct tuples, accepting a completely lopsided split |
| if it must. When a page that's already full of duplicates must be split, |
| the fallback strategy assumes that duplicates are mostly inserted in |
| ascending heap TID order. The page is split in a way that leaves the left |
| half of the page mostly full, and the right half of the page mostly empty. |
| The overall effect is that leaf page splits gracefully adapt to inserts of |
| large groups of duplicates, maximizing space utilization. Note also that |
| "trapping" large groups of duplicates on the same leaf page like this makes |
| deduplication more efficient. Deduplication can be performed infrequently, |
| without merging together existing posting list tuples too often. |
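| 
| A simplified picture of the fallback (the real single-value mode is in
| nbtsplitloc.c; the 96% figure corresponds to what nbtree.h calls
| BTREE_SINGLEVAL_FILLFACTOR):
| 
|     /*
|      * Illustrative helper, not actual nbtree code: when every tuple on
|      * the page shares the same user key, put ~96% of the items on the
|      * left half, betting that future duplicates arrive in ascending
|      * heap TID order and will therefore land on the new right sibling.
|      */
|     static int
|     single_value_split_offset(int nitems)
|     {
|         return (int) (nitems * 0.96);   /* split near the page's end */
|     }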
| |
| Notes about deduplication |
| ------------------------- |
| |
| We deduplicate non-pivot tuples in non-unique indexes to reduce storage |
| overhead, and to avoid (or at least delay) page splits. Note that the |
| goals for deduplication in unique indexes are rather different; see the
| "Deduplication in unique indexes" section for details. Deduplication
| alters the physical representation of
| tuples without changing the logical contents of the index, and without |
| adding overhead to read queries. Non-pivot tuples are merged together |
| into a single physical tuple with a posting list (a simple array of heap |
| TIDs with the standard item pointer format). Deduplication is always |
| applied lazily, at the point where it would otherwise be necessary to |
| perform a page split. It occurs only after any LP_DEAD items have been
| removed, as our last line of defense against splitting a leaf page
| (bottom-up index deletion may be attempted first, as our second-to-last
| line of defense). We can set the LP_DEAD bit with posting list tuples,
| though only when all TIDs are known dead.
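| 
| The physical merge itself is conceptually simple (see _bt_dedup_pass()
| and _bt_form_posting() in nbtdedup.c for the real thing; this is only a
| sketch with a stand-in TID type):
| 
|     #include <string.h>
| 
|     typedef unsigned long Tid;  /* stand-in for ItemPointerData */
| 
|     /*
|      * Append the (already sorted) heap TIDs of another duplicate tuple
|      * to the posting list being built; the caller ensures the array has
|      * room.  Tuples are processed in TID order, so no re-sort is needed.
|      * Returns the new TID count.
|      */
|     static int
|     append_to_posting(Tid *posting, int nposting,
|                       const Tid *newtids, int nnew)
|     {
|         memcpy(&posting[nposting], newtids, nnew * sizeof(Tid));
|         return nposting + nnew;
|     }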
| |
| Our lazy approach to deduplication allows the page space accounting used |
| during page splits to have absolutely minimal special case logic for |
| posting lists. Posting lists can be thought of as extra payload that |
| suffix truncation will reliably truncate away as needed during page |
| splits, just like non-key columns from an INCLUDE index tuple. |
| Incoming/new tuples can generally be treated as non-overlapping plain |
| items (though see section on posting list splits for information about how |
| overlapping new/incoming items are really handled). |
| |
| The representation of posting lists is almost identical to the posting |
| lists used by GIN, so it would be straightforward to apply GIN's varbyte |
| encoding compression scheme to individual posting lists. Posting list |
| compression would break the assumptions made by posting list splits about |
| page space accounting (see later section), so it's not clear how |
| compression could be integrated with nbtree. Besides, posting list |
| compression does not offer a compelling trade-off for nbtree, since in |
| general nbtree is optimized for consistent performance with many |
| concurrent readers and writers. Compression would also make the deletion |
| of a subset of TIDs from a posting list slow and complicated, which would |
| be a big problem for workloads that depend heavily on bottom-up index |
| deletion. |
| |
| A major goal of our lazy approach to deduplication is to limit the |
| performance impact of deduplication with random updates. Even concurrent |
| append-only inserts of the same key value will tend to have inserts of |
| individual index tuples in an order that doesn't quite match heap TID |
| order. Delaying deduplication minimizes page level fragmentation. |
| |
| Deduplication in unique indexes |
| ------------------------------- |
| |
| Very often, the number of distinct values that can ever be placed on |
| almost any given leaf page in a unique index is fixed and permanent. For |
| example, a primary key on an identity column will usually only have leaf |
| page splits caused by the insertion of new logical rows within the |
| rightmost leaf page. If there is a split of a non-rightmost leaf page, |
| then the split must have been triggered by inserts associated with UPDATEs |
| of existing logical rows. Splitting a leaf page purely to store multiple |
| versions is a false economy. In effect, we're permanently degrading the |
| index structure just to absorb a temporary burst of duplicates. |
| |
| Deduplication in unique indexes helps to prevent these pathological page |
| splits. Storing duplicates in a space efficient manner is not the goal, |
| since in the long run there won't be any duplicates anyway. Rather, we're |
| buying time for standard garbage collection mechanisms to run before a |
| page split is needed. |
| |
| Unique index leaf pages only get a deduplication pass when an insertion |
| (that might have to split the page) observed an existing duplicate on the |
| page in passing. This is based on the assumption that deduplication will |
| only work out when _all_ new insertions are duplicates from UPDATEs. This |
| may mean that we miss an opportunity to delay a page split, but that's |
| okay because our ultimate goal is to delay leaf page splits _indefinitely_ |
| (i.e. to prevent them altogether). There is little point in trying to |
| delay a split that is probably inevitable anyway. This allows us to avoid |
| the overhead of attempting to deduplicate with unique indexes that always |
| have few or no duplicates. |
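| 
| The triggering condition amounts to a cheap check made during the
| insertion's descent (a sketch with made-up names; the real control flow
| is in nbtinsert.c):
| 
|     #include <stdbool.h>
| 
|     /*
|      * Hypothetical helper: is a deduplication pass worth attempting
|      * before splitting this leaf page?
|      */
|     static bool
|     should_attempt_dedup(bool is_unique, bool saw_duplicate)
|     {
|         if (!is_unique)
|             return true;        /* non-unique index: always worth a try */
|         return saw_duplicate;   /* unique index: only after a duplicate */
|     }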
| |
| Note: Avoiding "unnecessary" page splits driven by version churn is also |
| the goal of bottom-up index deletion, which was added to PostgreSQL 14. |
| Bottom-up index deletion is now the preferred way to deal with this |
| problem (with all kinds of indexes, though especially with unique |
| indexes). Still, deduplication can sometimes augment bottom-up index |
| deletion. When deletion cannot free tuples (due to an old snapshot |
| holding up cleanup), falling back on deduplication provides additional |
| capacity. Delaying the page split by deduplicating can allow a future |
| bottom-up deletion pass of the same page to succeed. |
| |
| Posting list splits |
| ------------------- |
| |
| When the incoming tuple happens to overlap with an existing posting list, |
| a posting list split is performed. Like a page split, a posting list |
| split resolves a situation where a new/incoming item "won't fit", while |
| inserting the incoming item in passing (i.e. as part of the same atomic |
| action). It's possible (though not particularly likely) that an insert of |
| a new item on to an almost-full page will overlap with a posting list, |
| resulting in both a posting list split and a page split. Even then, the |
| atomic action that splits the posting list also inserts the new item |
| (since page splits always insert the new item in passing). Including the |
| posting list split in the same atomic action as the insert avoids problems |
| caused by concurrent inserts into the same posting list -- the exact |
| details of how we change the posting list depend upon the new item, and |
| vice-versa. A single atomic action also minimizes the volume of extra |
| WAL required for a posting list split, since we don't have to explicitly |
| WAL-log the original posting list tuple. |
| |
| Despite piggy-backing on the same atomic action that inserts a new tuple, |
| posting list splits can be thought of as an extra action, separate from
| the insert itself (or from the page split itself). Posting list splits
| conceptually "rewrite" an insert that overlaps with an existing posting |
| list into an insert that adds its final new item just to the right of the |
| posting list instead. The size of the posting list won't change, and so |
| page space accounting code does not need to care about posting list splits |
| at all. This is an important upside of our design; the page split point |
| choice logic is very subtle even without it needing to deal with posting |
| list splits. |
| |
| Only a few isolated extra steps are required to preserve the illusion that |
| the new item never overlapped with an existing posting list in the first |
| place: the incoming tuple has its heap TID swapped with the
| rightmost/max heap TID from the existing/originally overlapping posting
| list. Similarly, the original incoming item's TID is relocated to the
| appropriate offset in the posting list (we usually shift TIDs out of the |
| way to make a hole for it). Finally, the posting-split-with-page-split |
| case must generate a new high key based on an imaginary version of the |
| original page that has both the final new item and the after-list-split |
| posting tuple (page splits usually just operate against an imaginary |
| version that contains the new item/item that won't fit). |
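| 
| The TID shuffle can be sketched as follows (see _bt_swap_posting() in
| nbtdedup.c for the real implementation; the types here are stand-ins):
| 
|     #include <string.h>
| 
|     typedef unsigned long Tid;  /* stand-in for ItemPointerData */
| 
|     /*
|      * Insert newtid into the posting list at postingoff, shifting later
|      * TIDs right.  The displaced rightmost/max TID is returned; the
|      * caller gives it to the final new item, which is then inserted
|      * just to the right of the posting list as usual.
|      */
|     static Tid
|     posting_list_split(Tid *posting, int nposting, int postingoff,
|                        Tid newtid)
|     {
|         Tid     maxtid = posting[nposting - 1];
| 
|         memmove(&posting[postingoff + 1], &posting[postingoff],
|                 (nposting - 1 - postingoff) * sizeof(Tid));
|         posting[postingoff] = newtid;
|         return maxtid;
|     }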
| |
| This approach avoids inventing an "eager" atomic posting split operation |
| that splits the posting list without simultaneously finishing the insert |
| of the incoming item. This alternative design might seem cleaner, but it |
| creates subtle problems for page space accounting. In general, there |
| might not be enough free space on the page to split a posting list such |
| that the incoming/new item no longer overlaps with either posting list |
| half --- the operation could fail before the actual retail insert of the |
| new item even begins. We'd end up having to handle posting list splits |
| that need a page split anyway. Besides, supporting variable "split points" |
| while splitting posting lists won't actually improve overall space |
| utilization. |
| |
| Notes About Data Representation |
| ------------------------------- |
| |
| The right-sibling link required by L&Y is kept in the page "opaque |
| data" area, as is the left-sibling link, the page level, and some flags. |
| The page level counts upwards from zero at the leaf level, to the tree |
| depth minus 1 at the root. (Counting up from the leaves ensures that we |
| don't need to renumber any existing pages when splitting the root.) |
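| 
| Concretely, the opaque data looks approximately like this (paraphrased
| from src/include/access/nbtree.h; consult the header for the
| authoritative definition):
| 
|     typedef struct BTPageOpaqueData
|     {
|         BlockNumber btpo_prev;    /* left sibling, or P_NONE if leftmost */
|         BlockNumber btpo_next;    /* right sibling, or P_NONE if rightmost */
|         uint32      btpo_level;   /* tree level --- zero for leaf pages */
|         uint16      btpo_flags;   /* flag bits */
|         BTCycleId   btpo_cycleid; /* vacuum cycle ID of latest split */
|     } BTPageOpaqueData;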
| |
| The Postgres disk block data format (an array of items) doesn't fit |
| Lehman and Yao's alternating-keys-and-pointers notion of a disk page, |
| so we have to play some games. (The alternating-keys-and-pointers |
| notion is important for internal page splits, which conceptually split |
| at the middle of an existing pivot tuple -- the tuple's "separator" key |
| goes on the left side of the split as the left side's new high key, |
| while the tuple's pointer/downlink goes on the right side as the |
| first/minus infinity downlink.) |
| |
| On a page that is not rightmost in its tree level, the "high key" is |
| kept in the page's first item, and real data items start at item 2. |
| The link portion of the "high key" item goes unused. A page that is |
| rightmost has no "high key" (it's implicitly positive infinity), so |
| data items start with the first item. Putting the high key at the |
| left, rather than the right, may seem odd, but it avoids moving the |
| high key as we add data items. |
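| 
| In code form, this layout rule is captured by a few macros (again
| paraphrased from src/include/access/nbtree.h):
| 
|     #define P_HIKEY                ((OffsetNumber) 1)
|     #define P_FIRSTKEY             ((OffsetNumber) 2)
|     #define P_FIRSTDATAKEY(opaque) \
|         (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)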
| |
| On a leaf page, the data items are simply links to (TIDs of) tuples |
| in the relation being indexed, with the associated key values. |
| |
| On a non-leaf page, the data items are down-links to child pages with |
| bounding keys. The key in each data item is a strict lower bound for |
| keys on that child page, so logically the key is to the left of that |
| downlink. The high key (if present) is the upper bound for the last |
| downlink. The first data item on each such page has no lower bound |
| --- or lower bound of minus infinity, if you prefer. The comparison |
| routines must treat it accordingly. The actual key stored in the |
| item is irrelevant, and need not be stored at all. This arrangement |
| corresponds to the fact that an L&Y non-leaf page has one more pointer |
| than key. Suffix truncation's negative infinity attributes behave in |
| the same way. |
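| 
| The special treatment of the first data item amounts to a short-circuit
| in the comparison routine, roughly as follows (a sketch of the rule
| applied by _bt_compare() in nbtsearch.c):
| 
|     #include <stdbool.h>
| 
|     /*
|      * Sketch: the first data item on an internal page is minus
|      * infinity, so any scan key compares greater than it, regardless of
|      * what bytes the item physically contains.
|      */
|     static int
|     compare_at_offset(bool is_leaf, int offnum, int firstdatakey,
|                       int real_keycmp_result)
|     {
|         if (!is_leaf && offnum == firstdatakey)
|             return 1;                   /* scankey > minus infinity */
|         return real_keycmp_result;      /* otherwise, real comparison */
|     }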