| src/backend/access/spgist/README |
| |
| SP-GiST is an abbreviation of space-partitioned GiST. It provides a |
| generalized infrastructure for implementing space-partitioned data |
| structures, such as quadtrees, k-d trees, and radix trees (tries). When |
| implemented in main memory, these structures are usually designed as a set of |
| dynamically-allocated nodes linked by pointers. This is not suitable for |
| storing directly on disk, since the chains of pointers can be rather long and |
| require too many disk accesses. In contrast, disk-based data structures |
| should have a high fanout to minimize I/O. The challenge is to map tree |
| nodes to disk pages in such a way that the search algorithm accesses only a |
| few disk pages, even if it traverses many nodes. |
| |
| |
| COMMON STRUCTURE DESCRIPTION |
| |
| Logically, an SP-GiST tree is a set of tuples, each of which can be either |
| an inner or leaf tuple. Each inner tuple contains "nodes", which are |
| (label,pointer) pairs, where the pointer (ItemPointerData) is a pointer to |
| another inner tuple or to the head of a list of leaf tuples. Inner tuples |
| can have different numbers of nodes (children). Branches can be of different |
| depths (there is no control or code to support balancing), which means that |
| the tree is unbalanced. However, leaf and inner tuples cannot |
| be intermixed at the same level: a downlink from a node of an inner tuple |
| leads either to one inner tuple, or to a list of leaf tuples. |
| |
| The SP-GiST core requires that inner and leaf tuples fit on a single index |
| page, and even more stringently that the list of leaf tuples reached from a |
| single inner-tuple node all be stored on the same index page. (Restricting |
| such lists to not cross pages reduces seeks, and allows the list links to be |
| stored as simple 2-byte OffsetNumbers.) SP-GiST index opclasses should |
| therefore ensure that not too many nodes can be needed in one inner tuple, |
| and that inner-tuple prefixes and leaf-node datum values not be too large. |
| |
| Inner and leaf tuples are stored separately: the former are stored only on |
| "inner" pages, the latter only on "leaf" pages. Also, there are special |
| restrictions on the root page. Early in an index's life, when there is only |
| one page's worth of data, the root page contains an unorganized set of leaf |
| tuples. After the first page split has occurred, the root is required to |
| contain exactly one inner tuple. |
| |
| When the search traversal algorithm reaches an inner tuple, it chooses a set |
| of nodes through which to continue the traversal. If it reaches a leaf page |
| it scans the list of leaf tuples to find the ones that match the query. |
| SP-GiST also supports ordered (nearest-neighbor) searches: during such a |
| scan, pending nodes are kept in a priority queue ordered by distance to the |
| query, so that traversal proceeds in closest-first order. |
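| |
| As a minimal illustration of the closest-first idea, here is a hedged |
| sketch using an in-memory stand-in for index pages and a toy fixed-size |
| binary heap; names such as TreeNode, PQ, pq_push, pq_pop, and |
| closest_first_scan are hypothetical, not actual SP-GiST code: |
| |
|     #include <stdio.h> |
| |
|     /* toy stand-in for an inner tuple or leaf list reached via a node */ |
|     typedef struct TreeNode |
|     { |
|         double      dist;       /* lower-bound distance to the query */ |
|         int         nchildren; |
|         struct TreeNode **children; |
|     } TreeNode; |
| |
|     /* toy binary heap keyed on dist; closest node is items[0]; |
|        overflow checking is omitted for brevity */ |
|     typedef struct PQ { TreeNode *items[256]; int n; } PQ; |
| |
|     static void |
|     pq_push(PQ *q, TreeNode *t) |
|     { |
|         int         i = q->n++; |
| |
|         /* sift up until the heap property is restored */ |
|         while (i > 0 && q->items[(i - 1) / 2]->dist > t->dist) |
|         { |
|             q->items[i] = q->items[(i - 1) / 2]; |
|             i = (i - 1) / 2; |
|         } |
|         q->items[i] = t; |
|     } |
| |
|     static TreeNode * |
|     pq_pop(PQ *q) |
|     { |
|         TreeNode   *top = q->items[0]; |
|         TreeNode   *last = q->items[--q->n]; |
|         int         i = 0, c; |
| |
|         /* sift down the displaced last element */ |
|         while ((c = 2 * i + 1) < q->n) |
|         { |
|             if (c + 1 < q->n && q->items[c + 1]->dist < q->items[c]->dist) |
|                 c++; |
|             if (q->items[c]->dist >= last->dist) |
|                 break; |
|             q->items[i] = q->items[c]; |
|             i = c; |
|         } |
|         q->items[i] = last; |
|         return top; |
|     } |
| |
|     /* visit all nodes in closest-first order, as an ordered scan would */ |
|     static void |
|     closest_first_scan(TreeNode *root) |
|     { |
|         PQ          q = {{0}, 0}; |
| |
|         pq_push(&q, root); |
|         while (q.n > 0) |
|         { |
|             TreeNode   *t = pq_pop(&q); |
| |
|             printf("visit node at distance %g\n", t->dist); |
|             for (int i = 0; i < t->nchildren; i++) |
|                 pq_push(&q, t->children[i]); |
|         } |
|     } |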
| |
| |
| The insertion algorithm descends the tree similarly, except it must choose |
| just one node to descend to from each inner tuple. Insertion might also have |
| to modify the inner tuple before it can descend: it could add a new node, or |
| it could "split" the tuple to obtain a less-specific prefix that can match |
| the value to be inserted. If it's necessary to append a new leaf tuple to a |
| list and there is no free space on the page, then SP-GiST creates a new |
| inner tuple and distributes the leaf tuples into a set of lists, on perhaps |
| several pages. |
| |
| An inner tuple consists of: |
| |
| optional prefix value - all successors must be consistent with it. |
| Example: |
| radix tree - prefix value is a common prefix string |
| quad tree - centroid |
| k-d tree - one coordinate |
| |
| list of nodes, where node is a (label, pointer) pair. |
| Example of a label: a single character for radix tree |
| |
| A leaf tuple consists of: |
| |
| a leaf value |
| Example: |
| radix tree - the rest of string (postfix) |
| quad and k-d tree - the point itself |
| |
| ItemPointer to the corresponding heap tuple |
| nextOffset number of next leaf tuple in a chain on a leaf page |
| |
| optional nulls bitmask |
| optional INCLUDE-column values |
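| |
| As a conceptual C sketch of the logical structure above (illustrative |
| declarations only; the actual packed on-disk forms are declared in |
| spgist_private.h, and the optional nulls bitmask and INCLUDE values are |
| omitted here): |
| |
|     #include <stdint.h> |
| |
|     /* block number plus offset number, like the real ItemPointerData */ |
|     typedef struct ItemPointerSketch |
|     { |
|         uint32_t    blockNumber; |
|         uint16_t    offsetNumber; |
|     } ItemPointerSketch; |
| |
|     /* one (label, pointer) node of an inner tuple */ |
|     typedef struct NodeSketch |
|     { |
|         void       *label;          /* e.g. one character, for a radix tree */ |
|         ItemPointerSketch downlink; /* inner tuple, or head of a leaf list */ |
|     } NodeSketch; |
| |
|     typedef struct InnerTupleSketch |
|     { |
|         void       *prefix;         /* optional; NULL if none */ |
|         int         nNodes; |
|         NodeSketch *nodes; |
|     } InnerTupleSketch; |
| |
|     typedef struct LeafTupleSketch |
|     { |
|         void       *leafValue;      /* e.g. the rest of the string */ |
|         ItemPointerSketch heapPtr;  /* the indexed heap tuple */ |
|         uint16_t    nextOffset;     /* next leaf tuple in the chain, or 0 */ |
|     } LeafTupleSketch; |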
| |
| For compatibility with pre-v14 indexes, a leaf tuple has a nulls bitmask |
| only if there are null values (among the leaf value and the INCLUDE values) |
| *and* there is at least one INCLUDE column. The null-ness of the leaf |
| value can be inferred from whether the tuple is on a "nulls page" (see below) |
| so it is not necessary to represent it explicitly. But we include it anyway |
| in a bitmask used with INCLUDE values, so that standard tuple deconstruction |
| code can be used. |
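| |
| That rule can be restated as a one-line predicate (an illustrative sketch; |
| the argument names are hypothetical, not the actual fields): |
| |
|     #include <stdbool.h> |
| |
|     /* |
|      * A pre-v14-compatible leaf tuple carries a nulls bitmask only when |
|      * both conditions described above hold. |
|      */ |
|     static bool |
|     leaf_has_nulls_bitmask(bool anyValueIsNull, int nIncludeColumns) |
|     { |
|         return anyValueIsNull && nIncludeColumns > 0; |
|     } |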
| |
| |
| NULLS HANDLING |
| |
| We assume that SPGiST-indexable operators are strict (can never succeed for |
| null inputs). It is still desirable to index nulls, so that whole-table |
| indexscans are possible and so that "x IS NULL" can be implemented by an |
| SPGiST indexscan. However, we prefer that SPGiST index opclasses not have |
| to cope with nulls. Therefore, the main tree of an SPGiST index does not |
| include any null entries. We store null entries in a separate SPGiST tree |
| occupying a disjoint set of pages (in particular, its own root page). |
| Insertions and searches in the nulls tree do not use any of the |
| opclass-supplied functions, but just use hardwired logic comparable to |
| AllTheSame cases in the normal tree. |
| |
| |
| INSERTION ALGORITHM |
| |
| The insertion algorithm is designed to keep the tree in a consistent state |
| at all times. Here is a simplified specification of the insertion algorithm |
| (numbers refer to notes below): |
| |
| Start with the first tuple on the root page (1) |
| |
| loop: |
|     if (page is leaf) then |
|         if (enough space) |
|             insert on page and exit (5) |
|         else (7) |
|             call PickSplitFn() (2) |
|         end if |
|     else |
|         switch (chooseFn()) |
|             case MatchNode  - descend through selected node |
|             case AddNode    - add node and then retry chooseFn (3, 6) |
|             case SplitTuple - split inner tuple to prefix and postfix, then |
|                               retry chooseFn with the prefix tuple (4, 6) |
|     end if |
| |
| Notes: |
| |
| (1) Initially, we just dump leaf tuples into the root page until it is full; |
| then we split it. Once the root is not a leaf page, it can have only one |
| inner tuple, so as to keep the amount of free space on the root as large as |
| possible. Both of these rules are meant to postpone doing PickSplit on the |
| root for as long as possible, so that the topmost partitioning of the search |
| space is as good as we can easily make it. |
| |
| (2) The current implementation can do a picksplit and insert the new leaf |
| tuple in one operation, if the new list of leaf tuples fits on one page. |
| This is always possible for trees whose inner tuples have few nodes, such as |
| quad trees or k-d trees, but radix trees may require another picksplit. |
| |
| (3) Adding a node must keep the inner tuple small enough to fit on a page. |
| Even so, after the addition the inner tuple could become too large to be |
| stored on its current page, because of the other tuples on that page. In |
| this case it will be moved to another inner page (see notes about page |
| management). When moving a tuple to another page, we can't change the offset |
| numbers of the other tuples on the old page, else we'd make the downlink |
| pointers to them invalid. To prevent that, SP-GiST leaves a "placeholder" |
| tuple, which can be reused later whenever another tuple is added to the |
| page. See also the Concurrency and Vacuum sections below. Right now only |
| radix trees can add a node to an existing tuple; quad trees and k-d trees |
| create all possible nodes at once in the PickSplitFn() call. |
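| |
| A toy illustration of why the placeholder matters (hypothetical code, not |
| the real page machinery; offsets are array indexes and NULL plays the |
| placeholder role): |
| |
|     #include <stddef.h> |
| |
|     /* |
|      * Overwriting the victim's slot in place keeps every surviving |
|      * tuple's offset stable; compacting the array instead would renumber |
|      * later tuples and invalidate downlinks that point at them. |
|      */ |
|     static void |
|     move_tuple_away(const char *page[], size_t victim) |
|     { |
|         page[victim] = NULL;    /* leave a placeholder, never compact */ |
|     } |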
| |
| (4) The prefix value may only partially match a new value, so the SplitTuple |
| action allows the current tree branch to be broken into upper and lower |
| sections. Another way to say it is that we can split the current inner tuple |
| into "prefix" and "postfix" parts, where the prefix part is able to match |
| the incoming new value. Consider an example of insertion into a radix tree. |
| We use the following notation, where the tuple's id is just for discussion |
| (no such id is actually stored): |
| |
| inner tuple: {tuple id}(prefix string)[ comma separated list of node labels ] |
| leaf tuple: {tuple id}<value> |
| |
| Suppose we need to insert string 'www.gogo.com' into inner tuple |
| |
| {1}(www.google.com/)[a, i] |
| |
| The string does not match the prefix so we cannot descend. We must |
| split the inner tuple into two tuples: |
| |
| {2}(www.go)[o] - prefix tuple |
| | |
| {3}(gle.com/)[a,i] - postfix tuple |
| |
| On the next iteration of the loop we find that 'www.gogo.com' matches the |
| prefix, but not any node label, so we add a node [g] to tuple {2}: |
| |
| NIL (no child exists yet) |
| | |
| {2}(www.go)[o, g] |
| | |
| {3}(gle.com/)[a,i] |
| |
| Now we can descend through the [g] node, which will cause us to update |
| the target string to just 'o.com'. Finally, we'll insert a leaf tuple |
| bearing that string: |
| |
| {4}<o.com> |
| | |
| {2}(www.go)[o, g] |
| | |
| {3}(gle.com/)[a,i] |
| |
| As we can see, the original tuple's node array moves to the postfix tuple |
| without any changes. Note also that the SP-GiST core assumes that the prefix |
| tuple is not larger than the old inner tuple. That allows us to store the |
| prefix tuple directly in place of the old inner tuple. The SP-GiST core will |
| try to store the postfix tuple on the same page if possible, but will use |
| another page if there is not enough free space (see notes 5 and 6). |
| Currently, quad and k-d trees don't use this feature, because they have no |
| concept of a prefix being "inconsistent" with any new value. They grow their |
| depth only via PickSplitFn() calls. |
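| |
| The split point in the example above is just the length of the longest |
| common prefix; a hedged sketch of that computation (illustrative only; the |
| real radix-tree opclass code in spgtextproc.c handles many more cases): |
| |
|     #include <stdio.h> |
| |
|     /* length of the longest common leading byte sequence */ |
|     static int |
|     common_prefix_len(const char *prefix, const char *value) |
|     { |
|         int         i = 0; |
| |
|         while (prefix[i] != '\0' && prefix[i] == value[i]) |
|             i++; |
|         return i; |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         const char *prefix = "www.google.com/"; |
|         const char *value = "www.gogo.com"; |
|         int         n = common_prefix_len(prefix, value); |
| |
|         /* assumes the mismatch falls inside the prefix, as it does here; |
|            the byte after the split becomes the new node's label */ |
|         printf("prefix \"%.*s\", node label '%c', postfix \"%s\"\n", |
|                n, prefix, prefix[n], prefix + n + 1); |
|         return 0;       /* prints: prefix "www.go", node label 'o', |
|                            postfix "gle.com/" */ |
|     } |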
| |
| (5) If the pointer from the parent node is a NIL pointer, the algorithm |
| chooses a leaf page to store the new leaf tuple on. At first, it tries to |
| use the last-used leaf page with the largest free space (which we track in |
| each backend) to better utilize disk space. If that's not large enough, then |
| the algorithm allocates a new page. |
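| |
| A hedged sketch of that page-selection policy (hypothetical names; the real |
| code tracks several last-used pages, as described under LAST USED PAGE |
| MANAGEMENT below): |
| |
|     typedef unsigned int BlockNumber; |
|     #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF) |
| |
|     typedef struct LastUsedPage |
|     { |
|         BlockNumber blkno;          /* cached page, or InvalidBlockNumber */ |
|         int         freeSpace;      /* free bytes when last seen */ |
|     } LastUsedPage; |
| |
|     /* stand-in for extending the index with a fresh page */ |
|     static BlockNumber |
|     allocate_new_leaf_page(void) |
|     { |
|         static BlockNumber next = 1; |
| |
|         return next++; |
|     } |
| |
|     static BlockNumber |
|     choose_leaf_page(LastUsedPage *cache, int neededSpace) |
|     { |
|         /* reuse the cached page if it is valid and has enough room ... */ |
|         if (cache->blkno != InvalidBlockNumber && |
|             cache->freeSpace >= neededSpace) |
|             return cache->blkno; |
| |
|         /* ... otherwise fall back to allocating a new page */ |
|         cache->blkno = allocate_new_leaf_page(); |
|         return cache->blkno; |
|     } |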
| |
| (6) Management of inner pages is very similar to management of leaf pages, |
| described in (5). |
| |
| (7) Actually, the current implementation can move the whole list of leaf |
| tuples, together with the new tuple, to another page if the list is short |
| enough. This improves space utilization, but doesn't change the basis of |
| the algorithm. |
| |
| |
| CONCURRENCY |
| |
| While descending the tree, the insertion algorithm holds exclusive locks on |
| two tree levels at a time, i.e., both the parent and child pages (but the |
| parent and child can be the same page; see notes above). There is a risk of |
| deadlock between two insertions if there are cross-referenced pages in |
| different branches. That is, if inner tuple on page M has a child on page N |
| while an inner tuple from another branch is on page N and has a child on |
| page M, then two insertions descending the two branches could deadlock, |
| since they will each hold their parent page's lock while trying to get the |
| child page's lock. |
| |
| Currently, we deal with this by conditionally locking buffers as we descend |
| the tree. If we fail to get lock on a buffer, we release both buffers and |
| restart the insertion process. This is potentially inefficient, but the |
| locking costs of a more deterministic approach seem very high. |
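| |
| The shape of that retry loop, sketched with POSIX trylock as a stand-in for |
| the conditional buffer-lock calls the real code uses: |
| |
|     #include <pthread.h> |
|     #include <stdbool.h> |
| |
|     /* |
|      * Lock the parent unconditionally, then try the child without |
|      * blocking.  On failure, release the parent and report that the |
|      * caller must restart the whole descent rather than wait (and |
|      * possibly deadlock). |
|      */ |
|     static bool |
|     lock_parent_and_child(pthread_mutex_t *parent, pthread_mutex_t *child) |
|     { |
|         pthread_mutex_lock(parent); |
|         if (pthread_mutex_trylock(child) == 0) |
|             return true;            /* hold both locks; keep descending */ |
|         pthread_mutex_unlock(parent); |
|         return false;               /* caller restarts the insertion */ |
|     } |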
| |
| To reduce the number of cases where that happens, we introduce a concept of |
| "triple parity" of pages: if inner tuple is on page with BlockNumber N, then |
| its child tuples should be placed on the same page, or else on a page with |
| BlockNumber M where (N+1) mod 3 == M mod 3. This rule ensures that tuples |
| on page M will have no children on page N, since (M+1) mod 3 != N mod 3. |
| That makes it unlikely that two insertion processes will conflict against |
| each other while descending the tree. It's not perfect though: in the first |
| place, we could still get a deadlock among three or more insertion processes, |
| and in the second place, it's impractical to preserve this invariant in every |
| case when we expand or split an inner tuple. So we still have to allow for |
| deadlocks. |
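| |
| The parity rule itself is cheap to check; an illustrative helper (not the |
| actual source code): |
| |
|     #include <stdbool.h> |
| |
|     typedef unsigned int BlockNumber; |
| |
|     /* |
|      * True if placing a child on page "child" satisfies triple parity |
|      * for a parent inner tuple on page "parent": same page, or |
|      * (parent + 1) mod 3 == child mod 3. |
|      */ |
|     static bool |
|     parity_ok(BlockNumber parent, BlockNumber child) |
|     { |
|         return parent == child || (parent + 1) % 3 == child % 3; |
|     } |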
| |
| Insertion may also need to take locks on an additional inner and/or leaf page |
| to add tuples of the right type(s), when there's not enough room on the pages |
| it descended through. However, we don't care exactly which such page we add |
| to, so deadlocks can be avoided by conditionally locking the additional |
| buffers: if we fail to get lock on an additional page, just try another one. |
| |
| The search traversal algorithm is rather traditional. At each non-leaf |
| level, it share-locks the page, identifies which node(s) in the current |
| inner tuple need to be visited, and puts those addresses on a stack of |
| pages to examine later. It then releases the lock on the current buffer |
| before visiting the next |
| stack item. So only one page is locked at a time, and no deadlock is |
| possible. But instead, we have to worry about race conditions: by the time |
| we arrive at a pointed-to page, a concurrent insertion could have replaced |
| the target inner tuple (or leaf tuple chain) with data placed elsewhere. |
| To handle that, whenever the insertion algorithm changes a nonempty downlink |
| in an inner tuple, it places a "redirect tuple" in place of the lower-level |
| inner tuple or leaf-tuple chain head that the link formerly led to. Scans |
| (though not insertions) must be prepared to honor such redirects. Only a |
| scan that had already visited the parent level could possibly reach such a |
| redirect tuple, so we can remove redirects once all active transactions have |
| been flushed out of the system. |
| |
| |
| DEAD TUPLES |
| |
| Tuples on leaf pages can be in one of four states: |
| |
| SPGIST_LIVE: normal, live pointer to a heap tuple. |
| |
| SPGIST_REDIRECT: placeholder that contains a link to another place in the |
| index. When a chain of leaf tuples has to be moved to another page, a |
| redirect tuple is inserted in place of the chain's head tuple. The parent |
| inner tuple's downlink is updated when this happens, but concurrent scans |
| might be "in flight" from the parent page to the child page (since they |
| release lock on the parent page before attempting to lock the child). |
| The redirect pointer serves to tell such a scan where to go. A redirect |
| pointer is only needed for as long as such concurrent scans could be in |
| progress. Eventually, it's converted into a PLACEHOLDER dead tuple by |
| VACUUM, and is then a candidate for replacement. Searches that find such |
| a tuple (which should never be part of a chain) should immediately proceed |
| to the other place, forgetting about the redirect tuple. Insertions that |
| reach such a tuple should raise an error, since a valid downlink should |
| never point to such a tuple. |
| |
| SPGIST_DEAD: tuple is dead, but it cannot be removed or moved to a |
| different offset on the page because there is a link leading to it from |
| some inner tuple elsewhere in the index. (Such a tuple is never part of a |
| chain, since we don't need one unless there is nothing live left in its |
| chain.) Searches should ignore such entries. If an insertion action |
| arrives at such a tuple, it should either replace it in-place (if there's |
| room on the page to hold the desired new leaf tuple) or replace it with a |
| redirection pointer to wherever it puts the new leaf tuple. |
| |
| SPGIST_PLACEHOLDER: tuple is dead, and there are known to be no links to |
| it from elsewhere. When a live tuple is deleted or moved away, and not |
| replaced by a redirect pointer, it is replaced by a placeholder to keep |
| the offsets of later tuples on the same page from changing. Placeholders |
| can be freely replaced when adding a new tuple to the page, and also |
| VACUUM will delete any that are at the end of the range of valid tuple |
| offsets. Both searches and insertions should complain if a link from |
| elsewhere leads them to a placeholder tuple. |
| |
| When the root page is also a leaf, all its tuples should be in LIVE state; |
| there's no need for the others since there are no links and no need to |
| preserve offset numbers. |
| |
| Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states. |
| The REDIRECT state has the same function as on leaf pages, to send |
| concurrent searches to the place where they need to go after an inner |
| tuple is moved to another page. Expired REDIRECT pointers are converted |
| to PLACEHOLDER status by VACUUM, and are then candidates for replacement. |
| DEAD state is not currently possible, since VACUUM does not attempt to |
| remove unused inner tuples. |
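| |
| The states above can be summarized as a hypothetical enum (the actual |
| definitions, and their numeric values, live in spgist_private.h): |
| |
|     typedef enum TupleStateSketch |
|     { |
|         STATE_LIVE,         /* normal tuple; leaf and inner pages */ |
|         STATE_REDIRECT,     /* moved away; follow the link; leaf and inner */ |
|         STATE_DEAD,         /* dead but still linked-to; leaf pages only */ |
|         STATE_PLACEHOLDER   /* dead, nothing links to it; reusable slot */ |
|     } TupleStateSketch; |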
| |
| |
| VACUUM |
| |
| VACUUM (or more precisely, spgbulkdelete) performs a single sequential scan |
| over the entire index. On both leaf and inner pages, we can convert old |
| REDIRECT tuples into PLACEHOLDER status, and then remove any PLACEHOLDERs |
| that are at the end of the page (since they aren't needed to preserve the |
| offsets of any live tuples). On leaf pages, we scan for tuples that need |
| to be deleted because their heap TIDs match a vacuum target TID. |
| |
| If we find a deletable tuple that is not at the head of its chain, we |
| can simply replace it with a PLACEHOLDER, updating the chain links to |
| remove it from the chain. If it is at the head of its chain, but there's |
| at least one live tuple remaining in the chain, we move that live tuple |
| to the head tuple's offset, leaving a PLACEHOLDER at its old offset to |
| preserve the offsets of other tuples. This keeps the parent inner tuple's |
| downlink |
| valid. If we find ourselves deleting all live tuples in a chain, we |
| replace the head tuple with a DEAD tuple and the rest with PLACEHOLDERS. |
| The parent inner tuple's downlink thus points to the DEAD tuple, and the |
| rules explained in the previous section keep everything working. |
| |
| VACUUM doesn't know a priori which tuples are heads of their chains, but |
| it can easily figure that out by constructing a predecessor array that's |
| the reverse map of the nextOffset links (ie, when we see tuple x links to |
| tuple y, we set predecessor[y] = x). Then head tuples are the ones with |
| no predecessor. |
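| |
| A hedged sketch of that reverse-mapping trick (illustrative code; offsets |
| are 1-based like OffsetNumbers, with 0 meaning "no next tuple"): |
| |
|     #include <stdio.h> |
| |
|     #define MAX_OFFSET 8 |
| |
|     int |
|     main(void) |
|     { |
|         /* nextOffset[i] = successor of tuple i, or 0; here 1->3->5->2 */ |
|         int         nextOffset[MAX_OFFSET + 1] = {0, 3, 0, 5, 0, 2}; |
|         int         predecessor[MAX_OFFSET + 1] = {0}; |
| |
|         for (int x = 1; x <= MAX_OFFSET; x++) |
|             if (nextOffset[x] != 0) |
|                 predecessor[nextOffset[x]] = x; |
| |
|         /* head tuples are exactly those with no predecessor |
|            (empty offsets also print; real code checks occupancy) */ |
|         for (int y = 1; y <= MAX_OFFSET; y++) |
|             if (predecessor[y] == 0) |
|                 printf("offset %d has no predecessor\n", y); |
|         return 0; |
|     } |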
| |
| Because insertions can occur while VACUUM runs, a pure sequential scan |
| could miss deleting some target leaf tuples, because they could get moved |
| from a not-yet-visited leaf page to an already-visited leaf page as a |
| consequence of a PickSplit or MoveLeafs operation. Failing to delete any |
| target TID is not acceptable, so we have to extend the algorithm to cope |
| with such cases. We recognize that such a move might have occurred when |
| we see a leaf-page REDIRECT tuple whose XID indicates it might have been |
| created after the VACUUM scan started. We add the redirection target TID |
| to a "pending list" of places we need to recheck. Between pages of the |
| main sequential scan, we empty the pending list by visiting each listed |
| TID. If it points to an inner tuple (from a PickSplit), add each downlink |
| TID to the pending list. If it points to a leaf page, vacuum that page. |
| (We could just vacuum the single pointed-to chain, but vacuuming the |
| whole page simplifies the code and reduces the odds of VACUUM having to |
| modify the same page multiple times.) To ensure that pending-list |
| processing can never get into an endless loop, even in the face of |
| concurrent index changes, we don't remove list entries immediately but |
| only after we've completed all pending-list processing; instead we just |
| mark items as done after processing them. Adding a TID that's already in |
| the list is a no-op, whether or not that item is marked done yet. |
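| |
| A minimal sketch of that pending-list discipline (hypothetical types and a |
| fixed-size array; the essential points are that duplicates are rejected at |
| insertion and that entries are marked done rather than removed): |
| |
|     #include <stdbool.h> |
| |
|     typedef struct PendingItem |
|     { |
|         unsigned    blkno;          /* stand-in for an ItemPointer */ |
|         unsigned    offset; |
|         bool        done; |
|     } PendingItem; |
| |
|     #define MAX_PENDING 1024        /* overflow handling omitted */ |
|     static PendingItem pending[MAX_PENDING]; |
|     static int  nPending = 0; |
| |
|     /* adding a TID that's already listed is a no-op, done or not */ |
|     static void |
|     pending_add(unsigned blkno, unsigned offset) |
|     { |
|         for (int i = 0; i < nPending; i++) |
|             if (pending[i].blkno == blkno && pending[i].offset == offset) |
|                 return; |
|         pending[nPending++] = (PendingItem) {blkno, offset, false}; |
|     } |
| |
|     /* visit() may call pending_add(), growing the list as we scan it; |
|        since entries are never removed mid-scan, this cannot loop forever */ |
|     static void |
|     pending_process(void (*visit) (unsigned blkno, unsigned offset)) |
|     { |
|         for (int i = 0; i < nPending; i++) |
|         { |
|             if (pending[i].done) |
|                 continue; |
|             visit(pending[i].blkno, pending[i].offset); |
|             pending[i].done = true; |
|         } |
|         nPending = 0;   /* processing complete; safe to empty the list */ |
|     } |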
| |
| spgbulkdelete also updates the index's free space map. |
| |
| Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was |
| performed; otherwise, it does an spgbulkdelete scan with an empty target |
| list, so as to clean up redirections and placeholders, update the free |
| space map, and gather statistics. |
| |
| |
| LAST USED PAGE MANAGEMENT |
| |
| The list of last used pages contains four pages - a leaf page and three |
| inner pages, one from each "triple parity" group. (Actually, there's one |
| such list for the main tree and a separate one for the nulls tree.) This |
| list is stored between calls on the index meta page, but updates are never |
| WAL-logged, to decrease WAL traffic. Incorrect data on the meta page isn't |
| critical, because we can simply allocate a new page at any moment. |
| |
| |
| AUTHORS |
| |
| Teodor Sigaev <teodor@sigaev.ru> |
| Oleg Bartunov <oleg@sai.msu.su> |