| src/backend/access/spgist/README |
| |
| SP-GiST is an abbreviation of space-partitioned GiST. It provides a |
| generalized infrastructure for implementing space-partitioned data |
| structures, such as quadtrees, k-d trees, and radix trees (tries). When |
| implemented in main memory, these structures are usually designed as a set of |
| dynamically-allocated nodes linked by pointers. This is not suitable for |
| storing directly on disk, since the chains of pointers can be rather long and |
| require too many disk accesses. In contrast, disk-based data structures |
| should have a high fanout to minimize I/O. The challenge is to map tree |
| nodes to disk pages in such a way that the search algorithm accesses only a |
| few disk pages, even if it traverses many nodes. |
| |
| |
| COMMON STRUCTURE DESCRIPTION |
| |
| Logically, an SP-GiST tree is a set of tuples, each of which can be either |
| an inner or leaf tuple. Each inner tuple contains "nodes", which are |
| (label,pointer) pairs, where the pointer (ItemPointerData) is a pointer to |
| another inner tuple or to the head of a list of leaf tuples. Inner tuples |
| can have different numbers of nodes (children). Branches can be of different |
| depths (there is no control or code to support balancing), which means that |
| the tree is unbalanced. However, leaf and inner tuples cannot |
| be intermixed at the same level: a downlink from a node of an inner tuple |
| leads either to one inner tuple, or to a list of leaf tuples. |
| |
| The SP-GiST core requires that inner and leaf tuples fit on a single index |
| page, and even more stringently that the list of leaf tuples reached from a |
| single inner-tuple node all be stored on the same index page. (Restricting |
| such lists to not cross pages reduces seeks, and allows the list links to be |
| stored as simple 2-byte OffsetNumbers.) SP-GiST index opclasses should |
| therefore ensure that not too many nodes can be needed in one inner tuple, |
| and that inner-tuple prefixes and leaf-node datum values not be too large. |
| |
| Inner and leaf tuples are stored separately: the former are stored only on |
| "inner" pages, the latter only on "leaf" pages. Also, there are special |
| restrictions on the root page. Early in an index's life, when there is only |
| one page's worth of data, the root page contains an unorganized set of leaf |
| tuples. After the first page split has occurred, the root is required to |
| contain exactly one inner tuple. |
| |
| When the search traversal algorithm reaches an inner tuple, it chooses a set |
| of nodes through which to continue the traversal. If it reaches a leaf page |
| it scans the list of leaf tuples to find the ones that match the query. |
| SP-GiST also supports ordered (nearest-neighbor) searches: during such a |
| scan, pending nodes are kept in a priority queue ordered by distance to the |
| query, so that traversal proceeds in closest-first order. |
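| |
| As a minimal illustration of the closest-first idea, here is a hedged |
| sketch using an in-memory stand-in for index pages and a toy fixed-size |
| binary heap; names such as TreeNode, PQ, pq_push, pq_pop, and |
| closest_first_scan are hypothetical, not actual SP-GiST code: |
| |
|     #include <stdio.h> |
| |
|     /* toy stand-in for an inner tuple or leaf list reached via a node */ |
|     typedef struct TreeNode |
|     { |
|         double      dist;       /* lower-bound distance to the query */ |
|         int         nchildren; |
|         struct TreeNode **children; |
|     } TreeNode; |
| |
|     /* toy binary heap keyed on dist; closest node is items[0]; |
|        overflow checking is omitted for brevity */ |
|     typedef struct PQ { TreeNode *items[256]; int n; } PQ; |
| |
|     static void |
|     pq_push(PQ *q, TreeNode *t) |
|     { |
|         int         i = q->n++; |
| |
|         /* sift up until the heap property is restored */ |
|         while (i > 0 && q->items[(i - 1) / 2]->dist > t->dist) |
|         { |
|             q->items[i] = q->items[(i - 1) / 2]; |
|             i = (i - 1) / 2; |
|         } |
|         q->items[i] = t; |
|     } |
| |
|     static TreeNode * |
|     pq_pop(PQ *q) |
|     { |
|         TreeNode   *top = q->items[0]; |
|         TreeNode   *last = q->items[--q->n]; |
|         int         i = 0, c; |
| |
|         /* sift down the displaced last element */ |
|         while ((c = 2 * i + 1) < q->n) |
|         { |
|             if (c + 1 < q->n && q->items[c + 1]->dist < q->items[c]->dist) |
|                 c++; |
|             if (q->items[c]->dist >= last->dist) |
|                 break; |
|             q->items[i] = q->items[c]; |
|             i = c; |
|         } |
|         q->items[i] = last; |
|         return top; |
|     } |
| |
|     /* visit all nodes in closest-first order, as an ordered scan would */ |
|     static void |
|     closest_first_scan(TreeNode *root) |
|     { |
|         PQ          q = {{0}, 0}; |
| |
|         pq_push(&q, root); |
|         while (q.n > 0) |
|         { |
|             TreeNode   *t = pq_pop(&q); |
| |
|             printf("visit node at distance %g\n", t->dist); |
|             for (int i = 0; i < t->nchildren; i++) |
|                 pq_push(&q, t->children[i]); |
|         } |
|     } |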
| |
| |
| The insertion algorithm descends the tree similarly, except it must choose |
| just one node to descend to from each inner tuple. Insertion might also have |
| to modify the inner tuple before it can descend: it could add a new node, or |
| it could "split" the tuple to obtain a less-specific prefix that can match |
| the value to be inserted. If it's necessary to append a new leaf tuple to a |
| list and there is no free space on the page, then SP-GiST creates a new |
| inner tuple and distributes the leaf tuples into a set of lists, on perhaps |
| several pages. |
| |
| An inner tuple consists of: |
| |
| optional prefix value - all successors must be consistent with it. |
| Example: |
| radix tree - prefix value is a common prefix string |
| quad tree - centroid |
| k-d tree - one coordinate |
| |
| list of nodes, where node is a (label, pointer) pair. |
| Example of a label: a single character for radix tree |
| |
| A leaf tuple consists of: |
| |
| a leaf value |
| Example: |
| radix tree - the rest of string (postfix) |
| quad and k-d tree - the point itself |
| |
| ItemPointer to the corresponding heap tuple |
| nextOffset number of next leaf tuple in a chain on a leaf page |
| |
| optional nulls bitmask |
| optional INCLUDE-column values |
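| |
| As a conceptual C sketch of the logical structure above (illustrative |
| declarations only; the actual packed on-disk forms are declared in |
| spgist_private.h, and the optional nulls bitmask and INCLUDE values are |
| omitted here): |
| |
|     #include <stdint.h> |
| |
|     /* block number plus offset number, like the real ItemPointerData */ |
|     typedef struct ItemPointerSketch |
|     { |
|         uint32_t    blockNumber; |
|         uint16_t    offsetNumber; |
|     } ItemPointerSketch; |
| |
|     /* one (label, pointer) node of an inner tuple */ |
|     typedef struct NodeSketch |
|     { |
|         void       *label;          /* e.g. one character, for a radix tree */ |
|         ItemPointerSketch downlink; /* inner tuple, or head of a leaf list */ |
|     } NodeSketch; |
| |
|     typedef struct InnerTupleSketch |
|     { |
|         void       *prefix;         /* optional; NULL if none */ |
|         int         nNodes; |
|         NodeSketch *nodes; |
|     } InnerTupleSketch; |
| |
|     typedef struct LeafTupleSketch |
|     { |
|         void       *leafValue;      /* e.g. the rest of the string */ |
|         ItemPointerSketch heapPtr;  /* the indexed heap tuple */ |
|         uint16_t    nextOffset;     /* next leaf tuple in the chain, or 0 */ |
|     } LeafTupleSketch; |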
| |
| For compatibility with pre-v14 indexes, a leaf tuple has a nulls bitmask |
| only if there are null values (among the leaf value and the INCLUDE values) |
| *and* there is at least one INCLUDE column. The null-ness of the leaf |
| value can be inferred from whether the tuple is on a "nulls page" (see below) |
| so it is not necessary to represent it explicitly. But we include it anyway |
| in a bitmask used with INCLUDE values, so that standard tuple deconstruction |
| code can be used. |
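| |
| That rule can be restated as a one-line predicate (an illustrative sketch; |
| the argument names are hypothetical, not the actual fields): |
| |
|     #include <stdbool.h> |
| |
|     /* |
|      * A pre-v14-compatible leaf tuple carries a nulls bitmask only when |
|      * both conditions described above hold. |
|      */ |
|     static bool |
|     leaf_has_nulls_bitmask(bool anyValueIsNull, int nIncludeColumns) |
|     { |
|         return anyValueIsNull && nIncludeColumns > 0; |
|     } |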
| |
| |
| NULLS HANDLING |
| |
| We assume that SPGiST-indexable operators are strict (can never succeed for |
| null inputs). It is still desirable to index nulls, so that whole-table |
| indexscans are possible and so that "x IS NULL" can be implemented by an |
| SPGiST indexscan. However, we prefer that SPGiST index opclasses not have |
| to cope with nulls. Therefore, the main tree of an SPGiST index does not |
| include any null entries. We store null entries in a separate SPGiST tree |
| occupying a disjoint set of pages (in particular, its own root page). |
| Insertions and searches in the nulls tree do not use any of the |
| opclass-supplied functions, but just use hardwired logic comparable to |
| AllTheSame cases in the normal tree. |
| |
| |
| INSERTION ALGORITHM |
| |
| The insertion algorithm is designed to keep the tree in a consistent state |
| at all times. Here is a simplified specification of the insertion algorithm |
| (numbers refer to notes below): |
| |
| Start with the first tuple on the root page (1) |
| |
| loop: |
|     if (page is leaf) then |
|         if (enough space) |
|             insert on page and exit (5) |
|         else (7) |
|             call PickSplitFn() (2) |
|         end if |
|     else |
|         switch (chooseFn()) |
|             case MatchNode  - descend through selected node |
|             case AddNode    - add node and then retry chooseFn (3, 6) |
|             case SplitTuple - split inner tuple to prefix and postfix, then |
|                               retry chooseFn with the prefix tuple (4, 6) |
|     end if |
| |
| Notes: |
| |
| (1) Initially, we just dump leaf tuples into the root page until it is full; |
| then we split it. Once the root is not a leaf page, it can have only one |
| inner tuple, so as to keep the amount of free space on the root as large as |
| possible. Both of these rules are meant to postpone doing PickSplit on the |
| root for as long as possible, so that the topmost partitioning of the search |
| space is as good as we can easily make it. |
| |
| (2) The current implementation can do a picksplit and insert the new leaf |
| tuple in one operation, if the new list of leaf tuples fits on one page. |
| This is always possible for trees whose inner tuples have few nodes, such as |
| quad trees or k-d trees, but radix trees may require another picksplit. |
| |
| (3) Adding a node must keep the inner tuple small enough to fit on a page. |
| Even so, after the addition the inner tuple could become too large to be |
| stored on its current page, because of the other tuples on that page. In |
| this case it will be moved to another inner page (see notes about page |
| management). When moving a tuple to another page, we can't change the offset |
| numbers of the other tuples on the old page, else we'd make the downlink |
| pointers to them invalid. To prevent that, SP-GiST leaves a "placeholder" |
| tuple, which can be reused later whenever another tuple is added to the |
| page. See also the Concurrency and Vacuum sections below. Right now only |
| radix trees can add a node to an existing tuple; quad trees and k-d trees |
| create all possible nodes at once in the PickSplitFn() call. |
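| |
| A toy illustration of why the placeholder matters (hypothetical code, not |
| the real page machinery; offsets are array indexes and NULL plays the |
| placeholder role): |
| |
|     #include <stddef.h> |
| |
|     /* |
|      * Overwriting the victim's slot in place keeps every surviving |
|      * tuple's offset stable; compacting the array instead would renumber |
|      * later tuples and invalidate downlinks that point at them. |
|      */ |
|     static void |
|     move_tuple_away(const char *page[], size_t victim) |
|     { |
|         page[victim] = NULL;    /* leave a placeholder, never compact */ |
|     } |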
| |
| (4) The prefix value may only partially match a new value, so the SplitTuple |
| action allows the current tree branch to be broken into upper and lower |
| sections. Another way to say it is that we can split the current inner tuple |
| into "prefix" and "postfix" parts, where the prefix part is able to match |
| the incoming new value. Consider an example of insertion into a radix tree. |
| We use the following notation, where the tuple's id is just for discussion |
| (no such id is actually stored): |
| |
| inner tuple: {tuple id}(prefix string)[ comma separated list of node labels ] |
| leaf tuple: {tuple id}<value> |
| |
| Suppose we need to insert string 'www.gogo.com' into inner tuple |
| |
| {1}(www.google.com/)[a, i] |
| |
| The string does not match the prefix so we cannot descend. We must |
| split the inner tuple into two tuples: |
| |
| {2}(www.go)[o] - prefix tuple |
| | |
| {3}(gle.com/)[a,i] - postfix tuple |
| |
| On the next iteration of the loop we find that 'www.gogo.com' matches the |
| prefix, but not any node label, so we add a node [g] to tuple {2}: |
| |
| NIL (no child exists yet) |
| | |
| {2}(www.go)[o, g] |
| | |
| {3}(gle.com/)[a,i] |
| |
| Now we can descend through the [g] node, which will cause us to update |
| the target string to just 'o.com'. Finally, we'll insert a leaf tuple |
| bearing that string: |
| |
| {4}<o.com> |
| | |
| {2}(www.go)[o, g] |
| | |
| {3}(gle.com/)[a,i] |
| |
| As we can see, the original tuple's node array moves to the postfix tuple |
| without any changes. Note also that the SP-GiST core assumes that the prefix |
| tuple is not larger than the old inner tuple. That allows us to store the |
| prefix tuple directly in place of the old inner tuple. The SP-GiST core will |
| try to store the postfix tuple on the same page if possible, but will use |
| another page if there is not enough free space (see notes 5 and 6). |
| Currently, quad and k-d trees don't use this feature, because they have no |
| concept of a prefix being "inconsistent" with any new value. They grow their |
| depth only via PickSplitFn() calls. |
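| |
| The split point in the example above is just the length of the longest |
| common prefix; a hedged sketch of that computation (illustrative only; the |
| real radix-tree opclass code in spgtextproc.c handles many more cases): |
| |
|     #include <stdio.h> |
| |
|     /* length of the longest common leading byte sequence */ |
|     static int |
|     common_prefix_len(const char *prefix, const char *value) |
|     { |
|         int         i = 0; |
| |
|         while (prefix[i] != '\0' && prefix[i] == value[i]) |
|             i++; |
|         return i; |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         const char *prefix = "www.google.com/"; |
|         const char *value = "www.gogo.com"; |
|         int         n = common_prefix_len(prefix, value); |
| |
|         /* assumes the mismatch falls inside the prefix, as it does here; |
|            the byte after the split becomes the new node's label */ |
|         printf("prefix \"%.*s\", node label '%c', postfix \"%s\"\n", |
|                n, prefix, prefix[n], prefix + n + 1); |
|         return 0;       /* prints: prefix "www.go", node label 'o', |
|                            postfix "gle.com/" */ |
|     } |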
| |
| (5) If the pointer from the parent node is a NIL pointer, the algorithm |
| chooses a leaf page to store the new leaf tuple on. At first, it tries to |
| use the last-used leaf page with the largest free space (which we track in |
| each backend) to better utilize disk space. If that's not large enough, then |
| the algorithm allocates a new page. |
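| |
| A hedged sketch of that page-selection policy (hypothetical names; the real |
| code tracks several last-used pages, as described under LAST USED PAGE |
| MANAGEMENT below): |
| |
|     typedef unsigned int BlockNumber; |
|     #define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF) |
| |
|     typedef struct LastUsedPage |
|     { |
|         BlockNumber blkno;          /* cached page, or InvalidBlockNumber */ |
|         int         freeSpace;      /* free bytes when last seen */ |
|     } LastUsedPage; |
| |
|     /* stand-in for extending the index with a fresh page */ |
|     static BlockNumber |
|     allocate_new_leaf_page(void) |
|     { |
|         static BlockNumber next = 1; |
| |
|         return next++; |
|     } |
| |
|     static BlockNumber |
|     choose_leaf_page(LastUsedPage *cache, int neededSpace) |
|     { |
|         /* reuse the cached page if it is valid and has enough room ... */ |
|         if (cache->blkno != InvalidBlockNumber && |
|             cache->freeSpace >= neededSpace) |
|             return cache->blkno; |
| |
|         /* ... otherwise fall back to allocating a new page */ |
|         cache->blkno = allocate_new_leaf_page(); |
|         return cache->blkno; |
|     } |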
| |
| (6) Management of inner pages is very similar to management of leaf pages, |
| described in (5). |
| |
| (7) Actually, the current implementation can move the whole list of leaf |
| tuples, together with the new tuple, to another page if the list is short |
| enough. This improves space utilization, but doesn't change the basis of |
| the algorithm. |
| |
| |
| CONCURRENCY |
| |
| While descending the tree, the insertion algorithm holds exclusive locks on |
| two tree levels at a time, i.e., both the parent and child pages (but the |
| parent and child can be the same page; see notes above). There is a risk of |
| deadlock between two insertions if there are cross-referenced pages in |
| different branches. That is, if inner tuple on page M has a child on page N |
| while an inner tuple from another branch is on page N and has a child on |
| page M, then two insertions descending the two branches could deadlock, |
| since they will each hold their parent page's lock while trying to get the |
| child page's lock. |
| |
| Currently, we deal with this by conditionally locking buffers as we descend |
| the tree. If we fail to get lock on a buffer, we release both buffers and |
| restart the insertion process. This is potentially inefficient, but the |
| locking costs of a more deterministic approach seem very high. |
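| |
| The shape of that retry loop, sketched with POSIX trylock as a stand-in for |
| the conditional buffer-lock calls the real code uses: |
| |
|     #include <pthread.h> |
|     #include <stdbool.h> |
| |
|     /* |
|      * Lock the parent unconditionally, then try the child without |
|      * blocking.  On failure, release the parent and report that the |
|      * caller must restart the whole descent rather than wait (and |
|      * possibly deadlock). |
|      */ |
|     static bool |
|     lock_parent_and_child(pthread_mutex_t *parent, pthread_mutex_t *child) |
|     { |
|         pthread_mutex_lock(parent); |
|         if (pthread_mutex_trylock(child) == 0) |
|             return true;            /* hold both locks; keep descending */ |
|         pthread_mutex_unlock(parent); |
|         return false;               /* caller restarts the insertion */ |
|     } |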
| |
| To reduce the number of cases where that happens, we introduce a concept of |
| "triple parity" of pages: if inner tuple is on page with BlockNumber N, then |
| its child tuples should be placed on the same page, or else on a page with |
| BlockNumber M where (N+1) mod 3 == M mod 3. This rule ensures that tuples |
| on page M will have no children on page N, since (M+1) mod 3 != N mod 3. |
| That makes it unlikely that two insertion processes will conflict against |
| each other while descending the tree. It's not perfect though: in the first |
| place, we could still get a deadlock among three or more insertion processes, |
| and in the second place, it's impractical to preserve this invariant in every |
| case when we expand or split an inner tuple. So we still have to allow for |
| deadlocks. |
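| |
| The parity rule itself is cheap to check; an illustrative helper (not the |
| actual source code): |
| |
|     #include <stdbool.h> |
| |
|     typedef unsigned int BlockNumber; |
| |
|     /* |
|      * True if placing a child on page "child" satisfies triple parity |
|      * for a parent inner tuple on page "parent": same page, or |
|      * (parent + 1) mod 3 == child mod 3. |
|      */ |
|     static bool |
|     parity_ok(BlockNumber parent, BlockNumber child) |
|     { |
|         return parent == child || (parent + 1) % 3 == child % 3; |
|     } |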
| |
| Insertion may also need to take locks on an additional inner and/or leaf page |
| to add tuples of the right type(s), when there's not enough room on the pages |
| it descended through. However, we don't care exactly which such page we add |
| to, so deadlocks can be avoided by conditionally locking the additional |
| buffers: if we fail to get lock on an additional page, just try another one. |
| |
| The search traversal algorithm is rather traditional. At each non-leaf |
| level, it share-locks the page, identifies which node(s) in the current |
| inner tuple need to be visited, and puts those addresses on a stack of |
| pages to examine later. It then releases the lock on the current buffer |
| before visiting the next |
| stack item. So only one page is locked at a time, and no deadlock is |
| possible. But instead, we have to worry about race conditions: by the time |
| we arrive at a pointed-to page, a concurrent insertion could have replaced |
| the target inner tuple (or leaf tuple chain) with data placed elsewhere. |
| To handle that, whenever the insertion algorithm changes a nonempty downlink |
| in an inner tuple, it places a "redirect tuple" in place of the lower-level |
| inner tuple or leaf-tuple chain head that the link formerly led to. Scans |
| (though not insertions) must be prepared to honor such redirects. Only a |
| scan that had already visited the parent level could possibly reach such a |
| redirect tuple, so we can remove redirects once all active transactions have |
| been flushed out of the system. |
| |
| |
| DEAD TUPLES |
| |
| Tuples on leaf pages can be in one of four states: |
| |
| SPGIST_LIVE: normal, live pointer to a heap tuple. |
| |
| SPGIST_REDIRECT: placeholder that contains a link to another place in the |
| index. When a chain of leaf tuples has to be moved to another page, a |
| redirect tuple is inserted in place of the chain's head tuple. The parent |
| inner tuple's downlink is updated when this happens, but concurrent scans |
| might be "in flight" from the parent page to the child page (since they |
| release lock on the parent page before attempting to lock the child). |
| The redirect pointer serves to tell such a scan where to go. A redirect |
| pointer is only needed for as long as such concurrent scans could be in |
| progress. Eventually, it's converted into a PLACEHOLDER dead tuple by |
| VACUUM, and is then a candidate for replacement. Searches that find such |
| a tuple (which should never be part of a chain) should immediately proceed |
| to the other place, forgetting about the redirect tuple. Insertions that |
| reach such a tuple should raise an error, since a valid downlink should |
| never point to such a tuple. |
| |
| SPGIST_DEAD: tuple is dead, but it cannot be removed or moved to a |
| different offset on the page because there is a link leading to it from |
| some inner tuple elsewhere in the index. (Such a tuple is never part of a |
| chain, since we don't need one unless there is nothing live left in its |
| chain.) Searches should ignore such entries. If an insertion action |
| arrives at such a tuple, it should either replace it in-place (if there's |
| room on the page to hold the desired new leaf tuple) or replace it with a |
| redirection pointer to wherever it puts the new leaf tuple. |
| |
| SPGIST_PLACEHOLDER: tuple is dead, and there are known to be no links to |
| it from elsewhere. When a live tuple is deleted or moved away, and not |
| replaced by a redirect pointer, it is replaced by a placeholder to keep |
| the offsets of later tuples on the same page from changing. Placeholders |
| can be freely replaced when adding a new tuple to the page, and also |
| VACUUM will delete any that are at the end of the range of valid tuple |
| offsets. Both searches and insertions should complain if a link from |
| elsewhere leads them to a placeholder tuple. |
| |
| When the root page is also a leaf, all its tuples should be in LIVE state; |
| there's no need for the others since there are no links and no need to |
| preserve offset numbers. |
| |
| Tuples on inner pages can be in LIVE, REDIRECT, or PLACEHOLDER states. |
| The REDIRECT state has the same function as on leaf pages, to send |
| concurrent searches to the place where they need to go after an inner |
| tuple is moved to another page. Expired REDIRECT pointers are converted |
| to PLACEHOLDER status by VACUUM, and are then candidates for replacement. |
| DEAD state is not currently possible, since VACUUM does not attempt to |
| remove unused inner tuples. |
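| |
| The states above can be summarized as a hypothetical enum (the actual |
| definitions, and their numeric values, live in spgist_private.h): |
| |
|     typedef enum TupleStateSketch |
|     { |
|         STATE_LIVE,         /* normal tuple; leaf and inner pages */ |
|         STATE_REDIRECT,     /* moved away; follow the link; leaf and inner */ |
|         STATE_DEAD,         /* dead but still linked-to; leaf pages only */ |
|         STATE_PLACEHOLDER   /* dead, nothing links to it; reusable slot */ |
|     } TupleStateSketch; |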
| |
| |
| VACUUM |
| |
| VACUUM (or more precisely, spgbulkdelete) performs a single sequential scan |
| over the entire index. On both leaf and inner pages, we can convert old |
| REDIRECT tuples into PLACEHOLDER status, and then remove any PLACEHOLDERs |
| that are at the end of the page (since they aren't needed to preserve the |
| offsets of any live tuples). On leaf pages, we scan for tuples that need |
| to be deleted because their heap TIDs match a vacuum target TID. |
| |
| If we find a deletable tuple that is not at the head of its chain, we |
| can simply replace it with a PLACEHOLDER, updating the chain links to |
| remove it from the chain. If it is at the head of its chain, but there's |
| at least one live tuple remaining in the chain, we move that live tuple |
| to the head tuple's offset, leaving a PLACEHOLDER at its old offset to |
| preserve the offsets of other tuples. This keeps the parent inner tuple's |
| downlink |
| valid. If we find ourselves deleting all live tuples in a chain, we |
| replace the head tuple with a DEAD tuple and the rest with PLACEHOLDERS. |
| The parent inner tuple's downlink thus points to the DEAD tuple, and the |
| rules explained in the previous section keep everything working. |
| |
| VACUUM doesn't know a priori which tuples are heads of their chains, but |
| it can easily figure that out by constructing a predecessor array that's |
| the reverse map of the nextOffset links (ie, when we see tuple x links to |
| tuple y, we set predecessor[y] = x). Then head tuples are the ones with |
| no predecessor. |
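| |
| A hedged sketch of that reverse-mapping trick (illustrative code; offsets |
| are 1-based like OffsetNumbers, with 0 meaning "no next tuple"): |
| |
|     #include <stdio.h> |
| |
|     #define MAX_OFFSET 8 |
| |
|     int |
|     main(void) |
|     { |
|         /* nextOffset[i] = successor of tuple i, or 0; here 1->3->5->2 */ |
|         int         nextOffset[MAX_OFFSET + 1] = {0, 3, 0, 5, 0, 2}; |
|         int         predecessor[MAX_OFFSET + 1] = {0}; |
| |
|         for (int x = 1; x <= MAX_OFFSET; x++) |
|             if (nextOffset[x] != 0) |
|                 predecessor[nextOffset[x]] = x; |
| |
|         /* head tuples are exactly those with no predecessor |
|            (empty offsets also print; real code checks occupancy) */ |
|         for (int y = 1; y <= MAX_OFFSET; y++) |
|             if (predecessor[y] == 0) |
|                 printf("offset %d has no predecessor\n", y); |
|         return 0; |
|     } |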
| |
| Because insertions can occur while VACUUM runs, a pure sequential scan |
| could miss deleting some target leaf tuples, because they could get moved |
| from a not-yet-visited leaf page to an already-visited leaf page as a |
| consequence of a PickSplit or MoveLeafs operation. Failing to delete any |
| target TID is not acceptable, so we have to extend the algorithm to cope |
| with such cases. We recognize that such a move might have occurred when |
| we see a leaf-page REDIRECT tuple whose XID indicates it might have been |
| created after the VACUUM scan started. We add the redirection target TID |
| to a "pending list" of places we need to recheck. Between pages of the |
| main sequential scan, we empty the pending list by visiting each listed |
| TID. If it points to an inner tuple (from a PickSplit), add each downlink |
| TID to the pending list. If it points to a leaf page, vacuum that page. |
| (We could just vacuum the single pointed-to chain, but vacuuming the |
| whole page simplifies the code and reduces the odds of VACUUM having to |
| modify the same page multiple times.) To ensure that pending-list |
| processing can never get into an endless loop, even in the face of |
| concurrent index changes, we don't remove list entries immediately but |
| only after we've completed all pending-list processing; instead we just |
| mark items as done after processing them. Adding a TID that's already in |
| the list is a no-op, whether or not that item is marked done yet. |
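| |
| A minimal sketch of that pending-list discipline (hypothetical types and a |
| fixed-size array; the essential points are that duplicates are rejected at |
| insertion and that entries are marked done rather than removed): |
| |
|     #include <stdbool.h> |
| |
|     typedef struct PendingItem |
|     { |
|         unsigned    blkno;          /* stand-in for an ItemPointer */ |
|         unsigned    offset; |
|         bool        done; |
|     } PendingItem; |
| |
|     #define MAX_PENDING 1024        /* overflow handling omitted */ |
|     static PendingItem pending[MAX_PENDING]; |
|     static int  nPending = 0; |
| |
|     /* adding a TID that's already listed is a no-op, done or not */ |
|     static void |
|     pending_add(unsigned blkno, unsigned offset) |
|     { |
|         for (int i = 0; i < nPending; i++) |
|             if (pending[i].blkno == blkno && pending[i].offset == offset) |
|                 return; |
|         pending[nPending++] = (PendingItem) {blkno, offset, false}; |
|     } |
| |
|     /* visit() may call pending_add(), growing the list as we scan it; |
|        since entries are never removed mid-scan, this cannot loop forever */ |
|     static void |
|     pending_process(void (*visit) (unsigned blkno, unsigned offset)) |
|     { |
|         for (int i = 0; i < nPending; i++) |
|         { |
|             if (pending[i].done) |
|                 continue; |
|             visit(pending[i].blkno, pending[i].offset); |
|             pending[i].done = true; |
|         } |
|         nPending = 0;   /* processing complete; safe to empty the list */ |
|     } |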
| |
| spgbulkdelete also updates the index's free space map. |
| |
| Currently, spgvacuumcleanup has nothing to do if spgbulkdelete was |
| performed; otherwise, it does an spgbulkdelete scan with an empty target |
| list, so as to clean up redirections and placeholders, update the free |
| space map, and gather statistics. |
| |
| |
| LAST USED PAGE MANAGEMENT |
| |
| The list of last used pages contains four pages - a leaf page and three |
| inner pages, one from each "triple parity" group. (Actually, there's one |
| such list for the main tree and a separate one for the nulls tree.) This |
| list is stored between calls on the index meta page, but updates are never |
| WAL-logged, to decrease WAL traffic. Incorrect data on the meta page isn't |
| critical, because we can simply allocate a new page at any moment. |
| |
| |
| AUTHORS |
| |
| Teodor Sigaev <teodor@sigaev.ru> |
| Oleg Bartunov <oleg@sai.msu.su> |