| src/backend/access/gin/README |
| |
| Gin for PostgreSQL |
| ================== |
| |
| Gin was sponsored by jfg://networks (http://www.jfg-networks.com/) |
| |
| Gin stands for Generalized Inverted Index and should be considered as a genie, |
| not a drink. |
| |
| Generalized means that the index does not know which operation it accelerates. |
| It instead works with custom strategies, defined for specific data types (read |
| "Index Method Strategies" in the PostgreSQL documentation). In that sense, Gin |
| is similar to GiST and differs from btree indices, which have predefined, |
| comparison-based operations. |
| |
| An inverted index is an index structure storing a set of (key, posting list) |
| pairs, where 'posting list' is a set of heap rows in which the key occurs. |
| (A text document would usually contain many keys.) The primary goal of |
| Gin indices is support for highly scalable, full-text search in PostgreSQL. |
| |
| A Gin index consists of a B-tree index constructed over key values, |
| where each key is an element of some indexed items (element of array, lexeme |
| for tsvector) and where each tuple in a leaf page contains either a pointer to |
| a B-tree over item pointers (posting tree), or a simple list of item pointers |
| (posting list) if the list is small enough. |
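| |
| To make that concrete, here is a tiny, purely illustrative C sketch of the |
| logical mapping just described (the names and the flat array are stand-ins |
| for the on-disk B-trees, not PostgreSQL code): |
| |
|     #include <stdio.h> |
| |
|     /* a heap row, identified by block and offset number */ |
|     typedef struct { unsigned block; unsigned short offset; } ItemPtr; |
| |
|     typedef struct |
|     { |
|         const char *key;      /* one extracted key, e.g. array element or lexeme */ |
|         ItemPtr    *posting;  /* sorted list of heap rows containing the key */ |
|         int         nposting; /* small lists stay inline ("posting list"); */ |
|                               /* large ones move to a separate "posting tree" */ |
|     } EntryStub; |
| |
|     int |
|     main(void) |
|     { |
|         ItemPtr     cat_rows[] = {{1, 2}, {1, 5}, {7, 1}}; |
|         ItemPtr     dog_rows[] = {{1, 5}}; |
|         EntryStub   entries[] = {   /* kept sorted by key in the entry tree */ |
|             {"cat", cat_rows, 3}, |
|             {"dog", dog_rows, 1}, |
|         }; |
| |
|         for (int i = 0; i < 2; i++) |
|             printf("%s -> %d heap row(s)\n", entries[i].key, entries[i].nposting); |
|         return 0; |
|     } |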
| |
| Note: There is no delete operation in the key (entry) tree. The reason for |
| this is that in our experience, the set of distinct words in a large corpus |
| changes very slowly. This greatly simplifies the code and concurrency |
| algorithms. |
| |
| Core PostgreSQL includes built-in Gin support for one-dimensional arrays |
| (eg. integer[], text[]). The following operations are available: |
| |
| * contains: value_array @> query_array |
| * overlaps: value_array && query_array |
| * is contained by: value_array <@ query_array |
| |
| Synopsis |
| -------- |
| |
| =# create index txt_idx on aa using gin(a); |
| |
| Features |
| -------- |
| |
| * Concurrency |
| * Write-Ahead Logging (WAL). (Recoverability from crashes.) |
| * User-defined opclasses. (The scheme is similar to GiST.) |
| * Optimized index creation (Makes use of maintenance_work_mem to accumulate |
| postings in memory.) |
| * Text search support via an opclass |
| * Soft upper limit on the returned results set using a GUC variable: |
| gin_fuzzy_search_limit |
| |
| Gin Fuzzy Limit |
| --------------- |
| |
| There are often situations when a full-text search returns a very large set of |
| results. Since reading tuples from the disk and sorting them could take a |
| lot of time, this is unacceptable for production. (Note that the search |
| itself is very fast.) |
| |
| Such queries usually contain very frequent lexemes, so the results are not |
| very helpful. To facilitate execution of such queries Gin has a configurable |
| soft upper limit on the size of the returned set, determined by the |
| 'gin_fuzzy_search_limit' GUC variable. This is set to 0 by default (no |
| limit). |
| |
| If a non-zero search limit is set, then the returned set is a subset of the |
| whole result set, chosen at random. |
| |
| "Soft" means that the actual number of returned results could differ |
| from the specified limit, depending on the query and the quality of the |
| system's random number generator. |
| |
| From experience, a value of 'gin_fuzzy_search_limit' in the thousands |
| (eg. 5000-20000) works well. This means that 'gin_fuzzy_search_limit' will |
| have no effect for queries returning a result set with fewer tuples than |
| this number. |
| |
| Index structure |
| --------------- |
| |
| The "items" that a GIN index indexes are composite values that contain |
| zero or more "keys". For example, an item might be an integer array, and |
| then the keys would be the individual integer values. The index actually |
| stores and searches for the key values, not the items per se. In the |
| pg_opclass entry for a GIN opclass, the opcintype is the data type of the |
| items, and the opckeytype is the data type of the keys. GIN is optimized |
| for cases where items contain many keys and the same key values appear |
| in many different items. |
| |
| A GIN index contains a metapage, a btree of key entries, and possibly |
| "posting tree" pages, which hold the overflow when a key entry acquires |
| too many heap tuple pointers to fit in a btree page. Additionally, if the |
| fast-update feature is enabled, there can be "list pages" holding "pending" |
| key entries that haven't yet been merged into the main btree. The list |
| pages have to be scanned linearly when doing a search, so the pending |
| entries should be merged into the main btree before there get to be too |
| many of them. The advantage of the pending list is that bulk insertion of |
| a few thousand entries can be much faster than retail insertion. (The win |
| comes mainly from not having to do multiple searches/insertions when the |
| same key appears in multiple new heap tuples.) |
| |
| Key entries are nominally of the same IndexTuple format as used in other |
| index types, but since a leaf key entry typically refers to multiple heap |
| tuples, there are significant differences. (See GinFormTuple, which works |
| by building a "normal" index tuple and then modifying it.) The points to |
| know are: |
| |
| * In a single-column index, a key tuple just contains the key datum, but |
| in a multi-column index, a key tuple contains the pair (column number, |
| key datum) where the column number is stored as an int2. This is needed |
| to support different key data types in different columns. This much of |
| the tuple is built by index_form_tuple according to the usual rules. |
| The column number (if present) can never be null, but the key datum can |
| be, in which case a null bitmap is present as usual. (As usual for index |
| tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.) |
| |
| * If the key datum is null (ie, IndexTupleHasNulls() is true), then |
| just after the nominal index data (ie, at offset IndexInfoFindDataOffset |
| or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating |
| the "category" of the null entry. These are the possible categories: |
| 1 = ordinary null key value extracted from an indexable item |
| 2 = placeholder for zero-key indexable item |
| 3 = placeholder for null indexable item |
| Placeholder null entries are inserted into the index because otherwise |
| there would be no index entry at all for an empty or null indexable item, |
| which would mean that full index scans couldn't be done and various corner |
| cases would give wrong answers. The different categories of null entries |
| are treated as distinct keys by the btree, but heap itempointers for the |
| same category of null entry are merged into one index entry just as happens |
| with ordinary key entries. |
| |
| * In a key entry at the btree leaf level, at the next SHORTALIGN boundary, |
| there is a list of item pointers, in compressed format (see Posting List |
| Compression section), pointing to the heap tuples for which the indexable |
| items contain this key. This is called the "posting list". |
| |
| If the list would be too big for the index tuple to fit on an index page, the |
| ItemPointers are pushed out to a separate posting page or pages, and none |
| appear in the key entry itself. The separate pages are called a "posting |
| tree" (see below). Note that in either case, the ItemPointers associated with |
| a key can easily be read out in sorted order; this is relied on by the scan |
| algorithms. |
| |
| * The index tuple header fields of a leaf key entry are abused as follows: |
| |
| 1) Posting list case: |
| |
| * ItemPointerGetBlockNumber(&itup->t_tid) contains the offset from index |
| tuple start to the posting list. |
| Access macros: GinGetPostingOffset(itup) / GinSetPostingOffset(itup,n) |
| |
| * ItemPointerGetOffsetNumber(&itup->t_tid) contains the number of elements |
| in the posting list (number of heap itempointers). |
| Access macros: GinGetNPosting(itup) / GinSetNPosting(itup,n) |
| |
| * If IndexTupleHasNulls(itup) is true, the null category byte can be |
| accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c) |
| |
| * The posting list can be accessed with GinGetPosting(itup) |
| |
| * If GinItupIsCompressed(itup), the posting list is stored in compressed |
| format. Otherwise it is just an array of ItemPointers. New tuples are always |
| stored in compressed format; uncompressed items can be present only if the |
| database was migrated from 9.3 or an earlier version. |
| |
| 2) Posting tree case: |
| |
| * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number |
| of the root of the posting tree. |
| Access macros: GinGetPostingTree(itup) / GinSetPostingTree(itup, blkno) |
| |
| * ItemPointerGetOffsetNumber(&itup->t_tid) contains the magic number |
| GIN_TREE_POSTING, which distinguishes this from the posting-list case |
| (it's large enough that that many heap itempointers couldn't possibly |
| fit on an index page). This value is inserted automatically by the |
| GinSetPostingTree macro. |
| |
| * If IndexTupleHasNulls(itup) is true, the null category byte can be |
| accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c) |
| |
| * The posting list is not present and must not be accessed. |
| |
| Use the macro GinIsPostingTree(itup) to determine which case applies. |
| |
| In both cases, itup->t_info & INDEX_SIZE_MASK contains the actual total size |
| of the tuple, and the INDEX_VAR_MASK and INDEX_NULL_MASK bits have their |
| normal meanings as set by index_form_tuple. |
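| |
| To make the two cases concrete, here is a purely illustrative sketch; the |
| struct and the magic value below are stand-ins, not the real on-disk |
| definitions or the GinGet*/GinSet* macros: |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     typedef struct |
|     { |
|         uint32_t    block_field;   /* ~ ItemPointerGetBlockNumber(&itup->t_tid) */ |
|         uint16_t    offset_field;  /* ~ ItemPointerGetOffsetNumber(&itup->t_tid) */ |
|     } FakeTid; |
| |
|     #define FAKE_TREE_POSTING 0xFFFF  /* stand-in for the GIN_TREE_POSTING magic */ |
| |
|     static void |
|     describe(FakeTid tid) |
|     { |
|         if (tid.offset_field == FAKE_TREE_POSTING) |
|             printf("posting tree, root at index block %u\n", tid.block_field); |
|         else |
|             printf("posting list, %u item pointers at tuple offset %u\n", |
|                    tid.offset_field, tid.block_field); |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         FakeTid     list_entry = {48, 3};   /* 3 TIDs stored inline at offset 48 */ |
|         FakeTid     tree_entry = {1234, FAKE_TREE_POSTING}; /* tree at block 1234 */ |
| |
|         describe(list_entry); |
|         describe(tree_entry); |
|         return 0; |
|     } |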
| |
| Index tuples in non-leaf levels of the btree contain the optional column |
| number, key datum, and null category byte as above. They do not contain |
| a posting list. ItemPointerGetBlockNumber(&itup->t_tid) is the downlink |
| to the next lower btree level, and ItemPointerGetOffsetNumber(&itup->t_tid) |
| is InvalidOffsetNumber. Use the access macros GinGetDownlink/GinSetDownlink |
| to get/set the downlink. |
| |
| Index entries that appear in "pending list" pages work a tad differently as |
| well. The optional column number, key datum, and null category byte are as |
| for other GIN index entries. However, there is always exactly one heap |
| itempointer associated with a pending entry, and it is stored in the t_tid |
| header field just as in non-GIN indexes. There is no posting list. |
| Furthermore, the code that searches the pending list assumes that all |
| entries for a given heap tuple appear consecutively in the pending list and |
| are sorted by the column-number-plus-key-datum. The GIN_LIST_FULLROW page |
| flag bit tells whether entries for a given heap tuple are spread across |
| multiple pending-list pages. If GIN_LIST_FULLROW is set, the page contains |
| all the entries for one or more heap tuples. If GIN_LIST_FULLROW is clear, |
| the page contains entries for only one heap tuple, *and* they are not all |
| the entries for that tuple. (Thus, a heap tuple whose entries do not all |
| fit on one pending-list page must have those pages to itself, even if this |
| results in wasting much of the space on the preceding page and the last |
| page for the tuple.) |
| |
| GIN packs downlinks and pivot keys into internal page tuples in a different way |
| than nbtree does. Lehman & Yao define it as follows. |
| |
| P_0, K_1, P_1, K_2, P_2, ... , K_n, P_n, K_{n+1} |
| |
| Here P_i is a downlink and K_i is a key. K_i splits the key space between |
| P_{i-1} and P_i (1 <= i <= n). K_{n+1} is the high key. |
| |
| In an internal page, each tuple groups a key and a downlink together. nbtree |
| packs keys and downlinks into tuples as follows. |
| |
| (K_{n+1}, None), (-Inf, P_0), (K_1, P_1), ... , (K_n, P_n) |
| |
| Here tuples are shown in parentheses. The high key is stored separately; |
| P_i is grouped with K_i, and P_0 is grouped with the -Inf key. |
| |
| GIN packs keys and downlinks into tuples in a different way. |
| |
| (P_0, K_1), (P_1, K_2), ... , (P_n, K_{n+1}) |
| |
| P_i is grouped with K_{i+1}. -Inf key is not needed. |
| |
| A couple of additional notes regarding the K_{n+1} key: |
| 1) In the rightmost page of the entry tree, the key coupled with P_n doesn't |
| really matter: the high key is assumed to be infinity. |
| 2) In the posting tree, the key coupled with P_n never matters: the high key |
| for non-rightmost pages is stored separately and accessed via |
| GinDataPageGetRightBound(). |
| |
| Posting tree |
| ------------ |
| |
| If a posting list is too large to store in-line in a key entry, a posting tree |
| is created. A posting tree is a B-tree structure, where the ItemPointer is |
| used as the key. |
| |
| Internal posting tree pages use the standard PageHeader and the same "opaque" |
| struct as other GIN pages, but do not contain regular index tuples. Instead, |
| the content of the page is an array of PostingItem structs. Each PostingItem |
| consists of the block number of the child page, and the right bound of that |
| child page, as an ItemPointer. The right bound of the page itself is stored |
| right after the page header, before the PostingItem array. |
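| |
| As a rough, purely illustrative sketch of that layout (stand-in names, not |
| the real structs, and ignoring the PageHeader and opaque data): |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     typedef struct |
|     { |
|         uint32_t    block;          /* heap block number  */ |
|         uint16_t    offset;         /* heap offset number */ |
|     } TidStub;                      /* stand-in for ItemPointerData */ |
| |
|     typedef struct |
|     { |
|         uint32_t    child_blkno;    /* index block number of the child page */ |
|         TidStub     right_bound;    /* largest item pointer on that child */ |
|     } PostingItemStub;              /* stand-in for PostingItem */ |
| |
|     int |
|     main(void) |
|     { |
|         /* page payload: the page's own right bound, then the downlink array */ |
|         TidStub         page_right_bound = {500, 10}; |
|         PostingItemStub downlinks[] = { |
|             {11, {100, 3}},         /* child 11 holds items up to (100,3)   */ |
|             {12, {250, 7}},         /* child 12 holds items up to (250,7)   */ |
|             {13, {500, 10}},        /* child 13 holds items up to the bound */ |
|         }; |
| |
|         printf("right bound (%u,%u), %d downlinks\n", |
|                page_right_bound.block, page_right_bound.offset, |
|                (int) (sizeof(downlinks) / sizeof(downlinks[0]))); |
|         return 0; |
|     } |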
| |
| Posting tree leaf pages also use the standard PageHeader and opaque struct, |
| and the right bound of the page is stored right after the page header, but |
| the page content consists of a number of compressed posting lists. The |
| compressed posting lists are stored one after another, between the page header |
| and pd_lower. The space between pd_lower and pd_upper is unused, which allows |
| full-page images of posting tree leaf pages to skip the unused space in the |
| middle (buffer_std = true in XLogRecData). |
| |
| The item pointers are stored in a number of independent compressed posting |
| lists (also called segments), instead of one big one, to make random access |
| to a given item pointer faster: to find an item in a compressed list, you |
| have to read the list from the beginning, but when the items are split into |
| multiple lists, you can first skip over to the list containing the item you're |
| looking for, and read only that segment. Also, an update only needs to |
| re-encode the affected segment. |
| |
| Posting List Compression |
| ------------------------ |
| |
| To fit as many item pointers on a page as possible, posting tree leaf pages |
| and posting lists stored inline in entry tree leaf tuples use a lightweight |
| form of compression. We take advantage of the fact that the item pointers |
| are stored in sorted order. Instead of storing the block and offset number of |
| each item pointer separately, we store the difference from the previous item. |
| That in itself doesn't do much, but it allows us to use so-called varbyte |
| encoding to compress them. |
| |
| Varbyte encoding is a method to encode integers, allowing smaller numbers to |
| take less space at the cost of larger numbers. Each integer is represented by |
| a variable number of bytes. The high bit of each byte determines whether the |
| next byte is still part of the same number. Therefore, to read a single |
| varbyte-encoded number, you read bytes until you find a byte with the high |
| bit not set. |
| |
| When encoding, the block and offset number forming the item pointer are |
| combined into a single integer. The offset number is stored in the 11 low |
| bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block number |
| is stored in the higher bits. That requires 43 bits in total; since each |
| varbyte byte carries 7 data bits, a single encoded value takes at most 7 |
| bytes, and the small deltas between consecutive item pointers usually take |
| just one or two. |
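| |
| The following self-contained sketch mirrors only the scheme described above |
| (it is not the actual ginpostinglist.c code): an item pointer is combined |
| into one integer, and a delta between consecutive values is varbyte-encoded |
| and decoded. |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     #define OFFSET_BITS 11              /* cf. MaxHeapTuplesPerPageBits */ |
| |
|     static uint64_t |
|     combine(uint32_t blkno, uint16_t offnum) |
|     { |
|         return ((uint64_t) blkno << OFFSET_BITS) | offnum; /* 43 significant bits */ |
|     } |
| |
|     static int |
|     encode_varbyte(uint64_t val, unsigned char *buf) |
|     { |
|         int         n = 0; |
| |
|         while (val > 0x7F) |
|         { |
|             buf[n++] = 0x80 | (val & 0x7F); /* 7 data bits, continuation bit set */ |
|             val >>= 7; |
|         } |
|         buf[n++] = (unsigned char) val;     /* last byte: high bit clear */ |
|         return n; |
|     } |
| |
|     static uint64_t |
|     decode_varbyte(const unsigned char *buf, int *nread) |
|     { |
|         uint64_t    val = 0; |
|         int         shift = 0; |
|         int         n = 0; |
|         unsigned char b; |
| |
|         do |
|         { |
|             b = buf[n++]; |
|             val |= (uint64_t) (b & 0x7F) << shift; |
|             shift += 7; |
|         } while (b & 0x80); |
|         *nread = n; |
|         return val; |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         /* consecutive item pointers: (block 1000, off 5), (block 1000, off 17) */ |
|         uint64_t    prev = combine(1000, 5); |
|         uint64_t    next = combine(1000, 17); |
|         unsigned char buf[7];               /* worst case for a 43-bit value */ |
|         int         nbytes = encode_varbyte(next - prev, buf); |
|         int         nread; |
| |
|         printf("delta %llu encoded in %d byte(s), decodes to %llu\n", |
|                (unsigned long long) (next - prev), nbytes, |
|                (unsigned long long) decode_varbyte(buf, &nread)); |
|         return 0; |
|     } |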
| |
| A compressed posting list is passed around and stored on disk in a |
| GinPostingList struct. The first item in the list is stored uncompressed |
| as a regular ItemPointerData, followed by the length of the list in bytes, |
| followed by the packed items. |
| |
| Concurrency |
| ----------- |
| |
| The entry tree and each posting tree are B-trees, with right-links connecting |
| sibling pages at the same level. This is the same structure that is used in |
| the regular B-tree indexam (invented by Lehman & Yao), but we don't support |
| scanning GIN trees backwards, so we don't need left-links. The entry tree |
| leaves don't have dedicated high keys; instead, the greatest leaf tuple serves |
| as the high key. That works because tuples are never deleted from the entry |
| tree. |
| |
| The algorithms used to operate on the entry and posting trees are described |
| below. |
| |
| ### Locating the leaf page |
| |
| When we search for the leaf page of a GIN btree to perform a read, we descend |
| from the root page to the leaf, following downlinks and holding a pin and |
| shared lock on only one page at a time: the pin and shared lock on the |
| previous page are released before they are acquired on the next page. |
| |
| The picture below shows the tree state after finding the leaf page. Lower-case |
| letters depict tree pages; 'S' depicts a shared lock on the page. |
| |
| a |
| / | \ |
| b c d |
| / | \ | \ | \ |
| eS f g h i j k |
| |
| ### Stepping right |
| |
| Concurrent page splits move the keyspace to right, so after following a |
| downlink, the page actually containing the key we're looking for might be |
| somewhere to the right of the page we landed on. In that case, we follow the |
| right-links until we find the page we're looking for. |
| |
| When stepping right we take the pin and shared lock on the right sibling |
| before releasing them on the current page. This protects us from stepping |
| onto a deleted page: we move to the right sibling while still holding the |
| lock on the page whose rightlink points there, so it is guaranteed that |
| nobody can concurrently update that rightlink or delete the right sibling. |
| |
| The picture below shows two pages locked at once during stepping right. |
| |
| a |
| / | \ |
| b c d |
| / | \ | \ | \ |
| eS fS g h i j k |
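| |
| The two lock orderings can be summarized in a compile-only C sketch; the |
| Page type and every helper below are hypothetical stand-ins, not the real |
| GIN or buffer-manager API: |
| |
|     typedef struct Page Page; |
| |
|     /* hypothetical helpers standing in for pinning/locking shared buffers */ |
|     extern Page *lock_shared(Page *page); |
|     extern void  unlock(Page *page); |
|     extern int   is_leaf(const Page *page); |
|     extern Page *downlink_child(const Page *page, long key); |
|     extern Page *right_sibling(const Page *page); |
|     extern int   key_is_beyond_high_key(const Page *page, long key); |
| |
|     Page * |
|     find_leaf_for_read(Page *root, long key) |
|     { |
|         Page   *cur = lock_shared(root); |
| |
|         for (;;) |
|         { |
|             /* step right past concurrent splits: lock the sibling first, |
|              * and only then release the current page */ |
|             while (key_is_beyond_high_key(cur, key)) |
|             { |
|                 Page   *right = lock_shared(right_sibling(cur)); |
| |
|                 unlock(cur); |
|                 cur = right; |
|             } |
| |
|             if (is_leaf(cur)) |
|                 return cur; |
| |
|             /* descend: for a plain read it is fine to release the current |
|              * page before locking the child */ |
|             { |
|                 Page   *child = downlink_child(cur, key); |
| |
|                 unlock(cur); |
|                 cur = lock_shared(child); |
|             } |
|         } |
|     } |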
| |
| ### Insert |
| |
| To find the appropriate leaf for an insertion we also descend from the root |
| to a leaf, shared-locking one page at a time, but we do not release the pins |
| on the root and internal pages. Keeping those pins can save us some lookups |
| in the buffer hash table when downlinks have to be inserted, assuming the |
| parents have not been changed by concurrent splits. Once we reach the leaf, |
| we re-lock the page in exclusive mode. |
| |
| The picture below shows the leaf page locked in exclusive mode and ready for |
| insertion. 'P' and 'E' depict a pin and an exclusive lock, respectively. |
| |
| |
| aP |
| / | \ |
| b cP d |
| / | \ | \ | \ |
| e f g hE i j k |
| |
| |
| If the insert causes a page split, the parent is locked in exclusive mode |
| before the left child is unlocked. Thus, the insertion algorithm can hold |
| exclusive locks on both the parent and the child page at once, acquiring the |
| child's lock first. |
| |
| The picture below shows the tree state after a leaf page split. 'q' is the |
| new page produced by the split. Parent 'c' is about to have a downlink |
| inserted. |
| |
| aP |
| / | \ |
| b cE d |
| / | \ / | \ | \ |
| e f g hE q i j k |
| |
| |
| ### Page deletion |
| |
| Vacuum never deletes tuples or pages from the entry tree. It traverses the |
| entry tree leaves in logical order by rightlinks and removes deletable TIDs |
| from posting lists. Posting trees are reached through pointers in the entry |
| tree leaves and are vacuumed in two stages. In the first stage, deletable |
| TIDs are removed from the leaves. If the first stage detects at least one |
| empty page, then in the second stage ginScanToDelete() deletes the empty |
| pages. |
| |
| ginScanToDelete() traverses the whole tree in a depth-first manner. It starts |
| by taking a full cleanup lock on the tree root. This lock prevents any |
| concurrent insertions into this tree while we're deleting pages. However, |
| there might still be some in-progress readers that traversed the root before |
| we locked it. |
| |
| The picture below shows the tree state after the page deletion algorithm has |
| traversed to the leftmost leaf of the tree. |
| |
| aE |
| / | \ |
| bE c d |
| / | \ | \ | \ |
| eE f g h i j k |
| |
| The deletion algorithm keeps exclusive locks on the left siblings of the |
| pages comprising the currently investigated path. Thus, if the current page |
| is to be removed, all the pages required to remove both its downlink and |
| rightlink are already locked. That avoids a potential right-to-left page |
| locking order, which could deadlock with concurrent stepping right. |
| |
| A search concurrent to page deletion might already have read a pointer to the |
| page to be deleted, and might be just about to follow it. A page can be reached |
| via the right-link of its left sibling, or via its downlink in the parent. |
| |
| To prevent a backend from reaching a deleted page via a right-link, the |
| stepping-right algorithm doesn't release the lock on the current page until |
| the lock on the right page has been acquired. |
| |
| The downlink is more tricky. A search descending the tree must release the |
| lock on the parent page before locking the child, or it could deadlock with |
| a concurrent split of the child page: a page split locks the parent while |
| already holding a lock on the child page. So, a deleted page cannot be |
| reclaimed immediately. Instead, we have to wait for every transaction that |
| might still reference this page to finish. Such processes must observe that |
| the page is marked deleted and recover accordingly. |
| |
| The picture below shows the tree state after the page deletion algorithm has |
| traversed further into the tree. The currently investigated path is 'a-c-h'. |
| The left siblings 'b' and 'g' of 'c' and 'h', respectively, are also |
| exclusively locked. |
| |
| aE |
| / | \ |
| bE cE d |
| / | \ | \ | \ |
| e f gE hE i j k |
| |
| The next picture shows the tree state after page 'h' was deleted. It is |
| marked with the 'deleted' flag and the newest xid that might still visit it. |
| The downlink from 'c' to 'h' is also deleted. |
| |
| aE |
| / | \ |
| bE cE d |
| / | \ \ | \ |
| e f gE hD iE j k |
| |
| However, it is still possible that a concurrent reader saw the downlink from |
| 'c' to 'h' before we deleted it. In that case the reader will step right from |
| 'h' until it finds a non-deleted page. The xid marking of page 'h' guarantees |
| that the page won't be reused until all such readers are gone. The next leaf |
| page under investigation is 'i'. 'g' remains locked, as it becomes the left |
| sibling of 'i'. |
| |
| The next picture shows the tree state after 'i' and 'c' were deleted. |
| Internal page 'c' was deleted because it was left with no downlinks. The path |
| under investigation is 'a-d-j'. Pages 'b' and 'g' are locked as the left |
| siblings of 'd' and 'j', respectively. |
| |
| aE |
| / \ |
| bE cD dE |
| / | \ | \ |
| e f gE hD iD jE k |
| |
| During replay of a page deletion on a standby, the page's left sibling, the |
| target page, and its parent are locked in that order. This ordering |
| guarantees that there is no deadlock with concurrent reads. |
| |
| Predicate Locking |
| ----------------- |
| |
| GIN supports predicate locking, for serializable snapshot isolation. |
| A predicate lock represents that a scan has scanned a range of values. Such |
| locks are not concerned with physical pages as such, but with logical key |
| values. |
| A predicate lock on a page covers the key range that would belong on that |
| page, whether or not there are any matching tuples there currently. In other |
| words, a predicate lock on an index page covers the "gaps" between the index |
| tuples. To minimize false positives, predicate locks are acquired at the |
| finest level possible. |
| |
| * Like in the B-tree index, it is enough to lock only leaf pages, because all |
| insertions happen at the leaf level. |
| |
| * In an equality search (i.e. not a partial match search), if a key entry has |
| a posting tree, we lock the posting tree root page, to represent a lock on |
| just that key entry. Otherwise, we lock the entry tree page. We also lock |
| the entry tree page if no match is found, to lock the "gap" where the entry |
| would've been, had there been one. |
| |
| * In a partial match search, we lock all the entry leaf pages that we scan, |
| in addition to locks on posting tree roots, to represent the "gaps" between |
| values. |
| |
| * In addition to the locks on entry leaf pages and posting tree roots, all |
| scans grab a lock on the metapage. This is to interlock with insertions to |
| the fast update pending list. An insertion to the pending list can really |
| belong anywhere in the tree, and the lock on the metapage represents that. |
| |
| The interlock for fastupdate pending lists means that with fastupdate=on, |
| we effectively always grab a full-index lock, so you could get a lot of false |
| positives. |
| |
| Compatibility |
| ------------- |
| |
| Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in |
| the uncompressed format because of pg_upgrade from 9.3 or earlier versions. |
| For compatibility, the old uncompressed format is also supported. The |
| following rules are used to handle it: |
| |
| * GIN_ITUP_COMPRESSED flag marks index tuples that contain a posting list. |
| This flag is stored in the high bit of ItemPointerGetBlockNumber(&itup->t_tid). |
| Use GinItupIsCompressed(itup) to check the flag. |
| |
| * Posting tree pages in the new format are marked with the GIN_COMPRESSED flag. |
| Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are used to |
| check and set this flag. |
| |
| * All scan operations check the format of the posting list and use the |
| corresponding code to read its contents. |
| |
| * When updating an index tuple containing an uncompressed posting list, it |
| is replaced with a new index tuple containing a compressed list. |
| |
| * When updating an uncompressed posting tree leaf page, it's compressed. |
| |
| * If vacuum finds some dead TIDs in uncompressed posting lists, they are |
| converted into compressed posting lists. This assumes that the compressed |
| posting list fits in the space occupied by the uncompressed list. IOW, we |
| assume that the compressed version of the page, with the dead items removed, |
| takes less space than the old uncompressed version. |
| |
| Limitations |
| ----------- |
| |
| * Gin doesn't use scan->kill_prior_tuple & scan->ignore_killed_tuples |
| * Gin searches entries only by equality matching, or simple range |
| matching using the "partial match" feature. |
| |
| TODO |
| ---- |
| |
| Nearest future: |
| |
| * Opclasses for more types (no programming, just many catalog changes) |
| |
| Distant future: |
| |
| * Replace B-tree of entries to something like GiST |
| |
| Authors |
| ------- |
| |
| Original work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov |
| (oleg@sai.msu.su). |