| src/backend/access/gin/README |
| |
| Gin for PostgreSQL |
| ================== |
| |
| Gin was sponsored by jfg://networks (http://www.jfg-networks.com/) |
| |
| Gin stands for Generalized Inverted Index and should be considered as a genie, |
| not a drink. |
| |
| Generalized means that the index does not know which operation it accelerates. |
| It instead works with custom strategies, defined for specific data types (read |
| "Index Method Strategies" in the PostgreSQL documentation). In that sense, Gin |
| is similar to GiST and differs from btree indices, which have predefined, |
| comparison-based operations. |
| |
| An inverted index is an index structure storing a set of (key, posting list) |
| pairs, where 'posting list' is a set of heap rows in which the key occurs. |
| (A text document would usually contain many keys.) The primary goal of |
| Gin indices is support for highly scalable, full-text search in PostgreSQL. |
| |
| A Gin index consists of a B-tree index constructed over key values, |
| where each key is an element of some indexed items (element of array, lexeme |
| for tsvector) and where each tuple in a leaf page contains either a pointer to |
| a B-tree over item pointers (posting tree), or a simple list of item pointers |
| (posting list) if the list is small enough. |
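| |
| To make that concrete, here is a tiny, purely illustrative C sketch of the |
| logical mapping just described (the names and the flat array are stand-ins |
| for the on-disk B-trees, not PostgreSQL code): |
| |
|     #include <stdio.h> |
| |
|     /* a heap row, identified by block and offset number */ |
|     typedef struct { unsigned block; unsigned short offset; } ItemPtr; |
| |
|     typedef struct |
|     { |
|         const char *key;      /* one extracted key, e.g. array element or lexeme */ |
|         ItemPtr    *posting;  /* sorted list of heap rows containing the key */ |
|         int         nposting; /* small lists stay inline ("posting list"); */ |
|                               /* large ones move to a separate "posting tree" */ |
|     } EntryStub; |
| |
|     int |
|     main(void) |
|     { |
|         ItemPtr     cat_rows[] = {{1, 2}, {1, 5}, {7, 1}}; |
|         ItemPtr     dog_rows[] = {{1, 5}}; |
|         EntryStub   entries[] = {   /* kept sorted by key in the entry tree */ |
|             {"cat", cat_rows, 3}, |
|             {"dog", dog_rows, 1}, |
|         }; |
| |
|         for (int i = 0; i < 2; i++) |
|             printf("%s -> %d heap row(s)\n", entries[i].key, entries[i].nposting); |
|         return 0; |
|     } |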
| |
| Note: There is no delete operation in the key (entry) tree. The reason for |
| this is that in our experience, the set of distinct words in a large corpus |
| changes very slowly. This greatly simplifies the code and concurrency |
| algorithms. |
| |
| Core PostgreSQL includes built-in Gin support for one-dimensional arrays |
| (eg. integer[], text[]). The following operations are available: |
| |
| * contains: value_array @> query_array |
| * overlaps: value_array && query_array |
| * is contained by: value_array <@ query_array |
| |
| Synopsis |
| -------- |
| |
| =# create index txt_idx on aa using gin(a); |
| |
| Features |
| -------- |
| |
| * Concurrency |
| * Write-Ahead Logging (WAL). (Recoverability from crashes.) |
| * User-defined opclasses. (The scheme is similar to GiST.) |
| * Optimized index creation (Makes use of maintenance_work_mem to accumulate |
| postings in memory.) |
| * Text search support via an opclass |
| * Soft upper limit on the returned results set using a GUC variable: |
| gin_fuzzy_search_limit |
| |
| Gin Fuzzy Limit |
| --------------- |
| |
| There are often situations when a full-text search returns a very large set of |
| results. Since reading tuples from the disk and sorting them could take a |
| lot of time, this is unacceptable for production. (Note that the search |
| itself is very fast.) |
| |
| Such queries usually contain very frequent lexemes, so the results are not |
| very helpful. To facilitate execution of such queries Gin has a configurable |
| soft upper limit on the size of the returned set, determined by the |
| 'gin_fuzzy_search_limit' GUC variable. This is set to 0 by default (no |
| limit). |
| |
| If a non-zero search limit is set, then the returned set is a subset of the |
| whole result set, chosen at random. |
| |
| "Soft" means that the actual number of returned results could differ |
| from the specified limit, depending on the query and the quality of the |
| system's random number generator. |
| |
| From experience, a value of 'gin_fuzzy_search_limit' in the thousands |
| (eg. 5000-20000) works well. This means that 'gin_fuzzy_search_limit' will |
| have no effect for queries returning a result set with fewer tuples than |
| this number. |
| |
| Index structure |
| --------------- |
| |
| The "items" that a GIN index indexes are composite values that contain |
| zero or more "keys". For example, an item might be an integer array, and |
| then the keys would be the individual integer values. The index actually |
| stores and searches for the key values, not the items per se. In the |
| pg_opclass entry for a GIN opclass, the opcintype is the data type of the |
| items, and the opckeytype is the data type of the keys. GIN is optimized |
| for cases where items contain many keys and the same key values appear |
| in many different items. |
| |
| A GIN index contains a metapage, a btree of key entries, and possibly |
| "posting tree" pages, which hold the overflow when a key entry acquires |
| too many heap tuple pointers to fit in a btree page. Additionally, if the |
| fast-update feature is enabled, there can be "list pages" holding "pending" |
| key entries that haven't yet been merged into the main btree. The list |
| pages have to be scanned linearly when doing a search, so the pending |
| entries should be merged into the main btree before there get to be too |
| many of them. The advantage of the pending list is that bulk insertion of |
| a few thousand entries can be much faster than retail insertion. (The win |
| comes mainly from not having to do multiple searches/insertions when the |
| same key appears in multiple new heap tuples.) |
| |
| Key entries are nominally of the same IndexTuple format as used in other |
| index types, but since a leaf key entry typically refers to multiple heap |
| tuples, there are significant differences. (See GinFormTuple, which works |
| by building a "normal" index tuple and then modifying it.) The points to |
| know are: |
| |
| * In a single-column index, a key tuple just contains the key datum, but |
| in a multi-column index, a key tuple contains the pair (column number, |
| key datum) where the column number is stored as an int2. This is needed |
| to support different key data types in different columns. This much of |
| the tuple is built by index_form_tuple according to the usual rules. |
| The column number (if present) can never be null, but the key datum can |
| be, in which case a null bitmap is present as usual. (As usual for index |
| tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.) |
| |
| * If the key datum is null (ie, IndexTupleHasNulls() is true), then |
| just after the nominal index data (ie, at offset IndexInfoFindDataOffset |
| or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating |
| the "category" of the null entry. These are the possible categories: |
| 1 = ordinary null key value extracted from an indexable item |
| 2 = placeholder for zero-key indexable item |
| 3 = placeholder for null indexable item |
| Placeholder null entries are inserted into the index because otherwise |
| there would be no index entry at all for an empty or null indexable item, |
| which would mean that full index scans couldn't be done and various corner |
| cases would give wrong answers. The different categories of null entries |
| are treated as distinct keys by the btree, but heap itempointers for the |
| same category of null entry are merged into one index entry just as happens |
| with ordinary key entries. |
| |
| * In a key entry at the btree leaf level, at the next SHORTALIGN boundary, |
| there is a list of item pointers, in compressed format (see Posting List |
| Compression section), pointing to the heap tuples for which the indexable |
| items contain this key. This is called the "posting list". |
| |
| If the list would be too big for the index tuple to fit on an index page, the |
| ItemPointers are pushed out to a separate posting page or pages, and none |
| appear in the key entry itself. The separate pages are called a "posting |
| tree" (see below). Note that in either case, the ItemPointers associated with |
| a key can easily be read out in sorted order; this is relied on by the scan |
| algorithms. |
| |
| * The index tuple header fields of a leaf key entry are abused as follows: |
| |
| 1) Posting list case: |
| |
| * ItemPointerGetBlockNumber(&itup->t_tid) contains the offset from index |
| tuple start to the posting list. |
| Access macros: GinGetPostingOffset(itup) / GinSetPostingOffset(itup,n) |
| |
| * ItemPointerGetOffsetNumber(&itup->t_tid) contains the number of elements |
| in the posting list (number of heap itempointers). |
| Access macros: GinGetNPosting(itup) / GinSetNPosting(itup,n) |
| |
| * If IndexTupleHasNulls(itup) is true, the null category byte can be |
| accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c) |
| |
| * The posting list can be accessed with GinGetPosting(itup) |
| |
| * If GinItupIsCompressed(itup), the posting list is stored in compressed |
| format. Otherwise it is just an array of ItemPointers. New tuples are always |
| stored in compressed format; uncompressed items can be present only if the |
| database was migrated from 9.3 or an earlier version. |
| |
| 2) Posting tree case: |
| |
| * ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number |
| of the root of the posting tree. |
| Access macros: GinGetPostingTree(itup) / GinSetPostingTree(itup, blkno) |
| |
| * ItemPointerGetOffsetNumber(&itup->t_tid) contains the magic number |
| GIN_TREE_POSTING, which distinguishes this from the posting-list case |
| (it's large enough that that many heap itempointers couldn't possibly |
| fit on an index page). This value is inserted automatically by the |
| GinSetPostingTree macro. |
| |
| * If IndexTupleHasNulls(itup) is true, the null category byte can be |
| accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c) |
| |
| * The posting list is not present and must not be accessed. |
| |
| Use the macro GinIsPostingTree(itup) to determine which case applies. |
| |
| In both cases, itup->t_info & INDEX_SIZE_MASK contains the actual total size |
| of the tuple, and the INDEX_VAR_MASK and INDEX_NULL_MASK bits have their |
| normal meanings as set by index_form_tuple. |
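| |
| To make the two cases concrete, here is a purely illustrative sketch; the |
| struct and the magic value below are stand-ins, not the real on-disk |
| definitions or the GinGet*/GinSet* macros: |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     typedef struct |
|     { |
|         uint32_t    block_field;   /* ~ ItemPointerGetBlockNumber(&itup->t_tid) */ |
|         uint16_t    offset_field;  /* ~ ItemPointerGetOffsetNumber(&itup->t_tid) */ |
|     } FakeTid; |
| |
|     #define FAKE_TREE_POSTING 0xFFFF  /* stand-in for the GIN_TREE_POSTING magic */ |
| |
|     static void |
|     describe(FakeTid tid) |
|     { |
|         if (tid.offset_field == FAKE_TREE_POSTING) |
|             printf("posting tree, root at index block %u\n", tid.block_field); |
|         else |
|             printf("posting list, %u item pointers at tuple offset %u\n", |
|                    tid.offset_field, tid.block_field); |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         FakeTid     list_entry = {48, 3};   /* 3 TIDs stored inline at offset 48 */ |
|         FakeTid     tree_entry = {1234, FAKE_TREE_POSTING}; /* tree at block 1234 */ |
| |
|         describe(list_entry); |
|         describe(tree_entry); |
|         return 0; |
|     } |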
| |
| Index tuples in non-leaf levels of the btree contain the optional column |
| number, key datum, and null category byte as above. They do not contain |
| a posting list. ItemPointerGetBlockNumber(&itup->t_tid) is the downlink |
| to the next lower btree level, and ItemPointerGetOffsetNumber(&itup->t_tid) |
| is InvalidOffsetNumber. Use the access macros GinGetDownlink/GinSetDownlink |
| to get/set the downlink. |
| |
| Index entries that appear in "pending list" pages work a tad differently as |
| well. The optional column number, key datum, and null category byte are as |
| for other GIN index entries. However, there is always exactly one heap |
| itempointer associated with a pending entry, and it is stored in the t_tid |
| header field just as in non-GIN indexes. There is no posting list. |
| Furthermore, the code that searches the pending list assumes that all |
| entries for a given heap tuple appear consecutively in the pending list and |
| are sorted by the column-number-plus-key-datum. The GIN_LIST_FULLROW page |
| flag bit tells whether entries for a given heap tuple are spread across |
| multiple pending-list pages. If GIN_LIST_FULLROW is set, the page contains |
| all the entries for one or more heap tuples. If GIN_LIST_FULLROW is clear, |
| the page contains entries for only one heap tuple, *and* they are not all |
| the entries for that tuple. (Thus, a heap tuple whose entries do not all |
| fit on one pending-list page must have those pages to itself, even if this |
| results in wasting much of the space on the preceding page and the last |
| page for the tuple.) |
| |
| GIN packs downlinks and pivot keys into internal page tuples in a different way |
| than nbtree does. Lehman & Yao define it as follows. |
| |
| P_0, K_1, P_1, K_2, P_2, ... , K_n, P_n, K_{n+1} |
| |
| Here P_i is a downlink and K_i is a key. K_i splits the key space between |
| P_{i-1} and P_i (1 <= i <= n). K_{n+1} is the high key. |
| |
| In an internal page, each tuple groups a key and a downlink together. nbtree |
| packs keys and downlinks into tuples as follows. |
| |
| (K_{n+1}, None), (-Inf, P_0), (K_1, P_1), ... , (K_n, P_n) |
| |
| Here tuples are shown in parentheses. The high key is stored separately; |
| P_i is grouped with K_i, and P_0 is grouped with the -Inf key. |
| |
| GIN packs keys and downlinks into tuples in a different way. |
| |
| (P_0, K_1), (P_1, K_2), ... , (P_n, K_{n+1}) |
| |
| P_i is grouped with K_{i+1}. -Inf key is not needed. |
| |
| A couple of additional notes regarding the K_{n+1} key: |
| 1) In the rightmost page of the entry tree, the key coupled with P_n doesn't |
| really matter: the high key is assumed to be infinity. |
| 2) In the posting tree, the key coupled with P_n never matters: the high key |
| for non-rightmost pages is stored separately and accessed via |
| GinDataPageGetRightBound(). |
| |
| Posting tree |
| ------------ |
| |
| If a posting list is too large to store in-line in a key entry, a posting tree |
| is created. A posting tree is a B-tree structure, where the ItemPointer is |
| used as the key. |
| |
| Internal posting tree pages use the standard PageHeader and the same "opaque" |
| struct as other GIN pages, but do not contain regular index tuples. Instead, |
| the content of the page is an array of PostingItem structs. Each PostingItem |
| consists of the block number of the child page, and the right bound of that |
| child page, as an ItemPointer. The right bound of the page itself is stored |
| right after the page header, before the PostingItem array. |
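| |
| As a rough, purely illustrative sketch of that layout (stand-in names, not |
| the real structs, and ignoring the PageHeader and opaque data): |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     typedef struct |
|     { |
|         uint32_t    block;          /* heap block number  */ |
|         uint16_t    offset;         /* heap offset number */ |
|     } TidStub;                      /* stand-in for ItemPointerData */ |
| |
|     typedef struct |
|     { |
|         uint32_t    child_blkno;    /* index block number of the child page */ |
|         TidStub     right_bound;    /* largest item pointer on that child */ |
|     } PostingItemStub;              /* stand-in for PostingItem */ |
| |
|     int |
|     main(void) |
|     { |
|         /* page payload: the page's own right bound, then the downlink array */ |
|         TidStub         page_right_bound = {500, 10}; |
|         PostingItemStub downlinks[] = { |
|             {11, {100, 3}},         /* child 11 holds items up to (100,3)   */ |
|             {12, {250, 7}},         /* child 12 holds items up to (250,7)   */ |
|             {13, {500, 10}},        /* child 13 holds items up to the bound */ |
|         }; |
| |
|         printf("right bound (%u,%u), %d downlinks\n", |
|                page_right_bound.block, page_right_bound.offset, |
|                (int) (sizeof(downlinks) / sizeof(downlinks[0]))); |
|         return 0; |
|     } |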
| |
| Posting tree leaf pages also use the standard PageHeader and opaque struct, |
| and the right bound of the page is stored right after the page header, but |
| the page content consists of a number of compressed posting lists. The |
| compressed posting lists are stored one after another, between the page header |
| and pd_lower. The space between pd_lower and pd_upper is unused, which allows |
| full-page images of posting tree leaf pages to skip the unused space in the |
| middle (buffer_std = true in XLogRecData). |
| |
| The item pointers are stored in a number of independent compressed posting |
| lists (also called segments), instead of one big one, to make random access |
| to a given item pointer faster: to find an item in a compressed list, you |
| have to read the list from the beginning, but when the items are split into |
| multiple lists, you can first skip over to the list containing the item you're |
| looking for, and read only that segment. Also, an update only needs to |
| re-encode the affected segment. |
| |
| Posting List Compression |
| ------------------------ |
| |
| To fit as many item pointers on a page as possible, posting tree leaf pages |
| and posting lists stored inline in entry tree leaf tuples use a lightweight |
| form of compression. We take advantage of the fact that the item pointers |
| are stored in sorted order. Instead of storing the block and offset number of |
| each item pointer separately, we store the difference from the previous item. |
| That in itself doesn't do much, but it allows us to use so-called varbyte |
| encoding to compress them. |
| |
| Varbyte encoding is a method to encode integers, allowing smaller numbers to |
| take less space at the cost of larger numbers. Each integer is represented by |
| a variable number of bytes. The high bit of each byte determines whether the |
| next byte is still part of the same number. Therefore, to read a single |
| varbyte-encoded number, you read bytes until you find a byte with the high |
| bit not set. |
| |
| When encoding, the block and offset number forming the item pointer are |
| combined into a single integer. The offset number is stored in the 11 low |
| bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block number |
| is stored in the higher bits. That requires 43 bits in total; since each |
| varbyte byte carries 7 data bits, a single encoded value takes at most 7 |
| bytes, and the small deltas between consecutive item pointers usually take |
| just one or two. |
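| |
| The following self-contained sketch mirrors only the scheme described above |
| (it is not the actual ginpostinglist.c code): an item pointer is combined |
| into one integer, and a delta between consecutive values is varbyte-encoded |
| and decoded. |
| |
|     #include <stdint.h> |
|     #include <stdio.h> |
| |
|     #define OFFSET_BITS 11              /* cf. MaxHeapTuplesPerPageBits */ |
| |
|     static uint64_t |
|     combine(uint32_t blkno, uint16_t offnum) |
|     { |
|         return ((uint64_t) blkno << OFFSET_BITS) | offnum; /* 43 significant bits */ |
|     } |
| |
|     static int |
|     encode_varbyte(uint64_t val, unsigned char *buf) |
|     { |
|         int         n = 0; |
| |
|         while (val > 0x7F) |
|         { |
|             buf[n++] = 0x80 | (val & 0x7F); /* 7 data bits, continuation bit set */ |
|             val >>= 7; |
|         } |
|         buf[n++] = (unsigned char) val;     /* last byte: high bit clear */ |
|         return n; |
|     } |
| |
|     static uint64_t |
|     decode_varbyte(const unsigned char *buf, int *nread) |
|     { |
|         uint64_t    val = 0; |
|         int         shift = 0; |
|         int         n = 0; |
|         unsigned char b; |
| |
|         do |
|         { |
|             b = buf[n++]; |
|             val |= (uint64_t) (b & 0x7F) << shift; |
|             shift += 7; |
|         } while (b & 0x80); |
|         *nread = n; |
|         return val; |
|     } |
| |
|     int |
|     main(void) |
|     { |
|         /* consecutive item pointers: (block 1000, off 5), (block 1000, off 17) */ |
|         uint64_t    prev = combine(1000, 5); |
|         uint64_t    next = combine(1000, 17); |
|         unsigned char buf[7];               /* worst case for a 43-bit value */ |
|         int         nbytes = encode_varbyte(next - prev, buf); |
|         int         nread; |
| |
|         printf("delta %llu encoded in %d byte(s), decodes to %llu\n", |
|                (unsigned long long) (next - prev), nbytes, |
|                (unsigned long long) decode_varbyte(buf, &nread)); |
|         return 0; |
|     } |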
| |
| A compressed posting list is passed around and stored on disk in a |
| GinPostingList struct. The first item in the list is stored uncompressed |
| as a regular ItemPointerData, followed by the length of the list in bytes, |
| followed by the packed items. |
| |
| Concurrency |
| ----------- |
| |
| The entry tree and each posting tree are B-trees, with right-links connecting |
| sibling pages at the same level. This is the same structure that is used in |
| the regular B-tree indexam (invented by Lehman & Yao), but we don't support |
| scanning GIN trees backwards, so we don't need left-links. The entry tree |
| leaves don't have dedicated high keys; instead, the greatest leaf tuple serves |
| as the high key. That works because tuples are never deleted from the entry |
| tree. |
| |
| The algorithms used to operate on the entry and posting trees are described |
| below. |
| |
| ### Locating the leaf page |
| |
| When we search for the leaf page of a GIN btree to perform a read, we descend |
| from the root page to the leaf, following downlinks and holding a pin and |
| shared lock on only one page at a time: the pin and shared lock on the |
| previous page are released before they are acquired on the next page. |
| |
| The picture below shows the tree state after finding the leaf page. Lower-case |
| letters depict tree pages; 'S' depicts a shared lock on the page. |
| |
| a |
| / | \ |
| b c d |
| / | \ | \ | \ |
| eS f g h i j k |
| |
| ### Stepping right |
| |
| Concurrent page splits move the keyspace to right, so after following a |
| downlink, the page actually containing the key we're looking for might be |
| somewhere to the right of the page we landed on. In that case, we follow the |
| right-links until we find the page we're looking for. |
| |
| When stepping right we take the pin and shared lock on the right sibling |
| before releasing them on the current page. This protects us from stepping |
| onto a deleted page: we move to the right sibling while still holding the |
| lock on the page whose rightlink points there, so it is guaranteed that |
| nobody can concurrently update that rightlink or delete the right sibling. |
| |
| The picture below shows two pages locked at once during stepping right. |
| |
| a |
| / | \ |
| b c d |
| / | \ | \ | \ |
| eS fS g h i j k |
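| |
| The two lock orderings can be summarized in a compile-only C sketch; the |
| Page type and every helper below are hypothetical stand-ins, not the real |
| GIN or buffer-manager API: |
| |
|     typedef struct Page Page; |
| |
|     /* hypothetical helpers standing in for pinning/locking shared buffers */ |
|     extern Page *lock_shared(Page *page); |
|     extern void  unlock(Page *page); |
|     extern int   is_leaf(const Page *page); |
|     extern Page *downlink_child(const Page *page, long key); |
|     extern Page *right_sibling(const Page *page); |
|     extern int   key_is_beyond_high_key(const Page *page, long key); |
| |
|     Page * |
|     find_leaf_for_read(Page *root, long key) |
|     { |
|         Page   *cur = lock_shared(root); |
| |
|         for (;;) |
|         { |
|             /* step right past concurrent splits: lock the sibling first, |
|              * and only then release the current page */ |
|             while (key_is_beyond_high_key(cur, key)) |
|             { |
|                 Page   *right = lock_shared(right_sibling(cur)); |
| |
|                 unlock(cur); |
|                 cur = right; |
|             } |
| |
|             if (is_leaf(cur)) |
|                 return cur; |
| |
|             /* descend: for a plain read it is fine to release the current |
|              * page before locking the child */ |
|             { |
|                 Page   *child = downlink_child(cur, key); |
| |
|                 unlock(cur); |
|                 cur = lock_shared(child); |
|             } |
|         } |
|     } |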
| |
| ### Insert |
| |
| To find the appropriate leaf for an insertion we also descend from the root |
| to a leaf, shared-locking one page at a time, but we do not release the pins |
| on the root and internal pages. Keeping those pins can save us some lookups |
| in the buffer hash table when downlinks have to be inserted, assuming the |
| parents have not been changed by concurrent splits. Once we reach the leaf, |
| we re-lock the page in exclusive mode. |
| |
| The picture below shows the leaf page locked in exclusive mode and ready for |
| insertion. 'P' and 'E' depict a pin and an exclusive lock, respectively. |
| |
| |
| aP |
| / | \ |
| b cP d |
| / | \ | \ | \ |
| e f g hE i j k |
| |
| |
| If the insert causes a page split, the parent is locked in exclusive mode |
| before the left child is unlocked. Thus, the insertion algorithm can hold |
| exclusive locks on both the parent and the child page at once, acquiring the |
| child's lock first. |
| |
| The picture below shows the tree state after a leaf page split. 'q' is the |
| new page produced by the split. Parent 'c' is about to have a downlink |
| inserted. |
| |
| aP |
| / | \ |
| b cE d |
| / | \ / | \ | \ |
| e f g hE q i j k |
| |
| |
| ### Page deletion |
| |
| Vacuum never deletes tuples or pages from the entry tree. It traverses the |
| entry tree leaves in logical order by rightlinks and removes deletable TIDs |
| from posting lists. Posting trees are reached through pointers in the entry |
| tree leaves and are vacuumed in two stages. In the first stage, deletable |
| TIDs are removed from the leaves. If the first stage detects at least one |
| empty page, then in the second stage ginScanToDelete() deletes the empty |
| pages. |
| |
| ginScanToDelete() traverses the whole tree in a depth-first manner. It starts |
| by taking a full cleanup lock on the tree root. This lock prevents any |
| concurrent insertions into this tree while we're deleting pages. However, |
| there might still be some in-progress readers that traversed the root before |
| we locked it. |
| |
| The picture below shows the tree state after the page deletion algorithm has |
| traversed to the leftmost leaf of the tree. |
| |
| aE |
| / | \ |
| bE c d |
| / | \ | \ | \ |
| eE f g h i j k |
| |
| The deletion algorithm keeps exclusive locks on the left siblings of the |
| pages comprising the currently investigated path. Thus, if the current page |
| is to be removed, all the pages required to remove both its downlink and |
| rightlink are already locked. That avoids a potential right-to-left page |
| locking order, which could deadlock with concurrent stepping right. |
| |
| A search concurrent to page deletion might already have read a pointer to the |
| page to be deleted, and might be just about to follow it. A page can be reached |
| via the right-link of its left sibling, or via its downlink in the parent. |
| |
| To prevent a backend from reaching a deleted page via a right-link, the |
| stepping-right algorithm doesn't release the lock on the current page until |
| the lock on the right page has been acquired. |
| |
| The downlink is more tricky. A search descending the tree must release the |
| lock on the parent page before locking the child, or it could deadlock with |
| a concurrent split of the child page: a page split locks the parent while |
| already holding a lock on the child page. So, a deleted page cannot be |
| reclaimed immediately. Instead, we have to wait for every transaction that |
| might still reference this page to finish. Such processes must observe that |
| the page is marked deleted and recover accordingly. |
| |
| The picture below shows the tree state after the page deletion algorithm has |
| traversed further into the tree. The currently investigated path is 'a-c-h'. |
| The left siblings 'b' and 'g' of 'c' and 'h', respectively, are also |
| exclusively locked. |
| |
| aE |
| / | \ |
| bE cE d |
| / | \ | \ | \ |
| e f gE hE i j k |
| |
| The next picture shows the tree state after page 'h' was deleted. It is |
| marked with the 'deleted' flag and the newest xid that might still visit it. |
| The downlink from 'c' to 'h' is also deleted. |
| |
| aE |
| / | \ |
| bE cE d |
| / | \ \ | \ |
| e f gE hD iE j k |
| |
| However, it is still possible that a concurrent reader saw the downlink from |
| 'c' to 'h' before we deleted it. In that case the reader will step right from |
| 'h' until it finds a non-deleted page. The xid marking of page 'h' guarantees |
| that the page won't be reused until all such readers are gone. The next leaf |
| page under investigation is 'i'. 'g' remains locked, as it becomes the left |
| sibling of 'i'. |
| |
| The next picture shows the tree state after 'i' and 'c' were deleted. |
| Internal page 'c' was deleted because it was left with no downlinks. The path |
| under investigation is 'a-d-j'. Pages 'b' and 'g' are locked as the left |
| siblings of 'd' and 'j', respectively. |
| |
| aE |
| / \ |
| bE cD dE |
| / | \ | \ |
| e f gE hD iD jE k |
| |
| During replay of a page deletion on a standby, the page's left sibling, the |
| target page, and its parent are locked in that order. This ordering |
| guarantees that there is no deadlock with concurrent reads. |
| |
| Predicate Locking |
| ----------------- |
| |
| GIN supports predicate locking, for serializable snapshot isolation. |
| A predicate lock represents that a scan has scanned a range of values. Such |
| locks are not concerned with physical pages as such, but with logical key |
| values. |
| A predicate lock on a page covers the key range that would belong on that |
| page, whether or not there are any matching tuples there currently. In other |
| words, a predicate lock on an index page covers the "gaps" between the index |
| tuples. To minimize false positives, predicate locks are acquired at the |
| finest level possible. |
| |
| * Like in the B-tree index, it is enough to lock only leaf pages, because all |
| insertions happen at the leaf level. |
| |
| * In an equality search (i.e. not a partial match search), if a key entry has |
| a posting tree, we lock the posting tree root page, to represent a lock on |
| just that key entry. Otherwise, we lock the entry tree page. We also lock |
| the entry tree page if no match is found, to lock the "gap" where the entry |
| would've been, had there been one. |
| |
| * In a partial match search, we lock all the entry leaf pages that we scan, |
| in addition to locks on posting tree roots, to represent the "gaps" between |
| values. |
| |
| * In addition to the locks on entry leaf pages and posting tree roots, all |
| scans grab a lock on the metapage. This is to interlock with insertions to |
| the fast update pending list. An insertion to the pending list can really |
| belong anywhere in the tree, and the lock on the metapage represents that. |
| |
| The interlock for fastupdate pending lists means that with fastupdate=on, |
| we effectively always grab a full-index lock, so you could get a lot of false |
| positives. |
| |
| Compatibility |
| ------------- |
| |
| Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in |
| the uncompressed format because of pg_upgrade from 9.3 or earlier versions. |
| For compatibility, the old uncompressed format is also supported. The |
| following rules are used to handle it: |
| |
| * GIN_ITUP_COMPRESSED flag marks index tuples that contain a posting list. |
| This flag is stored in the high bit of ItemPointerGetBlockNumber(&itup->t_tid). |
| Use GinItupIsCompressed(itup) to check the flag. |
| |
| * Posting tree pages in the new format are marked with the GIN_COMPRESSED flag. |
| Macros GinPageIsCompressed(page) and GinPageSetCompressed(page) are used to |
| check and set this flag. |
| |
| * All scan operations check the format of the posting list and use the |
| corresponding code to read its contents. |
| |
| * When updating an index tuple containing an uncompressed posting list, it |
| is replaced with a new index tuple containing a compressed list. |
| |
| * When updating an uncompressed posting tree leaf page, it's compressed. |
| |
| * If vacuum finds some dead TIDs in uncompressed posting lists, they are |
| converted into compressed posting lists. This assumes that the compressed |
| posting list fits in the space occupied by the uncompressed list. IOW, we |
| assume that the compressed version of the page, with the dead items removed, |
| takes less space than the old uncompressed version. |
| |
| Limitations |
| ----------- |
| |
| * Gin doesn't use scan->kill_prior_tuple & scan->ignore_killed_tuples |
| * Gin searches entries only by equality matching, or simple range |
| matching using the "partial match" feature. |
| |
| TODO |
| ---- |
| |
| Nearest future: |
| |
| * Opclasses for more types (no programming, just many catalog changes) |
| |
| Distant future: |
| |
| * Replace B-tree of entries to something like GiST |
| |
| Authors |
| ------- |
| |
| Original work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov |
| (oleg@sai.msu.su). |