| <!-- doc/src/sgml/hash.sgml --> |
| |
| <chapter id="hash-index"> |
| <title>Hash Indexes</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>Hash</secondary> |
| </indexterm> |
| |
| <sect1 id="hash-intro"> |
| <title>Overview</title> |
| |
| <para> |
| <productname>PostgreSQL</productname> |
| includes an implementation of persistent on-disk hash indexes, |
| which are fully crash recoverable. Any data type can be indexed by a |
| hash index, including data types that do not have a well-defined linear |
| ordering. Hash indexes store only the hash value of the data being |
| indexed, thus there are no restrictions on the size of the data column |
| being indexed. |
| </para> |
| |
| <para> |
| Hash indexes support only single-column indexes and do not allow |
| uniqueness checking. |
| </para> |
| |
| <para> |
| Hash indexes support only the <literal>=</literal> operator, |
| so WHERE clauses that specify range operations will not be able to take |
| advantage of hash indexes. |
| </para> |
| |
| <para> |
| Each hash index tuple stores just the 4-byte hash value, not the actual |
| column value. As a result, hash indexes may be much smaller than B-trees |
| when indexing longer data items such as UUIDs, URLs, etc. The absence of |
| the column value also makes all hash index scans lossy. Hash indexes may |
| take part in bitmap index scans and backward scans. |
| </para> |
| |
| <para> |
| Hash indexes are best optimized for SELECT and UPDATE-heavy workloads |
| that use equality scans on larger tables. In a B-tree index, searches must |
| descend through the tree until the leaf page is found. In tables with |
| millions of rows, this descent can increase access time to data. The |
| equivalent of a leaf page in a hash index is referred to as a bucket page. In |
| contrast, a hash index allows accessing the bucket pages directly, |
| thereby potentially reducing index access time in larger tables. This |
| reduction in "logical I/O" becomes even more pronounced on indexes/data |
| larger than shared_buffers/RAM. |
| </para> |
| |
| <para> |
| Hash indexes have been designed to cope with uneven distributions of |
| hash values. Direct access to the bucket pages works well if the hash |
| values are evenly distributed. When inserts mean that the bucket page |
| becomes full, additional overflow pages are chained to that specific |
| bucket page, locally expanding the storage for index tuples that match |
| that hash value. When scanning a hash bucket during queries, we need to |
| scan through all of the overflow pages. Thus an unbalanced hash index |
| might actually be worse than a B-tree in terms of number of block |
| accesses required, for some data. |
| </para> |
| |
| <para> |
| As a result of the overflow cases, we can say that hash indexes are |
| most suitable for unique, nearly unique data or data with a low number |
| of rows per hash bucket. |
| One possible way to avoid problems is to exclude highly non-unique |
| values from the index using a partial index condition, but this may |
| not be suitable in many cases. |
| </para> |
| |
| <para> |
| Like B-Trees, hash indexes perform simple index tuple deletion. This |
| is a deferred maintenance operation that deletes index tuples that are |
| known to be safe to delete (those whose item identifier's LP_DEAD bit |
| is already set). If an insert finds no space is available on a page we |
| try to avoid creating a new overflow page by attempting to remove dead |
| index tuples. Removal cannot occur if the page is pinned at that time. |
| Deletion of dead index pointers also occurs during VACUUM. |
| </para> |
| |
| <para> |
| If it can, VACUUM will also try to squeeze the index tuples onto as |
| few overflow pages as possible, minimizing the overflow chain. If an |
| overflow page becomes empty, overflow pages can be recycled for reuse |
| in other buckets, though we never return them to the operating system. |
| There is currently no provision to shrink a hash index, other than by |
| rebuilding it with REINDEX. |
| There is no provision for reducing the number of buckets, either. |
| </para> |
| |
| <para> |
| Hash indexes may expand the number of bucket pages as the number of |
| rows indexed grows. The hash key-to-bucket-number mapping is chosen so that |
| the index can be incrementally expanded. When a new bucket is to be added to |
| the index, exactly one existing bucket will need to be "split", with some of |
| its tuples being transferred to the new bucket according to the updated |
| key-to-bucket-number mapping. |
| </para> |
| |
| <para> |
| The expansion occurs in the foreground, which could increase execution |
| time for user inserts. Thus, hash indexes may not be suitable for tables |
| with rapidly increasing number of rows. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="hash-implementation"> |
| <title>Implementation</title> |
| |
| <para> |
| There are four kinds of pages in a hash index: the meta page (page zero), |
| which contains statically allocated control information; primary bucket |
| pages; overflow pages; and bitmap pages, which keep track of overflow |
| pages that have been freed and are available for re-use. For addressing |
| purposes, bitmap pages are regarded as a subset of the overflow pages. |
| </para> |
| |
| <para> |
| Both scanning the index and inserting tuples require locating the bucket |
| where a given tuple ought to be located. To do this, we need the bucket |
| count, highmask, and lowmask from the metapage; however, it's undesirable |
| for performance reasons to have to have to lock and pin the metapage for |
| every such operation. Instead, we retain a cached copy of the metapage |
| in each backend's relcache entry. This will produce the correct bucket |
| mapping as long as the target bucket hasn't been split since the last |
| cache refresh. |
| </para> |
| |
| <para> |
| Primary bucket pages and overflow pages are allocated independently since |
| any given index might need more or fewer overflow pages relative to its |
| number of buckets. The hash code uses an interesting set of addressing |
| rules to support a variable number of overflow pages while not having to |
| move primary bucket pages around after they are created. |
| </para> |
| |
| <para> |
| Each row in the table indexed is represented by a single index tuple in |
| the hash index. Hash index tuples are stored in bucket pages, and if |
| they exist, overflow pages. We speed up searches by keeping the index entries |
| in any one index page sorted by hash code, thus allowing binary search to be |
| used within an index page. Note however that there is *no* assumption about |
| the relative ordering of hash codes across different index pages of a bucket. |
| </para> |
| |
| <para> |
| The bucket splitting algorithms to expand the hash index are too complex to |
| be worthy of mention here, though are described in more detail in |
| <filename>src/backend/access/hash/README</filename>. |
| The split algorithm is crash safe and can be restarted if not completed |
| successfully. |
| </para> |
| |
| </sect1> |
| |
| </chapter> |