| $PostgreSQL: pgsql/src/backend/access/hash/README,v 1.8 2008/03/20 17:55:14 momjian Exp $ |
| |
| Hash Indexing |
| ============= |
| |
| This directory contains an implementation of hash indexing for Postgres. Most |
| of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing |
| Package for UNIX, Proceedings of the Winter USENIX Conference, January 1991. |
| (Our in-memory hashtable implementation, src/backend/utils/hash/dynahash.c, |
| also relies on some of the same concepts; it is derived from code written by |
| Esmond Pitt and later improved by Margo among others.) |
| |
| A hash index consists of two or more "buckets", into which tuples are
| placed according to the bucket number that their hash key maps to. The
| key-to-bucket-number mapping is chosen so that the index can be |
| incrementally expanded. When a new bucket is to be added to the index, |
| exactly one existing bucket will need to be "split", with some of its |
| tuples being transferred to the new bucket according to the updated |
| key-to-bucket-number mapping. This is essentially the same hash table |
| management technique embodied in src/backend/utils/hash/dynahash.c for |
| in-memory hash tables. |
| |
| Each bucket in the hash index comprises one or more index pages. The |
| bucket's first page is permanently assigned to it when the bucket is |
| created. Additional pages, called "overflow pages", are added if the |
| bucket receives too many tuples to fit in the primary bucket page. |
| The pages of a bucket are chained together in a doubly-linked list |
| using fields in the index page special space. |
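| |
| For reference, the special space carries roughly the following fields (a
| simplified sketch; the authoritative definition is HashPageOpaqueData in
| src/include/access/hash.h):
| |
|     typedef struct HashPageOpaqueData
|     {
|         BlockNumber hasho_prevblkno;    /* previous page in this bucket's
|                                          * chain, or InvalidBlockNumber */
|         BlockNumber hasho_nextblkno;    /* next (overflow) page in the
|                                          * chain, or InvalidBlockNumber */
|         Bucket      hasho_bucket;       /* bucket this page belongs to */
|         uint16      hasho_flag;         /* page type: bucket, overflow,
|                                          * bitmap, or meta */
|     } HashPageOpaqueData;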
| |
| There is currently no provision to shrink a hash index, other than by |
| rebuilding it with REINDEX. Overflow pages can be recycled for reuse |
| in other buckets, but we never give them back to the operating system. |
| There is no provision for reducing the number of buckets, either. |
| |
| |
| Page Addressing |
| --------------- |
| |
| There are four kinds of pages in a hash index: the meta page (page zero), |
| which contains statically allocated control information; primary bucket |
| pages; overflow pages; and bitmap pages, which keep track of overflow |
| pages that have been freed and are available for re-use. For addressing |
| purposes, bitmap pages are regarded as a subset of the overflow pages. |
| |
| Primary bucket pages and overflow pages are allocated independently (since |
| any given index might need more or fewer overflow pages relative to its |
| number of buckets). The hash code uses an interesting set of addressing |
| rules to support a variable number of overflow pages while not having to |
| move primary bucket pages around after they are created. |
| |
| Primary bucket pages (henceforth just "bucket pages") are allocated in |
| power-of-2 groups, called "split points" in the code. Buckets 0 and 1 |
| are created when the index is initialized. At the first split, buckets 2 |
| and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated; |
| when bucket 8 is needed, buckets 8-15 are allocated; etc. All the bucket |
| pages of a power-of-2 group appear consecutively in the index. This |
| addressing scheme allows the physical location of a bucket page to be |
| computed from the bucket number relatively easily, using only a small |
| amount of control information. We take the log2() of the bucket number
| to determine which split point S the bucket belongs to, and then simply
| add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
| metapage) to the bucket number to compute the physical block number.
| hashm_spares[S] can be
| interpreted as the total number of overflow pages that have been allocated |
| before the bucket pages of splitpoint S. hashm_spares[0] is always 0, |
| so that buckets 0 and 1 (which belong to splitpoint 0) always appear at |
| block numbers 1 and 2, just after the meta page. We always have |
| hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the |
| former. The difference between the two represents the number of overflow |
| pages appearing between the bucket page groups of splitpoints N and N+1. |
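| |
| For example, the computation looks roughly like this, written as a little
| helper for clarity (the real code uses the BUCKET_TO_BLKNO() macro in
| hash.h; _hash_log2() is the code's ceiling-log2 helper):
| |
|     /* Sketch: map a bucket number to its primary page's block number */
|     static BlockNumber
|     bucket_to_blkno(HashMetaPage metap, Bucket bucket)
|     {
|         uint32      splitpoint;
| |
|         /* buckets 2^S .. 2^(S+1)-1 belong to splitpoint S; buckets 0
|          * and 1 both belong to splitpoint 0 */
|         splitpoint = (bucket <= 1) ? 0 : _hash_log2(bucket + 1) - 1;
| |
|         /* block 0 is the metapage, hence the extra "+ 1" */
|         return (BlockNumber) (bucket + metap->hashm_spares[splitpoint] + 1);
|     }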
| |
| (Note: the above describes what happens when filling an initially minimally |
| sized hash index. In practice, we try to estimate the required index size |
| and allocate a suitable number of splitpoints immediately, to avoid |
| expensive re-splitting during initial index build.) |
| |
| When S splitpoints exist altogether, the array entries hashm_spares[0] |
| through hashm_spares[S] are valid; hashm_spares[S] records the current |
| total number of overflow pages. New overflow pages are created as needed |
| at the end of the index, and recorded by incrementing hashm_spares[S]. |
| When it is time to create a new splitpoint's worth of bucket pages, we |
| copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is |
| stored in the hashm_ovflpoint field of the meta page). This has the |
| effect of reserving the correct number of bucket pages at the end of the |
| index, and preparing to allocate additional overflow pages after those |
| bucket pages. hashm_spares[] entries before S cannot change anymore, |
| since that would require moving already-created bucket pages. |
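| |
| In code form, opening a new splitpoint is just this bit of metapage
| bookkeeping (a sketch of the step described above; the real logic lives
| in _hash_expandtable() in hashpage.c):
| |
|     /* Sketch: reserve the next power-of-2 group of bucket pages */
|     uint32      S = metap->hashm_ovflpoint;     /* current splitpoint */
| |
|     metap->hashm_spares[S + 1] = metap->hashm_spares[S];
|     metap->hashm_ovflpoint = S + 1;
|     /* future overflow pages are now counted in hashm_spares[S + 1];
|      * hashm_spares[S] and earlier entries are frozen from here on */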
| |
| The last page nominally used by the index is always determinable from |
| hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by |
| the filesystem and smgr must always be greater than or equal to this page. |
| We have to allow the case "greater than" because it's possible that during |
| an index extension we crash after allocating filesystem space and before |
| updating the metapage. Note that on filesystems that allow "holes" in |
| files, it's entirely likely that pages before the logical EOF are not yet |
| allocated: when we allocate a new splitpoint's worth of bucket pages, we |
| physically zero the last such page to force the EOF up, and the first such |
| page will be used immediately, but the intervening pages are not written |
| until needed. |
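| |
| A minimal sketch of how the EOF gets forced up (roughly what
| _hash_alloc_buckets() in hashpage.c does; error handling and the
| preliminary overflow checks are omitted):
| |
|     /* Sketch: write a zeroed page at the last block of the new
|      * splitpoint, pushing the physical EOF past all of the reserved
|      * bucket pages.  The pages in between stay unwritten (holes, where
|      * the filesystem supports them) until they are actually used. */
|     char        zerobuf[BLCKSZ];
| |
|     MemSet(zerobuf, 0, sizeof(zerobuf));
|     RelationOpenSmgr(rel);
|     smgrextend(rel->rd_smgr, lastblock, zerobuf, rel->rd_istemp);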
| |
| Since overflow pages may be recycled if enough tuples are deleted from |
| their bucket, we need a way to keep track of currently-free overflow |
| pages. The state of each overflow page (0 = available, 1 = not available) |
| is recorded in "bitmap" pages dedicated to this purpose. The entries in |
| the bitmap are indexed by "bit number", a zero-based count in which every |
| overflow page has a unique entry. We can convert between an overflow |
| page's physical block number and its bit number using the information in |
| hashm_spares[] (see hashovfl.c for details). The bit number sequence |
| includes the bitmap pages, which is the reason for saying that bitmap |
| pages are a subset of the overflow pages. It turns out in fact that each |
| bitmap page's first bit represents itself --- this is not an essential |
| property, but falls out of the fact that we only allocate another bitmap |
| page when we really need one. Bit number zero always corresponds to the |
| first bitmap page, which is allocated during index creation just after all |
| the initially created buckets. |
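| |
| To locate the bit for a given overflow page, the page's bit number
| (obtained from its block number via hashm_spares[]) is split into a
| bitmap-page index and a position within that page, roughly as follows
| (a sketch; the real conversions live in hashovfl.c and the bitmap-size
| macros in hash.h):
| |
|     /* Sketch: find the bitmap bit for overflow bit number ovflbitnum */
|     uint32      bitmappage = ovflbitnum >> BMPG_SHIFT(metap);  /* which bitmap page */
|     uint32      bitmapbit  = ovflbitnum &  BMPG_MASK(metap);   /* which bit in it */
| |
|     /* the metapage's hashm_mapp[] array remembers where each bitmap
|      * page physically lives */
|     BlockNumber mapblkno   = metap->hashm_mapp[bitmappage];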
| |
| |
| Lock Definitions |
| ---------------- |
| |
| We use both lmgr locks ("heavyweight" locks) and buffer context locks |
| (LWLocks) to control access to a hash index. lmgr locks are needed for |
| long-term locking since there is a (small) risk of deadlock, which we must |
| be able to detect. Buffer context locks are used for short-term access |
| control to individual pages of the index. |
| |
| We define the following lmgr locks for a hash index: |
| |
| LockPage(rel, 0) represents the right to modify the hash-code-to-bucket |
| mapping. A process attempting to enlarge the hash table by splitting a |
| bucket must exclusive-lock this lock before modifying the metapage data |
| representing the mapping. Processes intending to access a particular |
| bucket must share-lock this lock until they have acquired lock on the |
| correct target bucket. |
| |
| LockPage(rel, page), where page is the page number of a hash bucket page, |
| represents the right to split or compact an individual bucket. A process |
| splitting a bucket must exclusive-lock both old and new halves of the |
| bucket until it is done. A process doing VACUUM must exclusive-lock the |
| bucket it is currently purging tuples from. Processes doing scans or |
| insertions must share-lock the bucket they are scanning or inserting into. |
| (It is okay to allow concurrent scans and insertions.) |
| |
| The lmgr lock IDs corresponding to overflow pages are currently unused. |
| These are available for possible future refinements. |
| |
| Note that these lock definitions are conceptually distinct from any sort |
| of lock on the pages whose numbers they share. A process must also obtain |
| read or write buffer lock on the metapage or bucket page before accessing |
| said page. |
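| |
| To make the distinction concrete, here is a rough sketch of taking the
| page-zero lmgr lock and then a buffer content lock on the metapage (the
| hash code wraps these primitives in helpers such as _hash_getlock() and
| _hash_getbuf(); error handling omitted):
| |
|     Buffer      metabuf;
| |
|     /* lmgr lock: long-lived, participates in deadlock detection */
|     LockPage(rel, 0, ShareLock);
| |
|     /* buffer content lock: short-term, protects the physical page */
|     metabuf = ReadBuffer(rel, HASH_METAPAGE);
|     LockBuffer(metabuf, BUFFER_LOCK_SHARE);
|     /* ... read what we need from the metapage ... */
|     LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
|     ReleaseBuffer(metabuf);
| |
|     /* the page-zero lmgr lock stays held until we have locked the
|      * target bucket (see the reader algorithm below) */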
| |
| Processes performing hash index scans must hold share lock on the bucket |
| they are scanning throughout the scan. This seems to be essential, since |
| there is no reasonable way for a scan to cope with its bucket being split |
| underneath it. This creates a possibility of deadlock external to the |
| hash index code, since a process holding one of these locks could block |
| waiting for an unrelated lock held by another process. If that process |
| then does something that requires exclusive lock on the bucket, we have |
| deadlock. Therefore the bucket locks must be lmgr locks so that deadlock |
| can be detected and recovered from. This also forces the page-zero lock |
| to be an lmgr lock, because as we'll see below it is held while attempting |
| to acquire a bucket lock, and so it could also participate in a deadlock. |
| |
| Processes must obtain read (share) buffer context lock on any hash index |
| page while reading it, and write (exclusive) lock while modifying it. |
| To prevent deadlock we enforce these coding rules: no buffer lock may be |
| held long term (across index AM calls), nor may any buffer lock be held |
| while waiting for an lmgr lock, nor may more than one buffer lock |
| be held at a time by any one process. (The third restriction is probably |
| stronger than necessary, but it makes the proof of no deadlock obvious.) |
| |
| |
| Pseudocode Algorithms |
| --------------------- |
| |
| The operations we need to support are: readers scanning the index for |
| entries of a particular hash code (which by definition are all in the same |
| bucket); insertion of a new tuple into the correct bucket; enlarging the |
| hash table by splitting an existing bucket; and garbage collection |
| (deletion of dead tuples and compaction of buckets). Bucket splitting is |
| done at the conclusion of any insertion that leaves the hash table more full
| than the target load factor, but it is convenient to consider it as an |
| independent operation. Note that we do not have a bucket-merge operation |
| --- the number of buckets never shrinks. Insertion, splitting, and |
| garbage collection may all need access to freelist management, which keeps |
| track of available overflow pages. |
| |
| The reader algorithm is: |
| |
| share-lock page 0 (to prevent active split) |
| read/sharelock meta page |
| compute bucket number for target hash key |
| release meta page |
| share-lock bucket page (to prevent split/compact of this bucket) |
| release page 0 share-lock |
| -- then, per read request: |
| read/sharelock current page of bucket |
| step to next page if necessary (no chaining of locks) |
| get tuple |
| release current page |
| -- at scan shutdown: |
| release bucket share-lock |
| |
| By holding the page-zero lock until lock on the target bucket is obtained, |
| the reader ensures that the target bucket calculation is valid (otherwise |
| the bucket might be split before the reader arrives at it, and the target |
| entries might go into the new bucket). Holding the bucket sharelock for |
| the remainder of the scan prevents the reader's current-tuple pointer from |
| being invalidated by other processes. Notice though that the reader need |
| not prevent other buckets from being split or compacted. |
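| |
| The "compute bucket number for target hash key" step uses the mask
| fields kept in the metapage; the logic is essentially that of
| _hash_hashkey2bucket() in hashutil.c:
| |
|     /* Map a 32-bit hash value to a bucket number, given the metapage's
|      * current maxbucket/highmask/lowmask fields. */
|     Bucket
|     _hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
|                          uint32 highmask, uint32 lowmask)
|     {
|         Bucket      bucket;
| |
|         bucket = hashkey & highmask;
|         if (bucket > maxbucket)
|             bucket = bucket & lowmask;  /* not split this far yet; use
|                                          * the smaller table size */
|         return bucket;
|     }
| |
| (maxbucket, highmask, and lowmask must all be read under one metapage
| lock, which is why the pseudocode holds the metapage sharelock while
| computing the bucket number.)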
| |
| The insertion algorithm is rather similar: |
| |
| share-lock page 0 (to prevent active split) |
| read/sharelock meta page |
| compute bucket number for target hash key |
| release meta page |
| share-lock bucket page (to prevent split/compact of this bucket) |
| release page 0 share-lock |
| -- (so far same as reader) |
| read/exclusive-lock current page of bucket |
| if full, release, read/exclusive-lock next page; repeat as needed |
| >> see below if no space in any page of bucket |
| insert tuple |
| write/release current page |
| release bucket share-lock |
| read/exclusive-lock meta page |
| increment tuple count, decide if split needed |
| write/release meta page |
| done if no split needed, else enter Split algorithm below |
| |
| It is okay for an insertion to take place in a bucket that is being |
| actively scanned, because it does not change the position of any existing |
| item in the bucket, so scan states are not invalidated. We only need the |
| short-term buffer locks to ensure that readers do not see a |
| partially-updated page. |
| |
| It is clearly impossible for readers and inserters to deadlock, and in |
| fact this algorithm allows them a very high degree of concurrency. |
| (The exclusive metapage lock taken to update the tuple count is stronger |
| than necessary, since readers do not care about the tuple count, but the |
| lock is held for such a short time that this is probably not an issue.) |
| |
| When an inserter cannot find space in any existing page of a bucket, it |
| must obtain an overflow page and add that page to the bucket's chain. |
| Details of that part of the algorithm appear later. |
| |
| The page split algorithm is entered whenever an inserter observes that the |
| index is overfull (has a higher-than-wanted ratio of tuples to buckets). |
| The algorithm attempts, but does not necessarily succeed, to split one |
| existing bucket in two, thereby lowering the fill ratio: |
| |
| exclusive-lock page 0 (assert the right to begin a split) |
| read/exclusive-lock meta page |
| check split still needed |
| if split not needed anymore, drop locks and exit |
| decide which bucket to split |
| Attempt to X-lock old bucket number (definitely could fail) |
| Attempt to X-lock new bucket number (shouldn't fail, but...) |
| if above fail, drop locks and exit |
| update meta page to reflect new number of buckets |
| write/release meta page |
| release X-lock on page 0 |
| -- now, accesses to all other buckets can proceed. |
| Perform actual split of bucket, moving tuples as needed |
| >> see below about acquiring needed extra space |
| Release X-locks of old and new buckets |
| |
| Note the page zero and metapage locks are not held while the actual tuple |
| rearrangement is performed, so accesses to other buckets can proceed in |
| parallel; in fact, it's possible for multiple bucket splits to proceed |
| in parallel. |
| |
| Split's attempt to X-lock the old bucket number could fail if another |
| process holds S-lock on it. We do not want to wait if that happens, first |
| because we don't want to wait while holding the metapage exclusive-lock, |
| and second because it could very easily result in deadlock. (The other |
| process might be out of the hash AM altogether, and could do something |
| that blocks on another lock this process holds; so even if the hash |
| algorithm itself is deadlock-free, a user-induced deadlock could occur.) |
| So, this is a conditional LockAcquire operation, and if it fails we just |
| abandon the attempt to split. This is all right since the index is |
| overfull but perfectly functional. Every subsequent inserter will try to |
| split, and eventually one will succeed. Even if several inserters fail in
| a row, the index merely stays overfull a little longer; eventually enough
| splits succeed that it is no longer overfull, and the attempts stop. (We
| could make a successful
| splitter loop to see if the index is still overfull, but it seems better to |
| distribute the split overhead across successive insertions.) |
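| |
| The conditional acquisition looks roughly like this (a fragment sketching
| what _hash_expandtable() does; start_oblkno and start_nblkno stand for
| the primary block numbers of the old and new buckets, and the "fail"
| path, which drops the metapage and page-zero locks, is omitted):
| |
|     /* Try to X-lock both buckets without blocking; if either attempt
|      * fails we just drop our locks and skip the split. */
|     if (!ConditionalLockPage(rel, start_oblkno, ExclusiveLock))
|         goto fail;                      /* old bucket is busy; no split */
|     if (!ConditionalLockPage(rel, start_nblkno, ExclusiveLock))
|     {
|         UnlockPage(rel, start_oblkno, ExclusiveLock);
|         goto fail;                      /* shouldn't happen, but play safe */
|     }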
| |
| A problem is that if a split fails partway through (eg due to insufficient |
| disk space) the index is left corrupt. The probability of that could be |
| made quite low if we grab a free page or two before we update the meta |
| page, but the only real solution is to treat a split as a WAL-loggable, |
| must-complete action. I'm not planning to teach hash about WAL in this |
| go-round. |
| |
| The fourth operation is garbage collection (bulk deletion): |
| |
| next bucket := 0 |
| read/sharelock meta page |
| fetch current max bucket number |
| release meta page |
| while next bucket <= max bucket do |
| Acquire X lock on target bucket |
| Scan and remove tuples, compact free space as needed |
| Release X lock |
| next bucket ++ |
| end loop |
| exclusive-lock meta page |
| check if number of buckets changed |
| if so, release lock and return to for-each-bucket loop |
| else update metapage tuple count |
| write/release meta page |
| |
| Note that this is designed to allow concurrent splits. If a split occurs, |
| tuples relocated into the new bucket will be visited twice by the scan, |
| but that does no harm. (We must however be careful about the statistics |
| reported by the VACUUM operation. What we can do is count the number of |
| tuples scanned, and believe this in preference to the stored tuple count |
| if the stored tuple count and number of buckets did *not* change at any |
| time during the scan. This provides a way of correcting the stored tuple |
| count if it gets out of sync for some reason. But if a split or insertion |
| does occur concurrently, the scan count is untrustworthy; instead, |
| subtract the number of tuples deleted from the stored tuple count and |
| use that.) |
| |
| The exclusive lock request could deadlock in some strange scenarios, but |
| we can just error out without any great harm being done. |
| |
| |
| Free Space Management |
| --------------------- |
| |
| (Question: why is this so complicated? Why not just have a linked list |
| of free pages with the list head in the metapage? It's not like we |
| avoid needing to modify the metapage with all this.) |
| |
| Free space management consists of two sub-algorithms, one for reserving |
| an overflow page to add to a bucket chain, and one for returning an empty |
| overflow page to the free pool. |
| |
| Obtaining an overflow page: |
| |
| read/exclusive-lock meta page |
| determine next bitmap page number; if none, exit loop |
| release meta page lock |
| read/exclusive-lock bitmap page |
| search for a free page (zero bit in bitmap) |
| if found: |
| set bit in bitmap |
| write/release bitmap page |
| read/exclusive-lock meta page |
| if first-free-bit value did not change, |
| update it and write meta page |
| release meta page |
| return page number |
| else (not found): |
| release bitmap page |
| loop back to try next bitmap page, if any |
| -- here when we have checked all bitmap pages; we hold meta excl. lock |
| extend index to add another overflow page; update meta information |
| write/release meta page |
| return page number |
| |
| It is slightly annoying to release and reacquire the metapage lock |
| multiple times, but it seems best to do it that way to minimize loss of |
| concurrency against processes just entering the index. We don't want |
| to hold the metapage exclusive lock while reading in a bitmap page. |
| (We can at least avoid repeated buffer pin/unpin here.) |
| |
| The normal path for extending the index does not require doing I/O while |
| holding the metapage lock. We do have to do I/O when the extension |
| requires adding a new bitmap page as well as the required overflow page |
| ... but that is an infrequent case, so the loss of concurrency seems |
| acceptable. |
| |
| The portion of tuple insertion that calls the above subroutine looks |
| like this: |
| |
| -- having determined that no space is free in the target bucket: |
| remember last page of bucket, drop write lock on it |
| call free-page-acquire routine |
| re-write-lock last page of bucket |
| if it is not last anymore, step to the last page |
| update (former) last page to point to new page |
| write-lock and initialize new page, with back link to former last page |
| write and release former last page |
| insert tuple into new page |
| -- etc. |
| |
| Notice this handles the case where two concurrent inserters try to extend |
| the same bucket. They will end up with a valid, though perhaps |
| space-inefficient, configuration: two overflow pages will be added to the |
| bucket, each containing one tuple. |
| |
| The last part of this violates the rule about holding write lock on two |
| pages concurrently, but it should be okay to write-lock the previously |
| free page; there can be no other process holding lock on it. |
| |
| Bucket splitting uses a similar algorithm if it has to extend the new |
| bucket, but it need not worry about concurrent extension since it has |
| exclusive lock on the new bucket. |
| |
| Freeing an overflow page is done by garbage collection and by bucket |
| splitting (the old bucket may contain no-longer-needed overflow pages). |
| In both cases, the process holds exclusive lock on the containing bucket, |
| so need not worry about other accessors of pages in the bucket. The |
| algorithm is: |
| |
| delink overflow page from bucket chain |
| (this requires read/update/write/release of fore and aft siblings) |
| read/share-lock meta page |
| determine which bitmap page contains the free space bit for page |
| release meta page |
| read/exclusive-lock bitmap page |
| update bitmap bit |
| write/release bitmap page |
| if the page's bit number is less than what we saw as first-free-bit in meta:
| read/exclusive-lock meta page
| if the bit number is still less than first-free-bit,
| update first-free-bit field and write meta page |
| release meta page |
| |
| We have to do it this way because we must clear the bitmap bit before |
| changing the first-free-bit field (hashm_firstfree). It is possible that |
| we set first-free-bit too small (because someone has already reused the |
| page we just freed), but that is okay; the only cost is that the next
| overflow page acquirer will scan more bitmap bits than it needs to. What
| must be
| avoided is having first-free-bit greater than the actual first free bit, |
| because then that free page would never be found by searchers. |
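| |
| The ordering constraint, in code form (a sketch using the bitmap macros
| from hash.h; the real sequence is in _hash_freeovflpage() in hashovfl.c):
| |
|     /* First make the page genuinely free ... */
|     CLRBIT(freep, bitmapbit);       /* freep points into the bitmap page */
|     /* ... write out and release the bitmap page ... */
| |
|     /* ... and only then consider lowering the first-free-bit hint */
|     if (ovflbitnum < metap->hashm_firstfree)
|         metap->hashm_firstfree = ovflbitnum;    /* may end up too small,
|                                                  * which is harmless;
|                                                  * never too large */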
| |
| All the freespace operations should be called while holding no buffer |
| locks. Since they need no lmgr locks, deadlock is not possible. |
| |
| |
| Other Notes |
| ----------- |
| |
| All the shenanigans with locking prevent a split occurring while *another* |
| process is stopped in a given bucket. They do not ensure that one of |
| our *own* backend's scans is not stopped in the bucket, because lmgr |
| doesn't consider a process's own locks to conflict. So the Split |
| algorithm must check for that case separately before deciding it can go |
| ahead with the split. VACUUM does not have this problem since nothing |
| else can be happening within the vacuuming backend. |
| |
| Should we instead try to fix the state of any conflicting local scan? |
| Seems mighty ugly --- got to move the held bucket S-lock as well as lots |
| of other messiness. For now, just punt and don't split. |