| src/backend/access/hash/README |
| |
| Hash Indexing |
| ============= |
| |
| This directory contains an implementation of hash indexing for Postgres. |
| Most of the core ideas are taken from Margo Seltzer and Ozan Yigit, |
| A New Hashing Package for UNIX, Proceedings of the Winter USENIX Conference, |
| January 1991. (Our in-memory hashtable implementation, |
| src/backend/utils/hash/dynahash.c, also relies on some of the same concepts; |
| it is derived from code written by Esmond Pitt and later improved by Margo |
| among others.) |
| |
| A hash index consists of two or more "buckets", into which tuples are |
| placed whenever their hash key maps to the bucket number. The |
| key-to-bucket-number mapping is chosen so that the index can be |
| incrementally expanded. When a new bucket is to be added to the index, |
| exactly one existing bucket will need to be "split", with some of its |
| tuples being transferred to the new bucket according to the updated |
| key-to-bucket-number mapping. This is essentially the same hash table |
| management technique embodied in src/backend/utils/hash/dynahash.c for |
| in-memory hash tables. |
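| 
| As a concrete illustration, the key-to-bucket mapping can be written as the
| following minimal, self-contained sketch; it mirrors the logic of
| _hash_hashkey2bucket() in hashutil.c, but the snippet itself is standalone C
| rather than the actual PostgreSQL code:
| 
|     #include <stdint.h>
| 
|     /*
|      * Map a 32-bit hash code to a bucket number, linear-hashing style.
|      * "maxbucket" is the highest bucket number currently in existence;
|      * "highmask" and "lowmask" are (2^n - 1) masks bracketing maxbucket.
|      * If the masked value points past the last existing bucket, fall back
|      * to the smaller mask; that lower-numbered bucket keeps receiving the
|      * key until it is eventually split.
|      */
|     static uint32_t
|     hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
|                       uint32_t highmask, uint32_t lowmask)
|     {
|         uint32_t bucket = hashkey & highmask;
| 
|         if (bucket > maxbucket)
|             bucket = bucket & lowmask;
| 
|         return bucket;
|     }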
| |
| Each bucket in the hash index comprises one or more index pages. The |
| bucket's first page is permanently assigned to it when the bucket is |
| created. Additional pages, called "overflow pages", are added if the |
| bucket receives too many tuples to fit in the primary bucket page. |
| The pages of a bucket are chained together in a doubly-linked list |
| using fields in the index page special space. |
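| 
| For orientation, the special space of each hash index page holds a structure
| along the lines of the sketch below (modelled on HashPageOpaqueData in
| hash.h; the typedef name here is invented and the field widths are
| illustrative):
| 
|     #include <stdint.h>
| 
|     typedef uint32_t BlockNumber;       /* stand-in for PostgreSQL's BlockNumber */
| 
|     typedef struct HashPageOpaqueSketch
|     {
|         BlockNumber hasho_prevblkno;    /* previous page in the bucket's chain
|                                          * (see "Metapage Caching" below for the
|                                          * primary page's special use of this) */
|         BlockNumber hasho_nextblkno;    /* next page in the chain, if any */
|         uint32_t    hasho_bucket;       /* bucket number this page belongs to */
|         uint16_t    hasho_flag;         /* page type code plus flag bits */
|         uint16_t    hasho_page_id;      /* identifies the page as a hash page */
|     } HashPageOpaqueSketch;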
| |
| There is currently no provision to shrink a hash index, other than by |
| rebuilding it with REINDEX. Overflow pages can be recycled for reuse |
| in other buckets, but we never give them back to the operating system. |
| There is no provision for reducing the number of buckets, either. |
| |
| As of PostgreSQL 8.4, hash index entries store only the hash code, not the |
| actual data value, for each indexed item. This makes the index entries |
| smaller (perhaps very substantially so) and speeds up various operations. |
| In particular, we can speed searches by keeping the index entries in any |
| one index page sorted by hash code, thus allowing binary search to be used |
| within an index page. Note however that there is *no* assumption about the |
| relative ordering of hash codes across different index pages of a bucket. |
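| 
| For example, a per-page lookup can use a standard lower-bound binary search
| over the page's sorted hash codes, conceptually like the sketch below (the
| real code does the equivalent with _hash_binsearch() over the page's item
| array; the function and array used here are simplified stand-ins):
| 
|     #include <stddef.h>
|     #include <stdint.h>
| 
|     /*
|      * Given one page's entries sorted by hash code, find the first slot
|      * whose hash code is >= "want".  All matches, if any, then lie at
|      * consecutive slots starting there, so a scan can binary-search each
|      * page instead of examining every entry.
|      */
|     static size_t
|     first_slot_at_or_above(const uint32_t *hashcodes, size_t nentries,
|                            uint32_t want)
|     {
|         size_t lo = 0;
|         size_t hi = nentries;           /* half-open search range [lo, hi) */
| 
|         while (lo < hi)
|         {
|             size_t mid = lo + (hi - lo) / 2;
| 
|             if (hashcodes[mid] < want)
|                 lo = mid + 1;
|             else
|                 hi = mid;
|         }
|         return lo;                      /* == nentries if all entries are smaller */
|     }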
| |
| |
| Page Addressing |
| --------------- |
| |
| There are four kinds of pages in a hash index: the meta page (page zero), |
| which contains statically allocated control information; primary bucket |
| pages; overflow pages; and bitmap pages, which keep track of overflow |
| pages that have been freed and are available for re-use. For addressing |
| purposes, bitmap pages are regarded as a subset of the overflow pages. |
| |
| Primary bucket pages and overflow pages are allocated independently (since |
| any given index might need more or fewer overflow pages relative to its |
| number of buckets). The hash index code uses an interesting set of addressing
| rules to support a variable number of overflow pages while not having to |
| move primary bucket pages around after they are created. |
| |
| Primary bucket pages (henceforth just "bucket pages") are allocated in |
| power-of-2 groups, called "split points" in the code. That means at every new |
| splitpoint we double the existing number of buckets. Allocating such a huge
| chunk of bucket pages all at once would not be optimal, since it could take a
| long time for those buckets to be consumed. To avoid this sudden jump in
| index size, we break up the allocation of buckets at a splitpoint into 4
| equal phases. If (2 ^ x) is the total number of buckets to be allocated at a
| splitpoint (from now on we shall call this a splitpoint group), then we
| allocate 1/4th (2 ^ (x - 2)) of the total buckets in each phase of the
| splitpoint group. The next quarter is allocated only after the buckets of the
| previous phase have been consumed. For the initial splitpoint groups < 10 we
| allocate all of their buckets in a single phase, since the number of buckets
| allocated by those groups is small. For groups >= 10 the allocation is
| distributed among four equal phases. At group 10 we allocate (2 ^ 9) buckets
| in 4 phases {2 ^ 7, 2 ^ 7, 2 ^ 7, 2 ^ 7}, where the numbers in curly braces
| indicate the number of buckets allocated within each phase of splitpoint
| group 10. For splitpoint groups 11 and 12 the allocation phases are
| {2 ^ 8, 2 ^ 8, 2 ^ 8, 2 ^ 8} and {2 ^ 9, 2 ^ 9, 2 ^ 9, 2 ^ 9} respectively.
| Thus at each splitpoint group we still double the total number of buckets
| relative to the previous group, but the doubling happens incrementally, one
| phase at a time. The bucket pages
| allocated within one phase of a splitpoint group will appear consecutively in |
| the index. This addressing scheme allows the physical location of a bucket |
| page to be computed from the bucket number relatively easily, using only a |
| small amount of control information. As in the function _hash_spareindex,
| for a given bucket number we first compute the splitpoint group it belongs to
| and then the phase within that group to which the bucket belongs. Adding them
| gives the global splitpoint phase number S to which the bucket belongs; we
| then simply add "hashm_spares[S - 1] + 1" (where hashm_spares[] is an array
| stored in the metapage) to the given bucket number to compute its physical
| block number. The value hashm_spares[S] can be interpreted as the total
| number of overflow pages that have been allocated before the bucket pages of
| splitpoint phase S + 1. hashm_spares[0] is always 0, so that buckets 0 and 1
| always appear at block numbers 1 and 2, just after the meta page. We always
| have hashm_spares[N] <= hashm_spares[N+1], since the latter count includes
| the former. The difference between the two represents the number of overflow
| pages appearing between the bucket page groups of splitpoint phases N and N+1.
| (Note: the above describes what happens when filling an initially minimally |
| sized hash index. In practice, we try to estimate the required index size and |
| allocate a suitable number of splitpoint phases immediately, to avoid
| expensive re-splitting during initial index build.) |
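| 
| The following standalone sketch shows the same address computation in C. It
| mirrors _hash_spareindex() and the BUCKET_TO_BLKNO() macro, with the literal
| constant 10 standing in for HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE and a naive
| ceil-log2 helper replacing the bit-twiddling used in the real code:
| 
|     #include <stdint.h>
| 
|     /* Smallest g such that 2^g >= n, for n >= 1 (i.e. ceil(log2(n))). */
|     static uint32_t
|     ceil_log2(uint32_t n)
|     {
|         uint32_t g = 0;
| 
|         while (((uint32_t) 1 << g) < n)
|             g++;
|         return g;
|     }
| 
|     /*
|      * Global splitpoint phase into which bucket numbers 0 .. num_buckets-1
|      * fit: groups below 10 consist of a single phase each, while larger
|      * groups are divided into 4 quarter-sized phases.
|      */
|     static uint32_t
|     splitpoint_phase(uint32_t num_buckets)
|     {
|         uint32_t group = ceil_log2(num_buckets);
|         uint32_t quarter;
| 
|         if (group < 10)
|             return group;
| 
|         /* which quarter of its group does the highest bucket fall into? */
|         quarter = ((num_buckets - 1) >> (group - 3)) & 3;
|         return 10 + 4 * (group - 10) + quarter;
|     }
| 
|     /*
|      * Physical block number of a bucket: the bucket number, plus the number
|      * of overflow pages allocated before that bucket's splitpoint phase
|      * (spares[S - 1]), plus one for the metapage.
|      */
|     static uint32_t
|     bucket_to_blkno(const uint32_t *spares, uint32_t bucket)
|     {
|         uint32_t before = (bucket == 0) ? 0
|             : spares[splitpoint_phase(bucket + 1) - 1];
| 
|         return bucket + before + 1;
|     }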
| |
| When S splitpoints exist altogether, the array entries hashm_spares[0] |
| through hashm_spares[S] are valid; hashm_spares[S] records the current |
| total number of overflow pages. New overflow pages are created as needed |
| at the end of the index, and recorded by incrementing hashm_spares[S]. |
| When it is time to create a new splitpoint phase's worth of bucket pages, we |
| copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is |
| stored in the hashm_ovflpoint field of the meta page). This has the |
| effect of reserving the correct number of bucket pages at the end of the |
| index, and preparing to allocate additional overflow pages after those |
| bucket pages. hashm_spares[] entries before S cannot change anymore, |
| since that would require moving already-created bucket pages. |
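| 
| In code form, starting a new splitpoint phase amounts to little more than the
| following sketch (the field names are those of the metapage; bounds checks,
| locking, and WAL-logging are omitted):
| 
|     #include <stdint.h>
| 
|     /*
|      * Reserve the bucket pages of a new splitpoint phase.  Copying
|      * spares[S] into spares[S+1] records that no overflow pages exist yet
|      * after the new phase's bucket pages; bumping hashm_ovflpoint makes
|      * future overflow pages get appended beyond those bucket pages.
|      */
|     static void
|     start_new_splitpoint_phase(uint32_t *hashm_spares, uint32_t *hashm_ovflpoint)
|     {
|         uint32_t s = *hashm_ovflpoint;
| 
|         hashm_spares[s + 1] = hashm_spares[s];
|         *hashm_ovflpoint = s + 1;
|     }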
| |
| The last page nominally used by the index is always determinable from |
| hashm_spares[S]. To avoid complaints from smgr, the logical EOF as seen by |
| the filesystem and smgr must always be greater than or equal to this page. |
| We have to allow the case "greater than" because it's possible that during |
| an index extension we crash after allocating filesystem space and before |
| updating the metapage. Note that on filesystems that allow "holes" in |
| files, it's entirely likely that pages before the logical EOF are not yet |
| allocated: when we allocate a new splitpoint phase's worth of bucket pages, we |
| physically zero the last such page to force the EOF up, and the first such |
| page will be used immediately, but the intervening pages are not written |
| until needed. |
| |
| Since overflow pages may be recycled if enough tuples are deleted from |
| their bucket, we need a way to keep track of currently-free overflow |
| pages. The state of each overflow page (0 = available, 1 = not available) |
| is recorded in "bitmap" pages dedicated to this purpose. The entries in |
| the bitmap are indexed by "bit number", a zero-based count in which every |
| overflow page has a unique entry. We can convert between an overflow |
| page's physical block number and its bit number using the information in |
| hashm_spares[] (see hashovfl.c for details). The bit number sequence |
| includes the bitmap pages, which is the reason for saying that bitmap |
| pages are a subset of the overflow pages. It turns out in fact that each |
| bitmap page's first bit represents itself --- this is not an essential |
| property, but falls out of the fact that we only allocate another bitmap |
| page when we really need one. Bit number zero always corresponds to the |
| first bitmap page, which is allocated during index creation just after all |
| the initially created buckets. |
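| 
| As a rough sketch of that bookkeeping (the real code keeps the bitmap as an
| array of 32-bit words and derives the bits-per-page figure from the page size
| via the metapage's hashm_bmsize and hashm_bmshift fields; the constant and
| helper functions below are simplified stand-ins):
| 
|     #include <stdint.h>
| 
|     /* assumed capacity of one bitmap page, for the sketch only */
|     #define BITS_PER_BITMAP_PAGE  (4096u * 8u)
| 
|     /* Which bitmap page holds the status bit for overflow-page bit "bitnum"? */
|     static uint32_t
|     bitmap_page_index(uint32_t bitnum)
|     {
|         return bitnum / BITS_PER_BITMAP_PAGE;
|     }
| 
|     /* ...and which bit within that bitmap page? */
|     static uint32_t
|     bit_within_page(uint32_t bitnum)
|     {
|         return bitnum % BITS_PER_BITMAP_PAGE;
|     }
| 
|     /* Test one overflow page's status bit: 1 = in use, 0 = free. */
|     static int
|     overflow_page_in_use(const uint32_t *bitmap_words, uint32_t bit)
|     {
|         return (bitmap_words[bit / 32] >> (bit % 32)) & 1;
|     }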
| |
| |
| Lock Definitions |
| ---------------- |
| |
| Concurrency control for hash indexes is provided using buffer content |
| locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL, |
| cleanup lock means that we hold an exclusive lock on the buffer and have |
| observed at some point after acquiring the lock that we hold the only pin |
| on that buffer. For hash indexes, a cleanup lock on a primary bucket page |
| represents the right to perform an arbitrary reorganization of the entire |
| bucket. Therefore, scans retain a pin on the primary bucket page for the |
| bucket they are currently scanning. Splitting a bucket requires a cleanup |
| lock on both the old and new primary bucket pages. VACUUM therefore takes |
| a cleanup lock on every bucket page in order to remove tuples. It can also |
| remove tuples copied to a new bucket by any previous split operation, because |
| the cleanup lock taken on the primary bucket page guarantees that no scans |
| which started prior to the most recent split can still be in progress. After |
| cleaning each page individually, it attempts to take a cleanup lock on the |
| primary bucket page in order to "squeeze" the bucket down to the minimum |
| possible number of pages. |
| |
| To avoid deadlocks, we must be consistent about the lock order in which we |
| lock the buckets for operations that require locks on two different buckets.
| We choose to always lock the lower-numbered bucket first. The metapage is |
| only ever locked after all bucket locks have been taken. |
| |
| |
| Metapage Caching |
| ---------------- |
| |
| Both scanning the index and inserting tuples require locating the bucket |
| where a given tuple ought to be located. To do this, we need the bucket |
| count, highmask, and lowmask from the metapage; however, it's undesirable |
| for performance reasons to have to lock and pin the metapage for
| every such operation. Instead, we retain a cached copy of the metapage |
| in each backend's relcache entry. This will produce the correct |
| bucket mapping as long as the target bucket hasn't been split since the |
| last cache refresh. |
| |
| To guard against the possibility that such a split has occurred, the |
| primary page of each bucket chain stores the number of buckets that |
| existed as of the time the bucket was last split, or, if never split, as of
| the time it was created, in the space normally used for the
| previous block number (that is, hasho_prevblkno). This doesn't cost |
| anything because the primary bucket page is always the first page in |
| the chain, and the previous block number is therefore always, in |
| reality, InvalidBlockNumber. |
| |
| After computing the ostensibly-correct bucket number based on our cached |
| copy of the metapage, we lock the corresponding primary bucket page and |
| check whether the bucket count stored in hasho_prevblkno is greater than |
| the number of buckets stored in our cached copy of the metapage. If |
| so, the bucket has certainly been split, because the count must originally |
| have been less than the number of buckets that existed at that time and |
| can't have increased except due to a split. If not, the bucket can't have |
| been split, because a split would have created a new bucket with a higher |
| bucket number than any we'd seen previously. In the latter case, we've |
| locked the correct bucket and can proceed; in the former case, we must |
| release the lock on this bucket, lock the metapage, update our cache, |
| unlock the metapage, and retry. |
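| 
| Schematically, the lookup-with-retry loop looks like the sketch below. The
| helper names (cached_metapage, lock_bucket_and_read_split_count, and so on)
| are hypothetical placeholders rather than PostgreSQL functions; the real
| logic lives in _hash_getbucketbuf_from_hashkey():
| 
|     #include <stdint.h>
| 
|     typedef struct
|     {
|         uint32_t nbuckets;              /* bucket count in the cached metapage */
|         uint32_t highmask;
|         uint32_t lowmask;
|     } CachedMeta;
| 
|     /* hypothetical helpers standing in for buffer and lock management */
|     extern CachedMeta *cached_metapage(void);
|     extern void refresh_cached_metapage(void);  /* lock metapage, copy, unlock */
|     extern uint32_t lock_bucket_and_read_split_count(uint32_t bucket);
|     extern void unlock_bucket(uint32_t bucket);
|     extern uint32_t hashkey_to_bucket(uint32_t hashkey, uint32_t maxbucket,
|                                       uint32_t highmask, uint32_t lowmask);
| 
|     /*
|      * Map a hash key to a bucket using the cached metapage, locking the
|      * bucket's primary page and retrying only if that bucket has been split
|      * since the cache was last refreshed.
|      */
|     static uint32_t
|     locate_and_lock_bucket(uint32_t hashkey)
|     {
|         for (;;)
|         {
|             CachedMeta *meta = cached_metapage();
|             uint32_t bucket = hashkey_to_bucket(hashkey, meta->nbuckets - 1,
|                                                 meta->highmask, meta->lowmask);
|             /* hasho_prevblkno of the primary page: bucket count at last split */
|             uint32_t count_at_split = lock_bucket_and_read_split_count(bucket);
| 
|             if (count_at_split <= meta->nbuckets)
|                 return bucket;          /* cache fresh enough: correct bucket */
| 
|             unlock_bucket(bucket);      /* bucket was split: refresh and retry */
|             refresh_cached_metapage();
|         }
|     }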
| |
| Needing to retry occasionally might seem expensive, but the number of times |
| any given bucket can be split is limited to a few dozen no matter how |
| many times the hash index is accessed, because the total number of |
| buckets is limited to less than 2^32. On the other hand, the number of |
| times we access a bucket is unbounded and will be several orders of |
| magnitude larger even in unsympathetic cases. |
| |
| (The metapage cache is new in v10. Older hash indexes had the primary |
| bucket page's hasho_prevblkno initialized to InvalidBlockNumber.)
| |
| Pseudocode Algorithms |
| --------------------- |
| |
| The various flags used by hash index operations are described below:
| |
| The bucket-being-split and bucket-being-populated flags indicate that a split
| operation is in progress for a bucket. During the split, the
| bucket-being-split flag is set on the old bucket and the
| bucket-being-populated flag is set on the new bucket. These flags are cleared
| once the split operation is finished.
| |
| The split-cleanup flag indicates that a bucket which has been recently split |
| still contains tuples that were also copied to the new bucket; it essentially |
| marks the split as incomplete. Once we're certain that no scans which |
| started before the new bucket was fully populated are still in progress, we |
| can remove the copies from the old bucket and clear the flag. We insist that |
| this flag must be clear before splitting a bucket; thus, a bucket can't be |
| split again until the previous split is totally complete. |
| |
| The moved-by-split flag on a tuple indicates that the tuple was moved from
| the old bucket to the new one. Concurrent scans skip such tuples until the
| split operation is finished. Once a tuple has been marked as moved-by-split,
| it remains so forever, but that does no harm; we intentionally do not clear
| the flag, since doing so would generate additional, unnecessary I/O.
| |
| The operations we need to support are: readers scanning the index for |
| entries of a particular hash code (which by definition are all in the same |
| bucket); insertion of a new tuple into the correct bucket; enlarging the |
| hash table by splitting an existing bucket; and garbage collection |
| (deletion of dead tuples and compaction of buckets). Bucket splitting is |
| done at the conclusion of any insertion that leaves the hash table more full
| than the target load factor, but it is convenient to consider it as an |
| independent operation. Note that we do not have a bucket-merge operation |
| --- the number of buckets never shrinks. Insertion, splitting, and |
| garbage collection may all need access to freelist management, which keeps |
| track of available overflow pages. |
| |
| The reader algorithm is: |
| |
| lock the primary bucket page of the target bucket |
| if the target bucket is still being populated by a split:
|     release the buffer content lock on current bucket page
|     pin and acquire the buffer content lock on old bucket in shared mode
|     release the buffer content lock on old bucket, but not pin
|     retake the buffer content lock on new bucket
|     arrange to scan the old bucket normally and the new bucket for
|      tuples which are not moved-by-split
| -- then, per read request: |
| reacquire content lock on current page |
| step to next page if necessary (no chaining of content locks, but keep |
| the pin on the primary bucket throughout the scan) |
| save all the matching tuples from current index page into an items array |
| release pin and content lock (but if it is primary bucket page retain |
| its pin till the end of the scan) |
| get tuple from an item array |
| -- at scan shutdown: |
| release all pins still held |
| |
| Holding the buffer pin on the primary bucket page for the whole scan prevents |
| the reader's current-tuple pointer from being invalidated by splits or |
| compactions. (Of course, other buckets can still be split or compacted.) |
| |
| To minimize lock/unlock traffic, hash index scan always searches the entire |
| hash page to identify all the matching items at once, copying their heap tuple |
| IDs into backend-local storage. The heap tuple IDs are then processed while
| not holding any page lock within the index, thereby allowing concurrent
| insertions into the same index page without any need for the reader to
| re-find its current scan position. We do continue to hold a pin on the
| bucket page, to protect against concurrent deletions and bucket splits.
| |
| To allow for scans during a bucket split: if, at the start of the scan, the
| bucket is marked as bucket-being-populated, the scan reads all the tuples in
| that bucket except for those marked as moved-by-split. Once it has finished
| scanning all the tuples in the current bucket, it then scans the old bucket
| from which this bucket was formed by the split.
| |
| The insertion algorithm is rather similar: |
| |
| lock the primary bucket page of the target bucket |
| -- (so far same as reader, except for acquisition of buffer content lock in |
| exclusive mode on primary bucket page) |
| if the bucket-being-split flag is set for a bucket and pin count on it is
|  one, then finish the split
|     release the buffer content lock on current bucket
|     get the "new" bucket which was being populated by the split
|     scan the new bucket and form the hash table of TIDs
|     conditionally get the cleanup lock on old and new buckets
|     if we get the lock on both the buckets
|         finish the split using algorithm mentioned below for split
|     release the pin on old bucket and restart the insert from beginning.
| if current page is full, first check if this page contains any dead tuples.
|  if yes, remove dead tuples from the current page and again check for the
|  availability of the space. If enough space found, insert the tuple else
|  release lock but not pin, read/exclusive-lock
|  next page; repeat as needed
| >> see below if no space in any page of bucket
| take buffer content lock in exclusive mode on metapage |
| insert tuple at appropriate place in page |
| mark current page dirty |
| increment tuple count, decide if split needed |
| mark meta page dirty |
| write WAL for insertion of tuple |
| release the buffer content lock on metapage |
| release buffer content lock on current page |
| if current page is not a bucket page, release the pin on bucket page |
| if split is needed, enter Split algorithm below |
| release the pin on metapage |
| |
| To speed searches, the index entries within any individual index page are |
| kept sorted by hash code; the insertion code must take care to insert new |
| entries in the right place. It is okay for an insertion to take place in a |
| bucket that is being actively scanned, because readers can cope with this |
| as explained above. We only need the short-term buffer locks to ensure |
| that readers do not see a partially-updated page. |
| |
| To avoid deadlock between readers and inserters, whenever there is a need |
| to lock multiple buckets, we always take them in the order suggested in Lock
| Definitions above. This allows readers and inserters a very high degree of
| concurrency. (The exclusive metapage lock taken to update the tuple count |
| is stronger than necessary, since readers do not care about the tuple count, |
| but the lock is held for such a short time that this is probably not an |
| issue.) |
| |
| When an inserter cannot find space in any existing page of a bucket, it |
| must obtain an overflow page and add that page to the bucket's chain. |
| Details of that part of the algorithm appear later. |
| |
| The page split algorithm is entered whenever an inserter observes that the |
| index is overfull (has a higher-than-wanted ratio of tuples to buckets). |
| The algorithm attempts, but does not necessarily succeed, to split one |
| existing bucket in two, thereby lowering the fill ratio: |
| |
| pin meta page and take buffer content lock in exclusive mode |
| check split still needed |
| if split not needed anymore, drop buffer content lock and pin and exit |
| decide which bucket to split |
| try to take a cleanup lock on that bucket; if fail, give up |
| if that bucket is still being split or has split-cleanup work:
|     try to finish the split and the cleanup work
|     if that succeeds, start over; if it fails, give up
| mark the old and new buckets indicating split is in progress |
| mark both old and new buckets as dirty |
| write WAL for allocation of new page for split |
| copy the tuples that belong to the new bucket from the old bucket, marking
|  them as moved-by-split
| write WAL record for moving tuples to new page once the new page is full |
| or all the pages of old bucket are finished |
| release lock but not pin for primary bucket page of old bucket, |
| read/shared-lock next page; repeat as needed |
| clear the bucket-being-split and bucket-being-populated flags |
| mark the old bucket indicating split-cleanup |
| write WAL for changing the flags on both old and new buckets |
| |
| The split operation's attempt to acquire a cleanup lock on the old bucket
| could fail if another process holds any lock or pin on it. We do not want to
| wait if that happens, because we don't want to wait while holding the metapage |
| exclusive-lock. So, this is a conditional LWLockAcquire operation, and if |
| it fails we just abandon the attempt to split. This is all right since the |
| index is overfull but perfectly functional. Every subsequent inserter will |
| try to split, and eventually one will succeed. If multiple inserters failed |
| to split, the index might still be overfull, but eventually, the index will |
| not be overfull and split attempts will stop. (We could make a successful |
| splitter loop to see if the index is still overfull, but it seems better to |
| distribute the split overhead across successive insertions.) |
| |
| If a split fails partway through (e.g. due to insufficient disk space or an |
| interrupt), the index will not be corrupted. Instead, we'll retry the split |
| every time a tuple is inserted into the old bucket prior to inserting the new |
| tuple; eventually, we should succeed. The fact that a split is left |
| unfinished doesn't prevent subsequent buckets from being split, but we won't |
| try to split the bucket again until the prior split is finished. In other |
| words, a bucket can be in the middle of being split for some time, but it can't |
| be in the middle of two splits at the same time. |
| |
| The fourth operation is garbage collection (bulk deletion): |
| |
| next bucket := 0 |
| pin metapage and take buffer content lock in exclusive mode |
| fetch current max bucket number |
| release meta page buffer content lock and pin |
| while next bucket <= max bucket do
|     acquire cleanup lock on primary bucket page
|     loop:
|         scan and remove tuples
|         mark the target page dirty
|         write WAL for deleting tuples from target page
|         if this is the last bucket page, break out of loop
|         pin and x-lock next page
|         release prior lock and pin (except keep pin on primary bucket page)
|     if the page we have locked is not the primary bucket page:
|         release lock and take exclusive lock on primary bucket page
|     if there are no other pins on the primary bucket page:
|         squeeze the bucket to remove free space
|     release the pin on primary bucket page
|     next bucket ++
| end loop
| pin metapage and take buffer content lock in exclusive mode |
| check if number of buckets changed |
| if so, release content lock and pin and return to for-each-bucket loop |
| else update metapage tuple count |
| mark meta page dirty and write WAL for update of metapage |
| release buffer content lock and pin |
| |
| Note that this is designed to allow concurrent splits and scans. If a split |
| occurs, tuples relocated into the new bucket will be visited twice by the |
| scan, but that does no harm. See also "Interlocking Between Scans and |
| VACUUM", below. |
| |
| We must be careful about the statistics reported by the VACUUM operation. |
| What we can do is count the number of tuples scanned, and believe this in |
| preference to the stored tuple count if the stored tuple count and number of |
| buckets did *not* change at any time during the scan. This provides a way of |
| correcting the stored tuple count if it gets out of sync for some reason. But |
| if a split or insertion does occur concurrently, the scan count is |
| untrustworthy; instead, subtract the number of tuples deleted from the stored |
| tuple count and use that. |
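| 
| In outline, that decision looks like the following sketch (the variables
| mirror what hashbulkdelete() tracks, but the function itself is illustrative
| only):
| 
|     #include <stdint.h>
| 
|     /*
|      * Choose the tuple count to store back into the metapage after a bulk
|      * delete.  orig_* are the metapage values sampled before the scan and
|      * cur_* the values re-read at the end; tuples_kept and tuples_removed
|      * are what the scan itself counted.
|      */
|     static double
|     new_metapage_tuple_count(double orig_ntuples, uint32_t orig_maxbucket,
|                              double cur_ntuples, uint32_t cur_maxbucket,
|                              double tuples_kept, double tuples_removed)
|     {
|         if (cur_ntuples == orig_ntuples && cur_maxbucket == orig_maxbucket)
|         {
|             /* no concurrent insert or split: trust our own scan count */
|             return tuples_kept;
|         }
| 
|         /* concurrent activity: dead-reckon from the stored count instead */
|         return (cur_ntuples > tuples_removed) ? cur_ntuples - tuples_removed : 0;
|     }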
| |
| Interlocking Between Scans and VACUUM |
| ------------------------------------- |
| |
| Since we release the lock on bucket page during a cleanup scan of a bucket, a |
| concurrent scan could start in that bucket before we've finished vacuuming it. |
| If a scan gets ahead of cleanup, we could have the following problem: (1) the |
| scan sees heap TIDs that are about to be removed before they are processed by |
| VACUUM, (2) the scan decides that one or more of those TIDs are dead, (3) |
| VACUUM completes, (4) one or more of the TIDs the scan decided were dead are |
| reused for an unrelated tuple, and finally (5) the scan wakes up and |
| erroneously kills the new tuple. |
| |
| Note that this requires VACUUM and a scan to be active in the same bucket at |
| the same time. If VACUUM completes before the scan starts, the scan never has |
| a chance to see the dead tuples; if the scan completes before the VACUUM |
| starts, the heap TIDs can't have been reused meanwhile. Furthermore, VACUUM |
| can't start on a bucket that has an active scan, because the scan holds a pin |
| on the primary bucket page, and VACUUM must take a cleanup lock on that page |
| in order to begin cleanup. Therefore, the only way this problem can occur is |
| for a scan to start after VACUUM has released the cleanup lock on the bucket |
| but before it has processed the entire bucket and then overtake the cleanup |
| operation. |
| |
| Currently, we prevent this using lock chaining: cleanup locks the next page |
| in the chain before releasing the lock and pin on the page just processed. |
| |
| Free Space Management |
| --------------------- |
| |
| (Question: why is this so complicated? Why not just have a linked list |
| of free pages with the list head in the metapage? It's not like we |
| avoid needing to modify the metapage with all this.) |
| |
| Free space management consists of two sub-algorithms, one for reserving |
| an overflow page to add to a bucket chain, and one for returning an empty |
| overflow page to the free pool. |
| |
| Obtaining an overflow page: |
| |
| take metapage content lock in exclusive mode |
| determine next bitmap page number; if none, exit loop |
| release meta page content lock |
| pin bitmap page and take content lock in exclusive mode |
| search for a free page (zero bit in bitmap) |
| if found:
|     set bit in bitmap
|     mark bitmap page dirty
|     take metapage buffer content lock in exclusive mode
|     if first-free-bit value did not change,
|         update it and mark meta page dirty
| else (not found):
|     release bitmap page buffer content lock
|     loop back to try next bitmap page, if any
| -- here when we have checked all bitmap pages; we hold meta excl. lock |
| extend index to add another overflow page; update meta information |
| mark meta page dirty |
| return page number |
| |
| It is slightly annoying to release and reacquire the metapage lock |
| multiple times, but it seems best to do it that way to minimize loss of |
| concurrency against processes just entering the index. We don't want |
| to hold the metapage exclusive lock while reading in a bitmap page. |
| (We can at least avoid repeated buffer pin/unpin here.) |
| |
| The normal path for extending the index does not require doing I/O while |
| holding the metapage lock. We do have to do I/O when the extension |
| requires adding a new bitmap page as well as the required overflow page |
| ... but that is an infrequent case, so the loss of concurrency seems |
| acceptable. |
| |
| The portion of tuple insertion that calls the above subroutine looks |
| like this: |
| |
| -- having determined that no space is free in the target bucket: |
| remember last page of bucket, drop write lock on it |
| re-write-lock last page of bucket |
| if it is not last anymore, step to the last page |
| execute free-page-acquire (obtaining an overflow page) mechanism |
| described above |
| update (former) last page to point to the new page and mark buffer dirty |
| write-lock and initialize new page, with back link to former last page |
| write WAL for addition of overflow page |
| release the locks on meta page and bitmap page acquired in |
| free-page-acquire algorithm |
| release the lock on former last page |
| release the lock on new overflow page |
| insert tuple into new page |
| -- etc. |
| |
| Notice this handles the case where two concurrent inserters try to extend |
| the same bucket. They will end up with a valid, though perhaps |
| space-inefficient, configuration: two overflow pages will be added to the |
| bucket, each containing one tuple. |
| |
| The last part of this violates the rule about holding write lock on two |
| pages concurrently, but it should be okay to write-lock the previously |
| free page; there can be no other process holding lock on it. |
| |
| Bucket splitting uses a similar algorithm if it has to extend the new |
| bucket, but it need not worry about concurrent extension since it has |
| buffer content lock in exclusive mode on the new bucket. |
| |
| Freeing an overflow page requires the process to hold buffer content lock in |
| exclusive mode on the containing bucket, so it need not worry about other
| accessors of pages in the bucket. The algorithm is: |
| |
| delink overflow page from bucket chain |
| (this requires read/update/write/release of fore and aft siblings) |
| pin meta page and take buffer content lock in shared mode |
| determine which bitmap page contains the free space bit for page |
| release meta page buffer content lock |
| pin bitmap page and take buffer content lock in exclusive mode |
| retake meta page buffer content lock in exclusive mode |
| move (insert) tuples that belong to the overflow page being freed |
| update bitmap bit |
| mark bitmap page dirty |
| if page number is still less than first-free-bit, |
| update first-free-bit field and mark meta page dirty |
| write WAL for delinking overflow page operation |
| release buffer content lock and pin |
| release meta page buffer content lock and pin |
| |
| We have to do it this way because we must clear the bitmap bit before |
| changing the first-free-bit field (hashm_firstfree). It is possible that |
| we set first-free-bit too small (because someone has already reused the |
| page we just freed), but that is okay; the only cost is the next overflow |
| page acquirer will scan more bitmap bits than he needs to. What must be |
| avoided is having first-free-bit greater than the actual first free bit, |
| because then that free page would never be found by searchers. |
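| 
| The required ordering can be sketched as follows (locking and WAL-logging
| omitted; the bitmap is shown as an array of 32-bit words, with a zero bit
| meaning "free"):
| 
|     #include <stdint.h>
| 
|     /*
|      * Record that the overflow page with bit number "bitnum" is now free.
|      * The bitmap bit is cleared first; only then is hashm_firstfree allowed
|      * to move, and only downward.  A too-small firstfree merely costs the
|      * next allocator some extra scanning; a too-large one would hide a free
|      * page forever.
|      */
|     static void
|     record_overflow_page_free(uint32_t *bitmap_words, uint32_t bit_in_page,
|                               uint32_t bitnum, uint32_t *hashm_firstfree)
|     {
|         bitmap_words[bit_in_page / 32] &= ~((uint32_t) 1 << (bit_in_page % 32));
| 
|         if (bitnum < *hashm_firstfree)
|             *hashm_firstfree = bitnum;
|     }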
| |
| The reason for moving tuples from the overflow page while delinking it is to
| make the whole change a single atomic operation. Not doing so could lead to
| spurious reads on a standby; basically, the user might see the same tuple
| twice.
| |
| |
| WAL Considerations |
| ------------------ |
| |
| Hash index operations such as create index, insert, delete, bucket split,
| allocate overflow page, and squeeze do not by themselves guarantee hash index
| consistency after a crash. To provide robustness, we write WAL for each of
| these operations. |
| |
| CREATE INDEX writes multiple WAL records. First, we write a record to cover |
| the initialization of the metapage, followed by one for each new bucket
| created, followed by one for the initial bitmap page. It's not important for |
| index creation to appear atomic, because the index isn't yet visible to any |
| other transaction, and the creating transaction will roll back in the event of |
| a crash. It would be difficult to cover the whole operation with a single |
| write-ahead log record anyway, because we can log only a fixed number of |
| pages, as given by XLR_MAX_BLOCK_ID (32), with current XLog machinery. |
| |
| Ordinary item insertions (that don't force a page split or need a new overflow |
| page) are single WAL entries. They touch a single bucket page and the |
| metapage. The metapage is updated during replay just as it is during the
| original operation.
| |
| If an insertion causes the addition of an overflow page, there will be one |
| WAL entry for the new overflow page and a second entry for the insert itself.
| |
| If an insertion causes a bucket split, there will be one WAL entry for the
| insert itself, followed by a WAL entry for allocating a new bucket, followed
| by a WAL entry for each page of the new bucket to which tuples are moved
| from the old bucket, followed by a WAL entry to indicate that the split is
| complete for both old and new buckets. A split operation which requires
| overflow pages to complete the operation will need to write a WAL record for |
| each new allocation of an overflow page. |
| |
| As splitting involves multiple atomic actions, it's possible that the system
| crashes partway through moving tuples from the pages of the old bucket to the
| new bucket. In such a case, after recovery, the old and new buckets will
| still be marked with the bucket-being-split and bucket-being-populated flags
| respectively, which indicates that a split is in progress for those buckets.
| The reader
| algorithm works correctly, as it will scan both the old and new buckets when |
| the split is in progress as explained in the reader algorithm section above. |
| |
| We finish the split at the next insert or split operation on the old bucket,
| as explained in the insert and split algorithms above. It could be done during
| searches, too, but it seems best not to put any extra updates in what would |
| otherwise be a read-only operation (updating is not possible in hot standby |
| mode anyway). It would seem natural to complete the split in VACUUM, but since |
| splitting a bucket might require allocating a new page, it might fail if you |
| run out of disk space. That would be bad during VACUUM - the reason for |
| running VACUUM in the first place might be that you run out of disk space, |
| and now VACUUM won't finish because you're out of disk space. In contrast, |
| an insertion can require enlarging the physical file anyway. |
| |
| Deletion of tuples from a bucket is performed for two reasons: to remove dead |
| tuples, and to remove tuples that were moved by a bucket split. A WAL entry |
| is made for each bucket page from which tuples are removed, and then another |
| WAL entry is made when we clear the needs-split-cleanup flag. If dead tuples |
| are removed, a separate WAL entry is made to update the metapage. |
| |
| As deletion involves multiple atomic operations, it is quite possible that
| the system crashes (a) after removing tuples from only some of the bucket
| pages, (b) before clearing the garbage flag, or (c) before updating the
| metapage. If the
| system crashes before completing (b), it will again try to clean the bucket |
| during next vacuum or insert after recovery which can have some performance |
| impact, but it will work fine. If the system crashes before completing (c), |
| after recovery there could be some additional splits until the next vacuum |
| updates the metapage, but the other operations like insert, delete and scan |
| will work correctly. We can fix this problem by actually updating the |
| metapage based on delete operation during replay, but it's not clear whether |
| it's worth the complication. |
| |
| A squeeze operation moves tuples from pages later in a bucket's chain to
| pages earlier in the chain, and writes a WAL record whenever either the page
| to which it is writing tuples becomes full or the page from which it is
| removing tuples becomes empty.
| |
| As a squeeze operation involves multiple atomic operations, it is quite
| possible that the system crashes before completing the operation on the
| entire bucket. After recovery, the index operations will continue to work
| correctly, but the index will remain bloated, which can impact the
| performance of reads and inserts until the next vacuum squeezes the bucket
| completely.
| |
| |
| Other Notes |
| ----------- |
| |
| Cleanup locks prevent a split from occurring while *another* process is
| stopped in a given bucket. They also ensure that one of our *own* backend's
| scans is not stopped in the bucket.