$PostgreSQL: pgsql/src/backend/storage/buffer/README,v 1.17 2009/06/22 20:04:28 tgl Exp $

Notes About Shared Buffer Access Rules
======================================

There are two separate access control mechanisms for shared disk buffers:
reference counts (a/k/a pin counts) and buffer content locks. (Actually,
there's a third level of access control: one must hold the appropriate kind
of lock on a relation before one can legally access any page belonging to
the relation. Relation-level locks are not discussed here.)

Pins: one must "hold a pin on" a buffer (increment its reference count)
before being allowed to do anything at all with it. An unpinned buffer is
subject to being reclaimed and reused for a different page at any instant,
so touching it is unsafe. Normally a pin is acquired via ReadBuffer and
released via ReleaseBuffer. It is OK and indeed common for a single
backend to pin a page more than once concurrently; the buffer manager
handles this efficiently. It is considered OK to hold a pin for long
intervals --- for example, sequential scans hold a pin on the current page
until done processing all the tuples on the page, which could be quite a
while if the scan is the outer scan of a join. Similarly, btree index
scans hold a pin on the current index page. This is OK because normal
operations never wait for a page's pin count to drop to zero. (Anything
that might need to do such a wait is instead handled by waiting to obtain
the relation-level lock, which is why you'd better hold one first.) Pins
may not be held across transaction boundaries, however.

Buffer content locks: there are two kinds of buffer lock, shared and exclusive,
which act just as you'd expect: multiple backends can hold shared locks on
the same buffer, but an exclusive lock prevents anyone else from holding
either shared or exclusive lock. (These can alternatively be called READ
and WRITE locks.) These locks are intended to be short-term: they should not
be held for long. Buffer locks are acquired and released by LockBuffer().
It will *not* work for a single backend to try to acquire multiple locks on
the same buffer. One must pin a buffer before trying to lock it.
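
As an illustration, a typical reader follows the pin-then-lock discipline
roughly as below. This is a minimal sketch using the bufmgr calls named
above (ReadBuffer, LockBuffer, ReleaseBuffer); error handling and the
required relation-level lock are omitted.

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    /* Sketch: inspect one page of "reln", holding a pin the whole time. */
    static void
    examine_page(Relation reln, BlockNumber blkno)
    {
        Buffer      buf;
        Page        page;

        buf = ReadBuffer(reln, blkno);          /* acquires a pin */
        LockBuffer(buf, BUFFER_LOCK_SHARE);     /* content lock for scanning */

        page = BufferGetPage(buf);
        /* ... examine tuples on "page" ... */

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* drop the content lock */
        ReleaseBuffer(buf);                     /* drop the pin */
    }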

Buffer access rules:

1. To scan a page for tuples, one must hold a pin and either shared or
exclusive content lock. To examine the commit status (XIDs and status bits)
of a tuple in a shared buffer, one must likewise hold a pin and either shared
or exclusive lock.

2. Once one has determined that a tuple is interesting (visible to the
current transaction) one may drop the content lock, yet continue to access
the tuple's data for as long as one holds the buffer pin. This is what is
typically done by heap scans, since the tuple returned by heap_fetch
contains a pointer to tuple data in the shared buffer. Therefore the
tuple cannot go away while the pin is held (see rule #5). Its state could
change, but that is assumed not to matter after the initial determination
of visibility is made. (A code sketch of this pattern follows rule #5.)

3. To add a tuple or change the xmin/xmax fields of an existing tuple,
one must hold a pin and an exclusive content lock on the containing buffer.
This ensures that no one else might see a partially-updated state of the
tuple while they are doing visibility checks.

4. It is considered OK to update tuple commit status bits (ie, OR the
values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
HEAP_XMAX_INVALID into t_infomask) while holding only a shared lock and
pin on a buffer. This is OK because another backend looking at the tuple
at about the same time would OR the same bits into the field, so there
is little or no risk of conflicting update; what's more, if there did
manage to be a conflict it would merely mean that one bit-update would
be lost and need to be done again later. These four bits are only hints
(they cache the results of transaction status lookups in pg_clog), so no
great harm is done if they get reset to zero by conflicting updates.

5. To physically remove a tuple or compact free space on a page, one
must hold a pin and an exclusive lock, *and* observe while holding the
exclusive lock that the buffer's shared reference count is one (ie,
no other backend holds a pin). If these conditions are met then no other
backend can perform a page scan until the exclusive lock is dropped, and
no other backend can be holding a reference to an existing tuple that it
might expect to examine again. Note that another backend might pin the
buffer (increment the refcount) while one is performing the cleanup, but
it won't be able to actually examine the page until it acquires shared
or exclusive content lock.
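
The pattern permitted by rule #2 looks like this in outline (a hedged
sketch, continuing the example style above; "tuple" here stands for a
pointer into the shared buffer, as returned by heap_fetch):

    buf = ReadBuffer(reln, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    /* ... locate the tuple and decide that it is visible ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);    /* content lock dropped ... */
    /* ... but the pin is still held, so "tuple" remains valid and cannot
     * be removed or compacted away by VACUUM (rule #5); its commit-status
     * hint bits might still change, which per rule #4 does not matter ... */
    ReleaseBuffer(buf);                     /* finished with the tuple */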


Rule #5 only affects VACUUM operations. Obtaining the
necessary lock is done by the bufmgr routine LockBufferForCleanup().
It first gets an exclusive lock and then checks to see if the shared pin
count is currently 1. If not, it releases the exclusive lock (but not the
caller's pin) and waits until signaled by another backend, whereupon it
tries again. The signal will occur when UnpinBuffer decrements the shared
pin count to 1. As indicated above, this operation might have to wait a
good while before it acquires lock, but that shouldn't matter much for
concurrent VACUUM. The current implementation only supports a single
waiter for pin-count-1 on any particular shared buffer. This is enough
for VACUUM's use, since we don't allow multiple VACUUMs concurrently on a
single relation anyway.
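
In outline, LockBufferForCleanup's retry loop looks like this (simplified
pseudocode of the behavior described above, not the actual bufmgr.c code;
the helper names shared_pin_count, register_as_pin_count_waiter, and
wait_for_unpin_signal are made up for the sketch):

    for (;;)
    {
        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
        if (shared_pin_count(buffer) == 1)
            break;                          /* only our own pin remains */
        /* another backend holds a pin: release the lock and wait */
        register_as_pin_count_waiter(buffer);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
        wait_for_unpin_signal();            /* signaled by UnpinBuffer */
        /* loop around and try again */
    }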


Buffer Manager's Internal Locking
---------------------------------

Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
unsurprisingly proved to be a source of contention. The new locking scheme
avoids grabbing system-wide exclusive locks in common code paths. It works
like this:

* There is a system-wide LWLock, the BufMappingLock, that notionally
protects the mapping from buffer tags (page identifiers) to buffers.
(Physically, it can be thought of as protecting the hash table maintained
by buf_table.c.) To look up whether a buffer exists for a tag, it is
sufficient to obtain share lock on the BufMappingLock. Note that one
must pin the found buffer, if any, before releasing the BufMappingLock.
To alter the page assignment of any buffer, one must hold exclusive lock
on the BufMappingLock. This lock must be held across adjusting the buffer's
header fields and changing the buf_table hash table. The only common
operation that needs exclusive lock is reading in a page that was not
in shared buffers already, which will require at least a kernel call
and usually a wait for I/O, so it will be slow anyway.

* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
separate locks, each guarding a portion of the buffer tag space. This allows
further reduction of contention in the normal code paths. The partition
that a particular buffer tag belongs to is determined from the low-order
bits of the tag's hash value. The rules stated above apply to each partition
independently. If it is necessary to lock more than one partition at a time,
they must be locked in partition-number order to avoid risk of deadlock.
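
A lookup in the partitioned mapping table therefore proceeds roughly as
follows. This sketch is modeled on the buf_table.c interface
(BufTableHashCode, BufTableLookup) and the BufMappingPartitionLock macro;
setting up the tag and pinning the found buffer are shown only as comments.

    BufferTag   tag;        /* filled in to identify the desired page */
    uint32      hashcode;
    LWLockId    partitionLock;
    int         buf_id;

    /* INIT_BUFFERTAG-style setup of "tag" omitted */
    hashcode = BufTableHashCode(&tag);
    partitionLock = BufMappingPartitionLock(hashcode);

    LWLockAcquire(partitionLock, LW_SHARED);
    buf_id = BufTableLookup(&tag, hashcode);    /* -1 if not in buffers */
    if (buf_id >= 0)
    {
        /* pin the found buffer *before* releasing the mapping lock */
    }
    LWLockRelease(partitionLock);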

* A separate system-wide LWLock, the BufFreelistLock, provides mutual
exclusion for operations that access the buffer free list or select
buffers for replacement. This is always taken in exclusive mode since
there are no read-only operations on those data structures. The buffer
management policy is designed so that BufFreelistLock need not be taken
except in paths that will require I/O, and thus will be slow anyway.
(Details appear below.) It is never necessary to hold the BufMappingLock
and the BufFreelistLock at the same time.

* Each buffer header contains a spinlock that must be taken when examining
or changing fields of that buffer header. This allows operations such as
ReleaseBuffer to make local state changes without taking any system-wide
lock. We use a spinlock, not an LWLock, since there are no cases where
the lock needs to be held for more than a few instructions.

Note that a buffer header's spinlock does not control access to the data
held within the buffer. Each buffer header also contains an LWLock, the
"buffer content lock", that *does* represent the right to access the data
in the buffer. It is used per the rules above.
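
For example, releasing a pin can adjust the shared state under nothing more
than the header spinlock. The following is a simplified sketch in the
spirit of UnpinBuffer (the real routine also tracks a backend-local
reference count and wakes any pin-count waiter); the usage-count increment
follows the description in the replacement-strategy section below.

    LockBufHdr(bufHdr);                     /* buffer header spinlock */
    Assert(bufHdr->refcount > 0);
    bufHdr->refcount--;                     /* drop our pin */
    if (bufHdr->usage_count < BM_MAX_USAGE_COUNT)
        bufHdr->usage_count++;              /* feed the clock-sweep algorithm */
    UnlockBufHdr(bufHdr);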

There is yet another set of per-buffer LWLocks, the io_in_progress locks,
that are used to wait for I/O on a buffer to complete. The process doing
a read or write takes exclusive lock for the duration, and processes that
need to wait for completion try to take shared locks (which they release
immediately upon obtaining). XXX on systems where an LWLock represents
nontrivial resources, it's fairly annoying to need so many locks. Possibly
we could use per-backend LWLocks instead (a buffer header would then contain
a field to show which backend is doing its I/O).
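
The waiting protocol is roughly the following sketch (the real code also
rechecks the buffer's I/O-in-progress state under the header spinlock
before and after waiting):

    /* Backend performing the I/O: */
    LWLockAcquire(bufHdr->io_in_progress_lock, LW_EXCLUSIVE);
    /* ... read or write the page ... */
    LWLockRelease(bufHdr->io_in_progress_lock);

    /* Backend waiting for that I/O to complete: */
    LWLockAcquire(bufHdr->io_in_progress_lock, LW_SHARED);  /* blocks until done */
    LWLockRelease(bufHdr->io_in_progress_lock);             /* released at once */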


Normal Buffer Replacement Strategy
----------------------------------

There is a "free list" of buffers that are prime candidates for replacement.
In particular, buffers that are completely free (contain no valid page) are
always in this list. We could also throw buffers into this list if we
consider their pages unlikely to be needed soon; however, the current
algorithm never does that. The list is singly-linked using fields in the
buffer headers; we maintain head and tail pointers in global variables.
(Note: although the list links are in the buffer headers, they are
considered to be protected by the BufFreelistLock, not the buffer-header
spinlocks.) To choose a victim buffer to recycle when there are no free
buffers available, we use a simple clock-sweep algorithm, which avoids the
need to take system-wide locks during common operations. It works like
this:

Each buffer header contains a usage counter, which is incremented (up to a
small limit value) whenever the buffer is unpinned. (This requires only the
buffer header spinlock, which would have to be taken anyway to decrement the
buffer reference count, so it's nearly free.)

The "clock hand" is a buffer index, NextVictimBuffer, that moves circularly
through all the available buffers. NextVictimBuffer is protected by the
BufFreelistLock.

The algorithm for a process that needs to obtain a victim buffer is:

1. Obtain BufFreelistLock.

2. If buffer free list is nonempty, remove its head buffer. If the buffer
is pinned or has a nonzero usage count, it cannot be used; ignore it and
return to the start of step 2. Otherwise, pin the buffer, release
BufFreelistLock, and return the buffer.

3. Otherwise, select the buffer pointed to by NextVictimBuffer, and
circularly advance NextVictimBuffer for next time.

4. If the selected buffer is pinned or has a nonzero usage count, it cannot
be used. Decrement its usage count (if nonzero) and return to step 3 to
examine the next buffer.

5. Pin the selected buffer, release BufFreelistLock, and return the buffer.

(Note that if the selected buffer is dirty, we will have to write it out
before we can recycle it; if someone else pins the buffer meanwhile we will
have to give up and try another buffer. This however is not a concern
of the basic select-a-victim-buffer algorithm.)
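
Put together, victim selection looks roughly like the following pseudocode
sketch of steps 1-5 (not the actual StrategyGetBuffer code; free-list
manipulation, buffer pinning, and header locking are abbreviated):

    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);

    while (free list is nonempty)
    {
        buf = remove head of free list;
        if (buf->refcount == 0 && buf->usage_count == 0)
        {
            pin buf;
            LWLockRelease(BufFreelistLock);
            return buf;
        }
        /* otherwise ignore it and look at the next free-list entry */
    }

    for (;;)
    {
        buf = &BufferDescriptors[NextVictimBuffer];
        NextVictimBuffer = (NextVictimBuffer + 1) % NBuffers;  /* advance hand */
        if (buf->refcount == 0 && buf->usage_count == 0)
        {
            pin buf;
            LWLockRelease(BufFreelistLock);
            return buf;
        }
        if (buf->usage_count > 0)
            buf->usage_count--;             /* tick the hand past this buffer */
    }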


Buffer Ring Replacement Strategy
--------------------------------

When running a query that needs to access a large number of pages just once,
such as VACUUM or a large sequential scan, a different strategy is used.
A page that has been touched only by such a scan is unlikely to be needed
again soon, so instead of running the normal clock sweep algorithm and
blowing out the entire buffer cache, a small ring of buffers is allocated
using the normal clock sweep algorithm and those buffers are reused for the
whole scan. This also implies that much of the write traffic caused by such
a statement will be done by the backend itself and not pushed off onto other
processes.

For sequential scans, a 256KB ring is used. That's small enough to fit in L2
cache, which makes transferring pages from OS cache to shared buffer cache
efficient. Even less would often be enough, but the ring must be big enough
to accommodate all pages in the scan that are pinned concurrently. 256KB
should also be enough to leave a small cache trail for other backends to
join in a synchronized seq scan. If a ring buffer is dirtied and its LSN
updated, we would normally have to write and flush WAL before we could
re-use the buffer; in this case we instead discard the buffer from the ring
and (later) choose a replacement using the normal clock-sweep algorithm.
Hence this strategy works best for scans that are read-only (or at worst
update hint bits). In a scan that modifies every page in the scan, like a
bulk UPDATE or DELETE, the buffers in the ring will always be dirtied and
the ring strategy effectively degrades to the normal strategy.

VACUUM uses a 256KB ring like sequential scans, but dirty pages are not
removed from the ring. Instead, WAL is flushed if needed to allow reuse of
the buffers. Before introducing the buffer ring strategy in 8.3, VACUUM's
buffers were sent to the freelist, which was effectively a buffer ring of 1
buffer, resulting in excessive WAL flushing. Allowing VACUUM to update
256KB between WAL flushes should be more efficient.

Bulk writes work similarly to VACUUM. Currently this applies only to
COPY IN and CREATE TABLE AS SELECT. (Might it be interesting to make
seqscan UPDATE and DELETE use the bulkwrite strategy?) For bulk writes
we use a ring size of 16MB (but not more than 1/8th of shared_buffers).
Smaller sizes have been shown to result in the COPY blocking too often
for WAL flushes. While it's okay for a background vacuum to be slowed by
doing its own WAL flushing, we'd prefer that COPY not be subject to that,
so we let it use up a bit more of the buffer arena.
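
All of these ring sizes are chosen in one place when an access strategy
object is created. The following sketch is modeled on GetAccessStrategy()
in freelist.c (allocation of the ring itself is omitted; BLCKSZ is the page
size and NBuffers the number of shared buffers):

    switch (btype)
    {
        case BAS_BULKREAD:      /* large sequential scans */
            ring_size = 256 * 1024 / BLCKSZ;
            break;
        case BAS_VACUUM:        /* VACUUM */
            ring_size = 256 * 1024 / BLCKSZ;
            break;
        case BAS_BULKWRITE:     /* COPY IN, CREATE TABLE AS SELECT */
            ring_size = 16 * 1024 * 1024 / BLCKSZ;
            break;
        default:                /* BAS_NORMAL: no ring, use the shared arena */
            return NULL;
    }
    ring_size = Min(NBuffers / 8, ring_size);   /* never more than 1/8th */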


Background Writer's Processing
------------------------------

The background writer is designed to write out pages that are likely to be
recycled soon, thereby offloading the writing work from active backends.
To do this, it scans forward circularly from the current position of
NextVictimBuffer (which it does not change!), looking for buffers that are
dirty and neither pinned nor marked with a positive usage count. It pins,
writes, and releases any such buffer.

If we can assume that reading NextVictimBuffer is an atomic action, then
the writer doesn't even need to take the BufFreelistLock in order to look
for buffers to write; it needs only to spinlock each buffer header for long
enough to check the dirty bit. Even without that assumption, the writer
only needs to take the lock long enough to read the variable value, not
while scanning the buffers. (This is a very substantial improvement in
the contention cost of the writer compared to PG 8.0.)
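
Per buffer, the background writer's decision amounts to roughly the
following (a simplified sketch in the spirit of SyncOneBuffer in bufmgr.c;
the real code pins the buffer before giving up the header spinlock, and the
write itself is reduced to a comment here):

    bool        should_write;

    LockBufHdr(bufHdr);                     /* examine header fields safely */
    should_write = (bufHdr->refcount == 0 &&
                    bufHdr->usage_count == 0 &&
                    (bufHdr->flags & BM_DIRTY) != 0);
    UnlockBufHdr(bufHdr);

    if (should_write)
    {
        /* pin the buffer, take shared content lock, write the page out,
         * then release the lock and the pin */
    }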

During a checkpoint, the writer's strategy must be to write every dirty
buffer (pinned or not!). We may as well make it start this scan from
NextVictimBuffer, however, so that the first-to-be-written pages are the
ones that backends might otherwise have to write for themselves soon.

The background writer takes shared content lock on a buffer while writing it
out (and anyone else who flushes buffer contents to disk must do so too).
This ensures that the page image transferred to disk is reasonably consistent.
We might miss a hint-bit update or two but that isn't a problem, for the same
reasons mentioned under buffer access rules.

As of 8.4, the background writer starts during recovery mode when there is
some form of potentially extended recovery to perform. It performs an
identical service to normal processing, except that the checkpoints it
writes are technically restartpoints.