| src/backend/storage/freespace/README |
| |
| Free Space Map |
| -------------- |
| |
| The purpose of the free space map is to quickly locate a page with enough |
| free space to hold a tuple to be stored; or to determine that no such page |
| exists and the relation must be extended by one page. As of PostgreSQL 8.4 |
| each relation has its own, extensible free space map stored in a separate |
| "fork" of its relation. This eliminates the disadvantages of the former |
| fixed-size FSM. |
| |
| It is important to keep the map small so that it can be searched rapidly. |
| Therefore, we don't attempt to record the exact free space on a page. |
| We allocate one map byte to each page, allowing us to record free space |
| at a granularity of 1/256th of a page. Another way to say it is that |
| the stored value is the free space divided by BLCKSZ/256 (rounding down). |
| We assume that the free space must always be less than BLCKSZ, since |
| all pages have some overhead; so the maximum map value is 255. |
| |
| To assist in fast searching, the map isn't simply an array of per-page |
| entries, but has a tree structure above those entries. There is a tree |
| structure of pages, and a tree structure within each page, as described |
| below. |
| |
| FSM page structure |
| ------------------ |
| |
| Within each FSM page, we use a binary tree structure where leaf nodes store |
| the amount of free space on heap pages (or lower level FSM pages, see |
| "Higher-level structure" below), with one leaf node per heap page. A non-leaf |
| node stores the max amount of free space on any of its children. |
| |
| For example: |
| |
| 4 |
| 4 2 |
| 3 4 0 2 <- This level represents heap pages |
| |
| We need two basic operations: search and update. |
| |
| To search for a page with X amount of free space, traverse down the tree |
| along a path where n >= X, until you hit the bottom. If both children of a |
| node satisfy the condition, you can pick either one arbitrarily. |
| |
| To update the amount of free space on a page to X, first update the leaf node |
| corresponding to the heap page, then "bubble up" the change to upper nodes, |
| by walking up to each parent and recomputing its value as the max of its |
| two children. Repeat until reaching the root or a parent whose value |
| doesn't change. |
| |
| This data structure has a couple of nice properties: |
| - to discover that there is no page with X bytes of free space, you only |
| need to look at the root node |
| - by varying which child to traverse to in the search algorithm, when you have |
| a choice, we can implement various strategies, like preferring pages closer |
| to a given page, or spreading the load across the table. |
| |
| Higher-level routines that use FSM pages access them through the fsm_set_avail() |
| and fsm_search_avail() functions. The interface to those functions hides the |
| page's internal tree structure, treating the FSM page as a black box that has |
| a certain number of "slots" for storing free space information. (However, |
| the higher routines have to be aware of the tree structure of the whole map.) |
| |
| The binary tree is stored on each FSM page as an array. Because the page |
| header takes some space on a page, the binary tree isn't perfect. That is, |
| a few right-most leaf nodes are missing, and there are some useless non-leaf |
| nodes at the right. So the tree looks something like this: |
| |
| 0 |
| 1 2 |
| 3 4 5 6 |
| 7 8 9 A B |
| |
| where the numbers denote each node's position in the array. Note that the |
| tree is guaranteed complete above the leaf level; only some leaf nodes are |
| missing. This is reflected in the number of usable "slots" per page not |
| being an exact power of 2. |
| |
| A FSM page also has a next slot pointer, fp_next_slot, that determines where |
| to start the next search for free space within that page. The reason for that |
| is to spread out the pages that are returned by FSM searches. When several |
| backends are concurrently inserting into a relation, contention can be avoided |
| by having them insert into different pages. But it is also desirable to fill |
| up pages in sequential order, to get the benefit of OS prefetching and batched |
| writes. The FSM is responsible for making that happen, and the next slot |
| pointer helps provide the desired behavior. |
| |
| Higher-level structure |
| ---------------------- |
| |
| To scale up the data structure described above beyond a single page, we |
| maintain a similar tree-structure across pages. Leaf nodes in higher level |
| pages correspond to lower level FSM pages. The root node within each page |
| has the same value as the corresponding leaf node on its parent page. |
| |
| The root page is always stored at physical block 0. |
| |
| For example, assuming each FSM page can hold information about 4 pages (in |
| reality, it holds (BLCKSZ - headers) / 2, or ~4000 with default BLCKSZ), |
| we get a disk layout like this: |
| |
| 0 <-- page 0 at level 2 (root page) |
| 0 <-- page 0 at level 1 |
| 0 <-- page 0 at level 0 |
| 1 <-- page 1 at level 0 |
| 2 <-- ... |
| 3 |
| 1 <-- page 1 at level 1 |
| 4 |
| 5 |
| 6 |
| 7 |
| 2 |
| 8 |
| 9 |
| 10 |
| 11 |
| 3 |
| 12 |
| 13 |
| 14 |
| 15 |
| |
| where the numbers are page numbers *at that level*, starting from 0. |
| |
| To find the physical block # corresponding to leaf page n, we need to |
| count the number of leaf and upper-level pages preceding page n. |
| This turns out to be |
| |
| y = n + (n / F + 1) + (n / F^2 + 1) + ... + 1 |
| |
| where F is the fanout (4 in the above example). The first term n is the number |
| of preceding leaf pages, the second term is the number of pages at level 1, |
| and so forth. |
| |
| To keep things simple, the tree is always constant height. To cover the |
| maximum relation size of 2^32-1 blocks, three levels is enough with the default |
| BLCKSZ (4000^3 > 2^32). |
| |
| Addressing |
| ---------- |
| |
| The higher-level routines operate on "logical" addresses, consisting of |
| - level, |
| - logical page number, and |
| - slot (if applicable) |
| |
| Bottom level FSM pages have level of 0, the level above that 1, and root 2. |
| As in the diagram above, logical page number is the page number at that level, |
| starting from 0. |
| |
| Locking |
| ------- |
| |
| When traversing down to search for free space, only one page is locked at a |
| time: the parent page is released before locking the child. If the child page |
| is concurrently modified, and there no longer is free space on the child page |
| when you land on it, you need to start from scratch (after correcting the |
| parent page, so that you don't get into an infinite loop). |
| |
| We use shared buffer locks when searching, but exclusive buffer lock when |
| updating a page. However, the next slot search pointer is updated during |
| searches even though we have only a shared lock. fp_next_slot is just a hint |
| and we can easily reset it if it gets corrupted; so it seems better to accept |
| some risk of that type than to pay the overhead of exclusive locking. |
| |
| Recovery |
| -------- |
| |
| The FSM is not explicitly WAL-logged. Instead, we rely on a bunch of |
| self-correcting measures to repair possible corruption. As a result when |
| we write to the FSM we treat that as a hint and thus use MarkBufferDirtyHint() |
| rather than MarkBufferDirty(). |
| |
| First of all, whenever a value is set on an FSM page, the root node of the |
| page is compared against the new value after bubbling up the change is |
| finished. It should be greater than or equal to the value just set, or we |
| have a corrupted page, with a parent somewhere with too small a value. |
| Secondly, if we detect corrupted pages while we search, traversing down |
| the tree. That check will notice if a parent node is set to too high a value. |
| In both cases, the upper nodes on the page are immediately rebuilt, fixing |
| the corruption so far as that page is concerned. |
| |
| VACUUM updates all the bottom-level FSM pages with the correct amount of free |
| space on corresponding heap pages, as it proceeds through the heap. This |
| goes through fsm_set_avail(), so that the upper nodes on those pages are |
| immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range] |
| to propagate the new free-space info into the upper pages of the FSM tree. |
| |
| TODO |
| ---- |
| |
| - fastroot to avoid traversing upper nodes with just 1 child |
| - use a different system for tables that fit into one FSM page, with a |
| mechanism to switch to the real thing as it grows. |