| $PostgreSQL: pgsql/src/backend/access/nbtree/README,v 1.20 2008/03/21 13:23:27 momjian Exp $ |
| |
| Btree Indexing |
| ============== |
| |
| This directory contains a correct implementation of Lehman and Yao's |
| high-concurrency B-tree management algorithm (P. Lehman and S. Yao, |
| Efficient Locking for Concurrent Operations on B-Trees, ACM Transactions |
| on Database Systems, Vol 6, No. 4, December 1981, pp 650-670). We also |
| use a simplified version of the deletion logic described in Lanin and |
| Shasha (V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm, |
| Proceedings of 1986 Fall Joint Computer Conference, pp 380-389). |
| |
| The Lehman and Yao Algorithm and Insertions |
| ------------------------------------------- |
| |
| We have made the following changes in order to incorporate the L&Y algorithm |
| into Postgres: |
| |
| The requirement that all btree keys be unique is too onerous, |
| but the algorithm won't work correctly without it. Fortunately, it is |
| only necessary that keys be unique on a single tree level, because L&Y |
| only use the assumption of key uniqueness when re-finding a key in a |
| parent page (to determine where to insert the key for a split page). |
| Therefore, we can use the link field to disambiguate multiple |
| occurrences of the same user key: only one entry in the parent level |
| will be pointing at the page we had split. (Indeed we need not look at |
| the real "key" at all, just at the link field.) We can distinguish |
| items at the leaf level in the same way, by examining their links to |
| heap tuples; we'd never have two items for the same heap tuple. |
| |
| Lehman and Yao assume that the key range for a subtree S is described |
| by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent |
| page. This does not work for nonunique keys (for example, if we have |
| enough equal keys to spread across several leaf pages, there *must* be |
| some equal bounding keys in the first level up). Therefore we assume |
| Ki <= v <= Ki+1 instead. A search that finds exact equality to a |
| bounding key in an upper tree level must descend to the left of that |
| key to ensure it finds any equal keys in the preceding page. An |
| insertion that sees the high key of its target page is equal to the key |
| to be inserted has a choice whether or not to move right, since the new |
| key could go on either page. (Currently, we try to find a page where |
| there is room for the new key without a split.) |
| |
| Lehman and Yao don't require read locks, but assume that in-memory |
| copies of tree pages are unshared. Postgres shares in-memory buffers |
| among backends. As a result, we do page-level read locking on btree |
| pages in order to guarantee that no record is modified while we are |
| examining it. This reduces concurrency but guaranteees correct |
| behavior. An advantage is that when trading in a read lock for a |
| write lock, we need not re-read the page after getting the write lock. |
| Since we're also holding a pin on the shared buffer containing the |
| page, we know that buffer still contains the page and is up-to-date. |
| |
| We support the notion of an ordered "scan" of an index as well as |
| insertions, deletions, and simple lookups. A scan in the forward |
| direction is no problem, we just use the right-sibling pointers that |
| L&Y require anyway. (Thus, once we have descended the tree to the |
| correct start point for the scan, the scan looks only at leaf pages |
| and never at higher tree levels.) To support scans in the backward |
| direction, we also store a "left sibling" link much like the "right |
| sibling". (This adds an extra step to the L&Y split algorithm: while |
| holding the write lock on the page being split, we also lock its former |
| right sibling to update that page's left-link. This is safe since no |
| writer of that page can be interested in acquiring a write lock on our |
| page.) A backwards scan has one additional bit of complexity: after |
| following the left-link we must account for the possibility that the |
| left sibling page got split before we could read it. So, we have to |
| move right until we find a page whose right-link matches the page we |
| came from. (Actually, it's even harder than that; see deletion discussion |
| below.) |
| |
| Page read locks are held only for as long as a scan is examining a page. |
| To minimize lock/unlock traffic, an index scan always searches a leaf page |
| to identify all the matching items at once, copying their heap tuple IDs |
| into backend-local storage. The heap tuple IDs are then processed while |
| not holding any page lock within the index. We do continue to hold a pin |
| on the leaf page, to protect against concurrent deletions (see below). |
| In this state the scan is effectively stopped "between" pages, either |
| before or after the page it has pinned. This is safe in the presence of |
| concurrent insertions and even page splits, because items are never moved |
| across pre-existing page boundaries --- so the scan cannot miss any items |
| it should have seen, nor accidentally return the same item twice. The scan |
| must remember the page's right-link at the time it was scanned, since that |
| is the page to move right to; if we move right to the current right-link |
| then we'd re-scan any items moved by a page split. We don't similarly |
| remember the left-link, since it's best to use the most up-to-date |
| left-link when trying to move left (see detailed move-left algorithm below). |
| |
| In most cases we release our lock and pin on a page before attempting |
| to acquire pin and lock on the page we are moving to. In a few places |
| it is necessary to lock the next page before releasing the current one. |
| This is safe when moving right or up, but not when moving left or down |
| (else we'd create the possibility of deadlocks). |
| |
| Lehman and Yao fail to discuss what must happen when the root page |
| becomes full and must be split. Our implementation is to split the |
| root in the same way that any other page would be split, then construct |
| a new root page holding pointers to both of the resulting pages (which |
| now become siblings on the next level of the tree). The new root page |
| is then installed by altering the root pointer in the meta-data page (see |
| below). This works because the root is not treated specially in any |
| other way --- in particular, searches will move right using its link |
| pointer if the link is set. Therefore, searches will find the data |
| that's been moved into the right sibling even if they read the meta-data |
| page before it got updated. This is the same reasoning that makes a |
| split of a non-root page safe. The locking considerations are similar too. |
| |
| When an inserter recurses up the tree, splitting internal pages to insert |
| links to pages inserted on the level below, it is possible that it will |
| need to access a page above the level that was the root when it began its |
| descent (or more accurately, the level that was the root when it read the |
| meta-data page). In this case the stack it made while descending does not |
| help for finding the correct page. When this happens, we find the correct |
| place by re-descending the tree until we reach the level one above the |
| level we need to insert a link to, and then moving right as necessary. |
| (Typically this will take only two fetches, the meta-data page and the new |
| root, but in principle there could have been more than one root split |
| since we saw the root. We can identify the correct tree level by means of |
| the level numbers stored in each page. The situation is rare enough that |
| we do not need a more efficient solution.) |
| |
| Lehman and Yao assume fixed-size keys, but we must deal with |
| variable-size keys. Therefore there is not a fixed maximum number of |
| keys per page; we just stuff in as many as will fit. When we split a |
| page, we try to equalize the number of bytes, not items, assigned to |
| each of the resulting pages. Note we must include the incoming item in |
| this calculation, otherwise it is possible to find that the incoming |
| item doesn't fit on the split page where it needs to go! |
| |
| The Deletion Algorithm |
| ---------------------- |
| |
| Before deleting a leaf item, we get a super-exclusive lock on the target |
| page, so that no other backend has a pin on the page when the deletion |
| starts. This is not necessary for correctness in terms of the btree index |
| operations themselves; as explained above, index scans logically stop |
| "between" pages and so can't lose their place. The reason we do it is to |
| provide an interlock between non-full VACUUM and indexscans. Since VACUUM |
| deletes index entries before deleting tuples, the super-exclusive lock |
| guarantees that VACUUM can't delete any heap tuple that an indexscanning |
| process might be about to visit. (This guarantee works only for simple |
| indexscans that visit the heap in sync with the index scan, not for bitmap |
| scans. We only need the guarantee when using non-MVCC snapshot rules such |
| as SnapshotNow, so in practice this is only important for system catalog |
| accesses.) |
| |
| Because a page can be split even while someone holds a pin on it, it is |
| possible that an indexscan will return items that are no longer stored on |
| the page it has a pin on, but rather somewhere to the right of that page. |
| To ensure that VACUUM can't prematurely remove such heap tuples, we require |
| btbulkdelete to obtain super-exclusive lock on every leaf page in the index, |
| even pages that don't contain any deletable tuples. This guarantees that |
| the btbulkdelete call cannot return while any indexscan is still holding |
| a copy of a deleted index tuple. Note that this requirement does not say |
| that btbulkdelete must visit the pages in any particular order. (See also |
| on-the-fly deletion, below.) |
| |
| There is no such interlocking for deletion of items in internal pages, |
| since backends keep no lock nor pin on a page they have descended past. |
| Hence, when a backend is ascending the tree using its stack, it must |
| be prepared for the possibility that the item it wants is to the left of |
| the recorded position (but it can't have moved left out of the recorded |
| page). Since we hold a lock on the lower page (per L&Y) until we have |
| re-found the parent item that links to it, we can be assured that the |
| parent item does still exist and can't have been deleted. Also, because |
| we are matching downlink page numbers and not data keys, we don't have any |
| problem with possibly misidentifying the parent item. |
| |
| We consider deleting an entire page from the btree only when it's become |
| completely empty of items. (Merging partly-full pages would allow better |
| space reuse, but it seems impractical to move existing data items left or |
| right to make this happen --- a scan moving in the opposite direction |
| might miss the items if so. We could do it during VACUUM FULL, though.) |
| Also, we *never* delete the rightmost page on a tree level (this |
| restriction simplifies the traversal algorithms, as explained below). |
| |
| To delete an empty page, we acquire write lock on its left sibling (if |
| any), the target page itself, the right sibling (there must be one), and |
| the parent page, in that order. The parent page must be found using the |
| same type of search as used to find the parent during an insertion split. |
| Then we update the side-links in the siblings, mark the target page |
| deleted, and remove the downlink from the parent, as well as the parent's |
| upper bounding key for the target (the one separating it from its right |
| sibling). This causes the target page's key space to effectively belong |
| to its right sibling. (Neither the left nor right sibling pages need to |
| change their "high key" if any; so there is no problem with possibly not |
| having enough space to replace a high key.) The side-links in the target |
| page are not changed. |
| |
| (Note: Lanin and Shasha prefer to make the key space move left, but their |
| argument for doing so hinges on not having left-links, which we have |
| anyway. So we simplify the algorithm by moving key space right.) |
| |
| To preserve consistency on the parent level, we cannot merge the key space |
| of a page into its right sibling unless the right sibling is a child of |
| the same parent --- otherwise, the parent's key space assignment changes |
| too, meaning we'd have to make bounding-key updates in its parent, and |
| perhaps all the way up the tree. Since we can't possibly do that |
| atomically, we forbid this case. That means that the rightmost child of a |
| parent node can't be deleted unless it's the only remaining child. |
| |
| When we delete the last remaining child of a parent page, we mark the |
| parent page "half-dead" as part of the atomic update that deletes the |
| child page. This implicitly transfers the parent's key space to its right |
| sibling (which it must have, since we never delete the overall-rightmost |
| page of a level). Searches ignore the half-dead page and immediately move |
| right. We need not worry about insertions into a half-dead page --- insertions |
| into upper tree levels happen only as a result of splits of child pages, and |
| the half-dead page no longer has any children that could split. Therefore |
| the page stays empty even when we don't have lock on it, and we can complete |
| its deletion in a second atomic action. |
| |
| The notion of a half-dead page means that the key space relationship between |
| the half-dead page's level and its parent's level may be a little out of |
| whack: key space that appears to belong to the half-dead page's parent on the |
| parent level may really belong to its right sibling. To prevent any possible |
| problems, we hold lock on the deleted child page until we have finished |
| deleting any now-half-dead parent page(s). This prevents any insertions into |
| the transferred keyspace until the operation is complete. The reason for |
| doing this is that a sufficiently large number of insertions into the |
| transferred keyspace, resulting in multiple page splits, could propagate keys |
| from that keyspace into the parent level, resulting in transiently |
| out-of-order keys in that level. It is thought that that wouldn't cause any |
| serious problem, but it seems too risky to allow. |
| |
| A deleted page cannot be reclaimed immediately, since there may be other |
| processes waiting to reference it (ie, search processes that just left the |
| parent, or scans moving right or left from one of the siblings). These |
| processes must observe that the page is marked dead and recover |
| accordingly. Searches and forward scans simply follow the right-link |
| until they find a non-dead page --- this will be where the deleted page's |
| key-space moved to. |
| |
| Moving left in a backward scan is complicated because we must consider |
| the possibility that the left sibling was just split (meaning we must find |
| the rightmost page derived from the left sibling), plus the possibility |
| that the page we were just on has now been deleted and hence isn't in the |
| sibling chain at all anymore. So the move-left algorithm becomes: |
| 0. Remember the page we are on as the "original page". |
| 1. Follow the original page's left-link (we're done if this is zero). |
| 2. If the current page is live and its right-link matches the "original |
| page", we are done. |
| 3. Otherwise, move right one or more times looking for a live page whose |
| right-link matches the "original page". If found, we are done. (In |
| principle we could scan all the way to the right end of the index, but |
| in practice it seems better to give up after a small number of tries. |
| It's unlikely the original page's sibling split more than a few times |
| while we were in flight to it; if we do not find a matching link in a |
| few tries, then most likely the original page is deleted.) |
| 4. Return to the "original page". If it is still live, return to step 1 |
| (we guessed wrong about it being deleted, and should restart with its |
| current left-link). If it is dead, move right until a non-dead page |
| is found (there must be one, since rightmost pages are never deleted), |
| mark that as the new "original page", and return to step 1. |
| This algorithm is correct because the live page found by step 4 will have |
| the same left keyspace boundary as the page we started from. Therefore, |
| when we ultimately exit, it must be on a page whose right keyspace |
| boundary matches the left boundary of where we started --- which is what |
| we need to be sure we don't miss or re-scan any items. |
| |
| A deleted page can only be reclaimed once there is no scan or search that |
| has a reference to it; until then, it must stay in place with its |
| right-link undisturbed. We implement this by waiting until all |
| transactions that were running at the time of deletion are dead; which is |
| overly strong, but is simple to implement within Postgres. When marked |
| dead, a deleted page is labeled with the next-transaction counter value. |
| VACUUM can reclaim the page for re-use when this transaction number is |
| older than the oldest open transaction. (NOTE: VACUUM FULL can reclaim |
| such pages immediately.) |
| |
| Reclaiming a page doesn't actually change its state on disk --- we simply |
| record it in the shared-memory free space map, from which it will be |
| handed out the next time a new page is needed for a page split. The |
| deleted page's contents will be overwritten by the split operation. |
| (Note: if we find a deleted page with an extremely old transaction |
| number, it'd be worthwhile to re-mark it with FrozenTransactionId so that |
| a later xid wraparound can't cause us to think the page is unreclaimable. |
| But in more normal situations this would be a waste of a disk write.) |
| |
| Because we never delete the rightmost page of any level (and in particular |
| never delete the root), it's impossible for the height of the tree to |
| decrease. After massive deletions we might have a scenario in which the |
| tree is "skinny", with several single-page levels below the root. |
| Operations will still be correct in this case, but we'd waste cycles |
| descending through the single-page levels. To handle this we use an idea |
| from Lanin and Shasha: we keep track of the "fast root" level, which is |
| the lowest single-page level. The meta-data page keeps a pointer to this |
| level as well as the true root. All ordinary operations initiate their |
| searches at the fast root not the true root. When we split a page that is |
| alone on its level or delete the next-to-last page on a level (both cases |
| are easily detected), we have to make sure that the fast root pointer is |
| adjusted appropriately. In the split case, we do this work as part of the |
| atomic update for the insertion into the parent level; in the delete case |
| as part of the atomic update for the delete (either way, the metapage has |
| to be the last page locked in the update to avoid deadlock risks). This |
| avoids race conditions if two such operations are executing concurrently. |
| |
| VACUUM needs to do a linear scan of an index to search for deleted pages |
| that can be reclaimed because they are older than all open transactions. |
| For efficiency's sake, we'd like to use the same linear scan to search for |
| deletable tuples. Before Postgres 8.2, btbulkdelete scanned the leaf pages |
| in index order, but it is possible to visit them in physical order instead. |
| The tricky part of this is to avoid missing any deletable tuples in the |
| presence of concurrent page splits: a page split could easily move some |
| tuples from a page not yet passed over by the sequential scan to a |
| lower-numbered page already passed over. (This wasn't a concern for the |
| index-order scan, because splits always split right.) To implement this, |
| we provide a "vacuum cycle ID" mechanism that makes it possible to |
| determine whether a page has been split since the current btbulkdelete |
| cycle started. If btbulkdelete finds a page that has been split since |
| it started, and has a right-link pointing to a lower page number, then |
| it temporarily suspends its sequential scan and visits that page instead. |
| It must continue to follow right-links and vacuum dead tuples until |
| reaching a page that either hasn't been split since btbulkdelete started, |
| or is above the location of the outer sequential scan. Then it can resume |
| the sequential scan. This ensures that all tuples are visited. It may be |
| that some tuples are visited twice, but that has no worse effect than an |
| inaccurate index tuple count (and we can't guarantee an accurate count |
| anyway in the face of concurrent activity). Note that this still works |
| if the has-been-recently-split test has a small probability of false |
| positives, so long as it never gives a false negative. This makes it |
| possible to implement the test with a small counter value stored on each |
| index page. |
| |
| On-the-Fly Deletion Of Index Tuples |
| ----------------------------------- |
| |
| If a process visits a heap tuple and finds that it's dead and removable |
| (ie, dead to all open transactions, not only that process), then we can |
| return to the index and mark the corresponding index entry "known dead", |
| allowing subsequent index scans to skip visiting the heap tuple. The |
| "known dead" marking works by setting the index item's lp_flags state |
| to LP_DEAD. This is currently only done in plain indexscans, not bitmap |
| scans, because only plain scans visit the heap and index "in sync" and so |
| there's not a convenient way to do it for bitmap scans. |
| |
| Once an index tuple has been marked LP_DEAD it can actually be removed |
| from the index immediately; since index scans only stop "between" pages, |
| no scan can lose its place from such a deletion. We separate the steps |
| because we allow LP_DEAD to be set with only a share lock (it's exactly |
| like a hint bit for a heap tuple), but physically removing tuples requires |
| exclusive lock. In the current code we try to remove LP_DEAD tuples when |
| we are otherwise faced with having to split a page to do an insertion (and |
| hence have exclusive lock on it already). |
| |
| This leaves the index in a state where it has no entry for a dead tuple |
| that still exists in the heap. This is not a problem for the current |
| implementation of VACUUM, but it could be a problem for anything that |
| explicitly tries to find index entries for dead tuples. (However, the |
| same situation is created by REINDEX, since it doesn't enter dead |
| tuples into the index.) |
| |
| It's sufficient to have an exclusive lock on the index page, not a |
| super-exclusive lock, to do deletion of LP_DEAD items. It might seem |
| that this breaks the interlock between VACUUM and indexscans, but that is |
| not so: as long as an indexscanning process has a pin on the page where |
| the index item used to be, VACUUM cannot complete its btbulkdelete scan |
| and so cannot remove the heap tuple. This is another reason why |
| btbulkdelete has to get super-exclusive lock on every leaf page, not only |
| the ones where it actually sees items to delete. |
| |
| WAL Considerations |
| ------------------ |
| |
| The insertion and deletion algorithms in themselves don't guarantee btree |
| consistency after a crash. To provide robustness, we depend on WAL |
| replay. A single WAL entry is effectively an atomic action --- we can |
| redo it from the log if it fails to complete. |
| |
| Ordinary item insertions (that don't force a page split) are of course |
| single WAL entries, since they only affect one page. The same for |
| leaf-item deletions (if the deletion brings the leaf page to zero items, |
| it is now a candidate to be deleted, but that is a separate action). |
| |
| An insertion that causes a page split is logged as a single WAL entry for |
| the changes occuring on the insertion's level --- including update of the |
| right sibling's left-link --- followed by a second WAL entry for the |
| insertion on the parent level (which might itself be a page split, requiring |
| an additional insertion above that, etc). |
| |
| For a root split, the followon WAL entry is a "new root" entry rather than |
| an "insertion" entry, but details are otherwise much the same. |
| |
| Because insertion involves multiple atomic actions, the WAL replay logic |
| has to detect the case where a page split isn't followed by a matching |
| insertion on the parent level, and then do that insertion on its own (and |
| recursively for any subsequent parent insertion, of course). This is |
| feasible because the WAL entry for the split contains enough info to know |
| what must be inserted in the parent level. |
| |
| When splitting a non-root page that is alone on its level, the required |
| metapage update (of the "fast root" link) is performed and logged as part |
| of the insertion into the parent level. When splitting the root page, the |
| metapage update is handled as part of the "new root" action. |
| |
| A page deletion is logged as a single WAL entry covering all four |
| required page updates (target page, left and right siblings, and parent) |
| as an atomic event. (Any required fast-root link update is also part |
| of the WAL entry.) If the parent page becomes half-dead but is not |
| immediately deleted due to a subsequent crash, there is no loss of |
| consistency, and the empty page will be picked up by the next VACUUM. |
| |
| Other Things That Are Handy to Know |
| ----------------------------------- |
| |
| Page zero of every btree is a meta-data page. This page stores the |
| location of the root page --- both the true root and the current effective |
| root ("fast" root). To avoid fetching the metapage for every single index |
| search, we cache a copy of the meta-data information in the index's |
| relcache entry (rd_amcache). This is a bit ticklish since using the cache |
| implies following a root page pointer that could be stale. We require |
| every metapage update to send out a SI "relcache inval" message on the |
| index relation. That ensures that each backend will flush its cached copy |
| not later than the start of its next transaction. Therefore, stale |
| pointers cannot be used for longer than the current transaction, which |
| reduces the problem to the same one already dealt with for concurrent |
| VACUUM --- we can just imagine that each open transaction is potentially |
| "already in flight" to the old root. |
| |
| The algorithm assumes we can fit at least three items per page |
| (a "high key" and two real data items). Therefore it's unsafe |
| to accept items larger than 1/3rd page size. Larger items would |
| work sometimes, but could cause failures later on depending on |
| what else gets put on their page. |
| |
| "ScanKey" data structures are used in two fundamentally different ways |
| in this code, which we describe as "search" scankeys and "insertion" |
| scankeys. A search scankey is the kind passed to btbeginscan() or |
| btrescan() from outside the btree code. The sk_func pointers in a search |
| scankey point to comparison functions that return boolean, such as int4lt. |
| There might be more than one scankey entry for a given index column, or |
| none at all. (We require the keys to appear in index column order, but |
| the order of multiple keys for a given column is unspecified.) An |
| insertion scankey uses the same array-of-ScanKey data structure, but the |
| sk_func pointers point to btree comparison support functions (ie, 3-way |
| comparators that return int4 values interpreted as <0, =0, >0). In an |
| insertion scankey there is exactly one entry per index column. Insertion |
| scankeys are built within the btree code (eg, by _bt_mkscankey()) and are |
| used to locate the starting point of a scan, as well as for locating the |
| place to insert a new index tuple. (Note: in the case of an insertion |
| scankey built from a search scankey, there might be fewer keys than |
| index columns, indicating that we have no constraints for the remaining |
| index columns.) After we have located the starting point of a scan, the |
| original search scankey is consulted as each index entry is sequentially |
| scanned to decide whether to return the entry and whether the scan can |
| stop (see _bt_checkkeys()). |
| |
| Notes About Data Representation |
| ------------------------------- |
| |
| The right-sibling link required by L&Y is kept in the page "opaque |
| data" area, as is the left-sibling link, the page level, and some flags. |
| The page level counts upwards from zero at the leaf level, to the tree |
| depth minus 1 at the root. (Counting up from the leaves ensures that we |
| don't need to renumber any existing pages when splitting the root.) |
| |
| The Postgres disk block data format (an array of items) doesn't fit |
| Lehman and Yao's alternating-keys-and-pointers notion of a disk page, |
| so we have to play some games. |
| |
| On a page that is not rightmost in its tree level, the "high key" is |
| kept in the page's first item, and real data items start at item 2. |
| The link portion of the "high key" item goes unused. A page that is |
| rightmost has no "high key", so data items start with the first item. |
| Putting the high key at the left, rather than the right, may seem odd, |
| but it avoids moving the high key as we add data items. |
| |
| On a leaf page, the data items are simply links to (TIDs of) tuples |
| in the relation being indexed, with the associated key values. |
| |
| On a non-leaf page, the data items are down-links to child pages with |
| bounding keys. The key in each data item is the *lower* bound for |
| keys on that child page, so logically the key is to the left of that |
| downlink. The high key (if present) is the upper bound for the last |
| downlink. The first data item on each such page has no lower bound |
| --- or lower bound of minus infinity, if you prefer. The comparison |
| routines must treat it accordingly. The actual key stored in the |
| item is irrelevant, and need not be stored at all. This arrangement |
| corresponds to the fact that an L&Y non-leaf page has one more pointer |
| than key. |
| |
| Notes to Operator Class Implementors |
| ------------------------------------ |
| |
| With this implementation, we require each supported combination of |
| datatypes to supply us with a comparison procedure via pg_amproc. |
| This procedure must take two nonnull values A and B and return an int32 < 0, |
| 0, or > 0 if A < B, A = B, or A > B, respectively. The procedure must |
| not return INT_MIN for "A < B", since the value may be negated before |
| being tested for sign. A null result is disallowed, too. See nbtcompare.c |
| for examples. |
| |
| There are some basic assumptions that a btree operator family must satisfy: |
| |
| An = operator must be an equivalence relation; that is, for all non-null |
| values A,B,C of the datatype: |
| |
| A = A is true reflexive law |
| if A = B, then B = A symmetric law |
| if A = B and B = C, then A = C transitive law |
| |
| A < operator must be a strong ordering relation; that is, for all non-null |
| values A,B,C: |
| |
| A < A is false irreflexive law |
| if A < B and B < C, then A < C transitive law |
| |
| Furthermore, the ordering is total; that is, for all non-null values A,B: |
| |
| exactly one of A < B, A = B, and B < A is true trichotomy law |
| |
| (The trichotomy law justifies the definition of the comparison support |
| procedure, of course.) |
| |
| The other three operators are defined in terms of these two in the obvious way, |
| and must act consistently with them. |
| |
| For an operator family supporting multiple datatypes, the above laws must hold |
| when A,B,C are taken from any datatypes in the family. The transitive laws |
| are the trickiest to ensure, as in cross-type situations they represent |
| statements that the behaviors of two or three different operators are |
| consistent. As an example, it would not work to put float8 and numeric into |
| an opfamily, at least not with the current semantics that numerics are |
| converted to float8 for comparison to a float8. Because of the limited |
| accuracy of float8, this means there are distinct numeric values that will |
| compare equal to the same float8 value, and thus the transitive law fails. |
| |
| It should be fairly clear why a btree index requires these laws to hold within |
| a single datatype: without them there is no ordering to arrange the keys with. |
| Also, index searches using a key of a different datatype require comparisons |
| to behave sanely across two datatypes. The extensions to three or more |
| datatypes within a family are not strictly required by the btree index |
| mechanism itself, but the planner relies on them for optimization purposes. |