| <!-- doc/src/sgml/btree.sgml --> |
| |
| <chapter id="btree"> |
| <title>B-Tree Indexes</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>B-Tree</secondary> |
| </indexterm> |
| |
| <sect1 id="btree-intro"> |
| <title>Introduction</title> |
| |
| <para> |
| <productname>PostgreSQL</productname> includes an implementation of the |
| standard <acronym>btree</acronym> (multi-way balanced tree) index data |
| structure. Any data type that can be sorted into a well-defined linear |
| order can be indexed by a btree index. The only limitation is that an |
| index entry cannot exceed approximately one-third of a page (after TOAST |
| compression, if applicable). |
| </para> |
| |
| <para> |
| Because each btree operator class imposes a sort order on its data type, |
| btree operator classes (or, really, operator families) have come to be |
| used as <productname>PostgreSQL</productname>'s general representation |
| and understanding of sorting semantics. Therefore, they've acquired |
| some features that go beyond what would be needed just to support btree |
| indexes, and parts of the system that are quite distant from the |
| btree AM make use of them. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="btree-behavior"> |
| <title>Behavior of B-Tree Operator Classes</title> |
| |
| <para> |
| As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator |
| class must provide five comparison operators, |
| <literal><</literal>, |
| <literal><=</literal>, |
| <literal>=</literal>, |
| <literal>>=</literal> and |
| <literal>></literal>. |
| One might expect that <literal><></literal> should also be part of |
| the operator class, but it is not, because it would almost never be |
| useful to use a <literal><></literal> WHERE clause in an index |
| search. (For some purposes, the planner treats <literal><></literal> |
| as associated with a btree operator class; but it finds that operator via |
| the <literal>=</literal> operator's negator link, rather than |
| from <structname>pg_amop</structname>.) |
| </para> |
| |
| <para> |
| When several data types share near-identical sorting semantics, their |
| operator classes can be grouped into an operator family. Doing so is |
| advantageous because it allows the planner to make deductions about |
| cross-type comparisons. Each operator class within the family should |
| contain the single-type operators (and associated support functions) |
| for its input data type, while cross-type comparison operators and |
| support functions are <quote>loose</quote> in the family. It is |
| recommendable that a complete set of cross-type operators be included |
| in the family, thus ensuring that the planner can represent any |
| comparison conditions that it deduces from transitivity. |
| </para> |
| |
| <para> |
| There are some basic assumptions that a btree operator family must |
| satisfy: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| An <literal>=</literal> operator must be an equivalence relation; that |
| is, for all non-null values <replaceable>A</replaceable>, |
| <replaceable>B</replaceable>, <replaceable>C</replaceable> of the |
| data type: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <replaceable>A</replaceable> <literal>=</literal> |
| <replaceable>A</replaceable> is true |
| (<firstterm>reflexive law</firstterm>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <replaceable>A</replaceable> <literal>=</literal> |
| <replaceable>B</replaceable>, |
| then <replaceable>B</replaceable> <literal>=</literal> |
| <replaceable>A</replaceable> |
| (<firstterm>symmetric law</firstterm>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <replaceable>A</replaceable> <literal>=</literal> |
| <replaceable>B</replaceable> and <replaceable>B</replaceable> |
| <literal>=</literal> <replaceable>C</replaceable>, |
| then <replaceable>A</replaceable> <literal>=</literal> |
| <replaceable>C</replaceable> |
| (<firstterm>transitive law</firstterm>) |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| A <literal><</literal> operator must be a strong ordering relation; |
| that is, for all non-null values <replaceable>A</replaceable>, |
| <replaceable>B</replaceable>, <replaceable>C</replaceable>: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <replaceable>A</replaceable> <literal><</literal> |
| <replaceable>A</replaceable> is false |
| (<firstterm>irreflexive law</firstterm>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <replaceable>A</replaceable> <literal><</literal> |
| <replaceable>B</replaceable> |
| and <replaceable>B</replaceable> <literal><</literal> |
| <replaceable>C</replaceable>, |
| then <replaceable>A</replaceable> <literal><</literal> |
| <replaceable>C</replaceable> |
| (<firstterm>transitive law</firstterm>) |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Furthermore, the ordering is total; that is, for all non-null |
| values <replaceable>A</replaceable>, <replaceable>B</replaceable>: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| exactly one of <replaceable>A</replaceable> <literal><</literal> |
| <replaceable>B</replaceable>, <replaceable>A</replaceable> |
| <literal>=</literal> <replaceable>B</replaceable>, and |
| <replaceable>B</replaceable> <literal><</literal> |
| <replaceable>A</replaceable> is true |
| (<firstterm>trichotomy law</firstterm>) |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| (The trichotomy law justifies the definition of the comparison support |
| function, of course.) |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| The other three operators are defined in terms of <literal>=</literal> |
| and <literal><</literal> in the obvious way, and must act consistently |
| with them. |
| </para> |
| |
| <para> |
| For an operator family supporting multiple data types, the above laws must |
| hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>, |
| <replaceable>C</replaceable> are taken from any data types in the family. |
| The transitive laws are the trickiest to ensure, as in cross-type |
| situations they represent statements that the behaviors of two or three |
| different operators are consistent. |
| As an example, it would not work to put <type>float8</type> |
| and <type>numeric</type> into the same operator family, at least not with |
| the current semantics that <type>numeric</type> values are converted |
| to <type>float8</type> for comparison to a <type>float8</type>. Because |
| of the limited accuracy of <type>float8</type>, this means there are |
| distinct <type>numeric</type> values that will compare equal to the |
| same <type>float8</type> value, and thus the transitive law would fail. |
| </para> |
| |
| <para> |
| Another requirement for a multiple-data-type family is that any implicit |
| or binary-coercion casts that are defined between data types included in |
| the operator family must not change the associated sort ordering. |
| </para> |
| |
| <para> |
| It should be fairly clear why a btree index requires these laws to hold |
| within a single data type: without them there is no ordering to arrange |
| the keys with. Also, index searches using a comparison key of a |
| different data type require comparisons to behave sanely across two |
| data types. The extensions to three or more data types within a family |
| are not strictly required by the btree index mechanism itself, but the |
| planner relies on them for optimization purposes. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="btree-support-funcs"> |
| <title>B-Tree Support Functions</title> |
| |
| <para> |
| As shown in <xref linkend="xindex-btree-support-table"/>, btree defines |
| one required and four optional support functions. The five |
| user-defined methods are: |
| </para> |
| <variablelist> |
| <varlistentry> |
| <term><function>order</function></term> |
| <listitem> |
| <para> |
| For each combination of data types that a btree operator family |
| provides comparison operators for, it must provide a comparison |
| support function, registered in |
| <structname>pg_amproc</structname> with support function number 1 |
| and |
| <structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield> |
| equal to the left and right data types for the comparison (i.e., |
| the same data types that the matching operators are registered |
| with in <structname>pg_amop</structname>). The comparison |
| function must take two non-null values |
| <replaceable>A</replaceable> and <replaceable>B</replaceable> and |
| return an <type>int32</type> value that is |
| <literal><</literal> <literal>0</literal>, |
| <literal>0</literal>, or <literal>></literal> |
| <literal>0</literal> when <replaceable>A</replaceable> |
| <literal><</literal> <replaceable>B</replaceable>, |
| <replaceable>A</replaceable> <literal>=</literal> |
| <replaceable>B</replaceable>, or <replaceable>A</replaceable> |
| <literal>></literal> <replaceable>B</replaceable>, |
| respectively. A null result is disallowed: all values of the |
| data type must be comparable. See |
| <filename>src/backend/access/nbtree/nbtcompare.c</filename> for |
| examples. |
| </para> |
| |
| <para> |
| If the compared values are of a collatable data type, the |
| appropriate collation OID will be passed to the comparison |
| support function, using the standard |
| <function>PG_GET_COLLATION()</function> mechanism. |
| </para> |
| </listitem> |
| </varlistentry> |
| <varlistentry> |
| <term><function>sortsupport</function></term> |
| <listitem> |
| <para> |
| Optionally, a btree operator family may provide <firstterm>sort |
| support</firstterm> function(s), registered under support |
| function number 2. These functions allow implementing |
| comparisons for sorting purposes in a more efficient way than |
| naively calling the comparison support function. The APIs |
| involved in this are defined in |
| <filename>src/include/utils/sortsupport.h</filename>. |
| </para> |
| </listitem> |
| </varlistentry> |
| <varlistentry> |
| <term><function>in_range</function></term> |
| <listitem> |
| <indexterm> |
| <primary>in_range support functions</primary> |
| </indexterm> |
| |
| <indexterm> |
| <primary>support functions</primary> |
| <secondary>in_range</secondary> |
| </indexterm> |
| <para> |
| Optionally, a btree operator family may provide |
| <firstterm>in_range</firstterm> support function(s), registered |
| under support function number 3. These are not used during btree |
| index operations; rather, they extend the semantics of the |
| operator family so that it can support window clauses containing |
| the <literal>RANGE</literal> <replaceable>offset</replaceable> |
| <literal>PRECEDING</literal> and <literal>RANGE</literal> |
| <replaceable>offset</replaceable> <literal>FOLLOWING</literal> |
| frame bound types (see <xref |
| linkend="syntax-window-functions"/>). Fundamentally, the extra |
| information provided is how to add or subtract an |
| <replaceable>offset</replaceable> value in a way that is |
| compatible with the family's data ordering. |
| </para> |
| |
| <para> |
| An <function>in_range</function> function must have the signature |
| <synopsis> |
| in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool) |
| returns bool |
| </synopsis> |
| <replaceable>val</replaceable> and |
| <replaceable>base</replaceable> must be of the same type, which |
| is one of the types supported by the operator family (i.e., a |
| type for which it provides an ordering). However, |
| <replaceable>offset</replaceable> could be of a different type, |
| which might be one otherwise unsupported by the family. An |
| example is that the built-in <literal>time_ops</literal> family |
| provides an <function>in_range</function> function that has |
| <replaceable>offset</replaceable> of type <type>interval</type>. |
| A family can provide <function>in_range</function> functions for |
| any of its supported types and one or more |
| <replaceable>offset</replaceable> types. Each |
| <function>in_range</function> function should be entered in |
| <structname>pg_amproc</structname> with |
| <structfield>amproclefttype</structfield> equal to |
| <type>type1</type> and <structfield>amprocrighttype</structfield> |
| equal to <type>type2</type>. |
| </para> |
| |
| <para> |
| The essential semantics of an <function>in_range</function> |
| function depend on the two Boolean flag parameters. It should |
| add or subtract <replaceable>base</replaceable> and |
| <replaceable>offset</replaceable>, then compare |
| <replaceable>val</replaceable> to the result, as follows: |
| <itemizedlist> |
| <listitem> |
| <para> |
| if <literal>!</literal><replaceable>sub</replaceable> and |
| <literal>!</literal><replaceable>less</replaceable>, return |
| <replaceable>val</replaceable> <literal>>=</literal> |
| (<replaceable>base</replaceable> <literal>+</literal> |
| <replaceable>offset</replaceable>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <literal>!</literal><replaceable>sub</replaceable> and |
| <replaceable>less</replaceable>, return |
| <replaceable>val</replaceable> <literal><=</literal> |
| (<replaceable>base</replaceable> <literal>+</literal> |
| <replaceable>offset</replaceable>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <replaceable>sub</replaceable> and |
| <literal>!</literal><replaceable>less</replaceable>, return |
| <replaceable>val</replaceable> <literal>>=</literal> |
| (<replaceable>base</replaceable> <literal>-</literal> |
| <replaceable>offset</replaceable>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| if <replaceable>sub</replaceable> and |
| <replaceable>less</replaceable>, return |
| <replaceable>val</replaceable> <literal><=</literal> |
| (<replaceable>base</replaceable> <literal>-</literal> |
| <replaceable>offset</replaceable>) |
| </para> |
| </listitem> |
| </itemizedlist> |
| Before doing so, the function should check the sign of |
| <replaceable>offset</replaceable>: if it is less than zero, raise |
| error |
| <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal> |
| (22013) with error text like <quote>invalid preceding or |
| following size in window function</quote>. (This is required by |
| the SQL standard, although nonstandard operator families might |
| perhaps choose to ignore this restriction, since there seems to |
| be little semantic necessity for it.) This requirement is |
| delegated to the <function>in_range</function> function so that |
| the core code needn't understand what <quote>less than |
| zero</quote> means for a particular data type. |
| </para> |
| |
| <para> |
| An additional expectation is that <function>in_range</function> |
| functions should, if practical, avoid throwing an error if |
| <replaceable>base</replaceable> <literal>+</literal> |
| <replaceable>offset</replaceable> or |
| <replaceable>base</replaceable> <literal>-</literal> |
| <replaceable>offset</replaceable> would overflow. The correct |
| comparison result can be determined even if that value would be |
| out of the data type's range. Note that if the data type |
| includes concepts such as <quote>infinity</quote> or |
| <quote>NaN</quote>, extra care may be needed to ensure that |
| <function>in_range</function>'s results agree with the normal |
| sort order of the operator family. |
| </para> |
| |
| <para> |
| The results of the <function>in_range</function> function must be |
| consistent with the sort ordering imposed by the operator family. |
| To be precise, given any fixed values of |
| <replaceable>offset</replaceable> and |
| <replaceable>sub</replaceable>, then: |
| <itemizedlist> |
| <listitem> |
| <para> |
| If <function>in_range</function> with |
| <replaceable>less</replaceable> = true is true for some |
| <replaceable>val1</replaceable> and |
| <replaceable>base</replaceable>, it must be true for every |
| <replaceable>val2</replaceable> <literal><=</literal> |
| <replaceable>val1</replaceable> with the same |
| <replaceable>base</replaceable>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| If <function>in_range</function> with |
| <replaceable>less</replaceable> = true is false for some |
| <replaceable>val1</replaceable> and |
| <replaceable>base</replaceable>, it must be false for every |
| <replaceable>val2</replaceable> <literal>>=</literal> |
| <replaceable>val1</replaceable> with the same |
| <replaceable>base</replaceable>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| If <function>in_range</function> with |
| <replaceable>less</replaceable> = true is true for some |
| <replaceable>val</replaceable> and |
| <replaceable>base1</replaceable>, it must be true for every |
| <replaceable>base2</replaceable> <literal>>=</literal> |
| <replaceable>base1</replaceable> with the same |
| <replaceable>val</replaceable>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| If <function>in_range</function> with |
| <replaceable>less</replaceable> = true is false for some |
| <replaceable>val</replaceable> and |
| <replaceable>base1</replaceable>, it must be false for every |
| <replaceable>base2</replaceable> <literal><=</literal> |
| <replaceable>base1</replaceable> with the same |
| <replaceable>val</replaceable>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| Analogous statements with inverted conditions hold when |
| <replaceable>less</replaceable> = false. |
| </para> |
| |
| <para> |
| If the type being ordered (<type>type1</type>) is collatable, the |
| appropriate collation OID will be passed to the |
| <function>in_range</function> function, using the standard |
| PG_GET_COLLATION() mechanism. |
| </para> |
| |
| <para> |
| <function>in_range</function> functions need not handle NULL |
| inputs, and typically will be marked strict. |
| </para> |
| </listitem> |
| </varlistentry> |
| <varlistentry> |
| <term><function>equalimage</function></term> |
| <listitem> |
| <para> |
| Optionally, a btree operator family may provide |
| <function>equalimage</function> (<quote>equality implies image |
| equality</quote>) support functions, registered under support |
| function number 4. These functions allow the core code to |
| determine when it is safe to apply the btree deduplication |
| optimization. Currently, <function>equalimage</function> |
| functions are only called when building or rebuilding an index. |
| </para> |
| <para> |
| An <function>equalimage</function> function must have the |
| signature |
| <synopsis> |
| equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool |
| </synopsis> |
| The return value is static information about an operator class |
| and collation. Returning <literal>true</literal> indicates that |
| the <function>order</function> function for the operator class is |
| guaranteed to only return <literal>0</literal> (<quote>arguments |
| are equal</quote>) when its <replaceable>A</replaceable> and |
| <replaceable>B</replaceable> arguments are also interchangeable |
| without any loss of semantic information. Not registering an |
| <function>equalimage</function> function or returning |
| <literal>false</literal> indicates that this condition cannot be |
| assumed to hold. |
| </para> |
| <para> |
| The <replaceable>opcintype</replaceable> argument is the |
| <literal><structname>pg_type</structname>.oid</literal> of the |
| data type that the operator class indexes. This is a convenience |
| that allows reuse of the same underlying |
| <function>equalimage</function> function across operator classes. |
| If <replaceable>opcintype</replaceable> is a collatable data |
| type, the appropriate collation OID will be passed to the |
| <function>equalimage</function> function, using the standard |
| <function>PG_GET_COLLATION()</function> mechanism. |
| </para> |
| <para> |
| As far as the operator class is concerned, returning |
| <literal>true</literal> indicates that deduplication is safe (or |
| safe for the collation whose OID was passed to its |
| <function>equalimage</function> function). However, the core |
| code will only deem deduplication safe for an index when |
| <emphasis>every</emphasis> indexed column uses an operator class |
| that registers an <function>equalimage</function> function, and |
| each function actually returns <literal>true</literal> when |
| called. |
| </para> |
| <para> |
| Image equality is <emphasis>almost</emphasis> the same condition |
| as simple bitwise equality. There is one subtle difference: When |
| indexing a varlena data type, the on-disk representation of two |
| image equal datums may not be bitwise equal due to inconsistent |
| application of <acronym>TOAST</acronym> compression on input. |
| Formally, when an operator class's |
| <function>equalimage</function> function returns |
| <literal>true</literal>, it is safe to assume that the |
| <literal>datum_image_eq()</literal> C function will always agree |
| with the operator class's <function>order</function> function |
| (provided that the same collation OID is passed to both the |
| <function>equalimage</function> and <function>order</function> |
| functions). |
| </para> |
| <para> |
| The core code is fundamentally unable to deduce anything about |
| the <quote>equality implies image equality</quote> status of an |
| operator class within a multiple-data-type family based on |
| details from other operator classes in the same family. Also, it |
| is not sensible for an operator family to register a cross-type |
| <function>equalimage</function> function, and attempting to do so |
| will result in an error. This is because <quote>equality implies |
| image equality</quote> status does not just depend on |
| sorting/equality semantics, which are more or less defined at the |
| operator family level. In general, the semantics that one |
| particular data type implements must be considered separately. |
| </para> |
| <para> |
| The convention followed by the operator classes included with the |
| core <productname>PostgreSQL</productname> distribution is to |
| register a stock, generic <function>equalimage</function> |
| function. Most operator classes register |
| <function>btequalimage()</function>, which indicates that |
| deduplication is safe unconditionally. Operator classes for |
| collatable data types such as <type>text</type> register |
| <function>btvarstrequalimage()</function>, which indicates that |
| deduplication is safe with deterministic collations. Best |
| practice for third-party extensions is to register their own |
| custom function to retain control. |
| </para> |
| </listitem> |
| </varlistentry> |
| <varlistentry> |
| <term><function>options</function></term> |
| <listitem> |
| <para> |
| Optionally, a B-tree operator family may provide |
| <function>options</function> (<quote>operator class specific |
| options</quote>) support functions, registered under support |
| function number 5. These functions define a set of user-visible |
| parameters that control operator class behavior. |
| </para> |
| <para> |
| An <function>options</function> support function must have the |
| signature |
| <synopsis> |
| options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns void |
| </synopsis> |
| The function is passed a pointer to a <structname>local_relopts</structname> |
| struct, which needs to be filled with a set of operator class |
| specific options. The options can be accessed from other support |
| functions using the <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and |
| <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros. |
| </para> |
| <para> |
| Currently, no B-Tree operator class has an <function>options</function> |
| support function. B-tree doesn't allow flexible representation of keys |
| like GiST, SP-GiST, GIN and BRIN do. So, <function>options</function> |
| probably doesn't have much application in the current B-tree index |
| access method. Nevertheless, this support function was added to B-tree |
| for uniformity, and will probably find uses during further |
| evolution of B-tree in <productname>PostgreSQL</productname>. |
| </para> |
| </listitem> |
| </varlistentry> |
| </variablelist> |
| |
| </sect1> |
| |
| <sect1 id="btree-implementation"> |
| <title>Implementation</title> |
| |
| <para> |
| This section covers B-Tree index implementation details that may be |
| of use to advanced users. See |
| <filename>src/backend/access/nbtree/README</filename> in the source |
| distribution for a much more detailed, internals-focused description |
| of the B-Tree implementation. |
| </para> |
| <sect2 id="btree-structure"> |
| <title>B-Tree Structure</title> |
| <para> |
| <productname>PostgreSQL</productname> B-Tree indexes are |
| multi-level tree structures, where each level of the tree can be |
| used as a doubly-linked list of pages. A single metapage is stored |
| in a fixed position at the start of the first segment file of the |
| index. All other pages are either leaf pages or internal pages. |
| Leaf pages are the pages on the lowest level of the tree. All |
| other levels consist of internal pages. Each leaf page contains |
| tuples that point to table rows. Each internal page contains |
| tuples that point to the next level down in the tree. Typically, |
| over 99% of all pages are leaf pages. Both internal pages and leaf |
| pages use the standard page format described in <xref |
| linkend="storage-page-layout"/>. |
| </para> |
| <para> |
| New leaf pages are added to a B-Tree index when an existing leaf |
| page cannot fit an incoming tuple. A <firstterm>page |
| split</firstterm> operation makes room for items that originally |
| belonged on the overflowing page by moving a portion of the items |
| to a new page. Page splits must also insert a new |
| <firstterm>downlink</firstterm> to the new page in the parent page, |
| which may cause the parent to split in turn. Page splits |
| <quote>cascade upwards</quote> in a recursive fashion. When the |
| root page finally cannot fit a new downlink, a <firstterm>root page |
| split</firstterm> operation takes place. This adds a new level to |
| the tree structure by creating a new root page that is one level |
| above the original root page. |
| </para> |
| </sect2> |
| |
| <sect2 id="btree-deletion"> |
| <title>Bottom-up Index Deletion</title> |
| <para> |
| B-Tree indexes are not directly aware that under MVCC, there might |
| be multiple extant versions of the same logical table row; to an |
| index, each tuple is an independent object that needs its own index |
| entry. <quote>Version churn</quote> tuples may sometimes |
| accumulate and adversely affect query latency and throughput. This |
| typically occurs with <command>UPDATE</command>-heavy workloads |
| where most individual updates cannot apply the |
| <acronym>HOT</acronym> optimization. Changing the value of only |
| one column covered by one index during an <command>UPDATE</command> |
| <emphasis>always</emphasis> necessitates a new set of index tuples |
| — one for <emphasis>each and every</emphasis> index on the |
| table. Note in particular that this includes indexes that were not |
| <quote>logically modified</quote> by the <command>UPDATE</command>. |
| All indexes will need a successor physical index tuple that points |
| to the latest version in the table. Each new tuple within each |
| index will generally need to coexist with the original |
| <quote>updated</quote> tuple for a short period of time (typically |
| until shortly after the <command>UPDATE</command> transaction |
| commits). |
| </para> |
| <para> |
| B-Tree indexes incrementally delete version churn index tuples by |
| performing <firstterm>bottom-up index deletion</firstterm> passes. |
| Each deletion pass is triggered in reaction to an anticipated |
| <quote>version churn page split</quote>. This only happens with |
| indexes that are not logically modified by |
| <command>UPDATE</command> statements, where concentrated build up |
| of obsolete versions in particular pages would occur otherwise. A |
| page split will usually be avoided, though it's possible that |
| certain implementation-level heuristics will fail to identify and |
| delete even one garbage index tuple (in which case a page split or |
| deduplication pass resolves the issue of an incoming new tuple not |
| fitting on a leaf page). The worst-case number of versions that |
| any index scan must traverse (for any single logical row) is an |
| important contributor to overall system responsiveness and |
| throughput. A bottom-up index deletion pass targets suspected |
| garbage tuples in a single leaf page based on |
| <emphasis>qualitative</emphasis> distinctions involving logical |
| rows and versions. This contrasts with the <quote>top-down</quote> |
| index cleanup performed by autovacuum workers, which is triggered |
| when certain <emphasis>quantitative</emphasis> table-level |
| thresholds are exceeded (see <xref linkend="autovacuum"/>). |
| </para> |
| <note> |
| <para> |
| Not all deletion operations that are performed within B-Tree |
| indexes are bottom-up deletion operations. There is a distinct |
| category of index tuple deletion: <firstterm>simple index tuple |
| deletion</firstterm>. This is a deferred maintenance operation |
| that deletes index tuples that are known to be safe to delete |
| (those whose item identifier's <literal>LP_DEAD</literal> bit is |
| already set). Like bottom-up index deletion, simple index |
| deletion takes place at the point that a page split is anticipated |
| as a way of avoiding the split. |
| </para> |
| <para> |
| Simple deletion is opportunistic in the sense that it can only |
| take place when recent index scans set the |
| <literal>LP_DEAD</literal> bits of affected items in passing. |
| Prior to <productname>PostgreSQL</productname> 14, the only |
| category of B-Tree deletion was simple deletion. The main |
| differences between it and bottom-up deletion are that only the |
| former is opportunistically driven by the activity of passing |
| index scans, while only the latter specifically targets version |
| churn from <command>UPDATE</command>s that do not logically modify |
| indexed columns. |
| </para> |
| </note> |
| <para> |
| Bottom-up index deletion performs the vast majority of all garbage |
| index tuple cleanup for particular indexes with certain workloads. |
| This is expected with any B-Tree index that is subject to |
| significant version churn from <command>UPDATE</command>s that |
| rarely or never logically modify the columns that the index covers. |
| The average and worst-case number of versions per logical row can |
| be kept low purely through targeted incremental deletion passes. |
| It's quite possible that the on-disk size of certain indexes will |
| never increase by even one single page/block despite |
| <emphasis>constant</emphasis> version churn from |
| <command>UPDATE</command>s. Even then, an exhaustive <quote>clean |
| sweep</quote> by a <command>VACUUM</command> operation (typically |
| run in an autovacuum worker process) will eventually be required as |
| a part of <emphasis>collective</emphasis> cleanup of the table and |
| each of its indexes. |
| </para> |
| <para> |
| Unlike <command>VACUUM</command>, bottom-up index deletion does not |
| provide any strong guarantees about how old the oldest garbage |
| index tuple may be. No index can be permitted to retain |
| <quote>floating garbage</quote> index tuples that became dead prior |
| to a conservative cutoff point shared by the table and all of its |
| indexes collectively. This fundamental table-level invariant makes |
| it safe to recycle table <acronym>TID</acronym>s. This is how it |
| is possible for distinct logical rows to reuse the same table |
| <acronym>TID</acronym> over time (though this can never happen with |
| two logical rows whose lifetimes span the same |
| <command>VACUUM</command> cycle). |
| </para> |
| </sect2> |
| |
| <sect2 id="btree-deduplication"> |
| <title>Deduplication</title> |
| <para> |
| A duplicate is a leaf page tuple (a tuple that points to a table |
| row) where <emphasis>all</emphasis> indexed key columns have values |
| that match corresponding column values from at least one other leaf |
| page tuple in the same index. Duplicate tuples are quite common in |
| practice. B-Tree indexes can use a special, space-efficient |
| representation for duplicates when an optional technique is |
| enabled: <firstterm>deduplication</firstterm>. |
| </para> |
| <para> |
| Deduplication works by periodically merging groups of duplicate |
| tuples together, forming a single <firstterm>posting list</firstterm> tuple for each |
| group. The column key value(s) only appear once in this |
| representation. This is followed by a sorted array of |
| <acronym>TID</acronym>s that point to rows in the table. This |
| significantly reduces the storage size of indexes where each value |
| (or each distinct combination of column values) appears several |
| times on average. The latency of queries can be reduced |
| significantly. Overall query throughput may increase |
| significantly. The overhead of routine index vacuuming may also be |
| reduced significantly. |
| </para> |
| <note> |
| <para> |
| B-Tree deduplication is just as effective with |
| <quote>duplicates</quote> that contain a NULL value, even though |
| NULL values are never equal to each other according to the |
| <literal>=</literal> member of any B-Tree operator class. As far |
| as any part of the implementation that understands the on-disk |
| B-Tree structure is concerned, NULL is just another value from the |
| domain of indexed values. |
| </para> |
| </note> |
| <para> |
| The deduplication process occurs lazily, when a new item is |
| inserted that cannot fit on an existing leaf page, though only when |
| index tuple deletion could not free sufficient space for the new |
| item (typically deletion is briefly considered and then skipped |
| over). Unlike GIN posting list tuples, B-Tree posting list tuples |
| do not need to expand every time a new duplicate is inserted; they |
| are merely an alternative physical representation of the original |
| logical contents of the leaf page. This design prioritizes |
| consistent performance with mixed read-write workloads. Most |
| client applications will at least see a moderate performance |
| benefit from using deduplication. Deduplication is enabled by |
| default. |
| </para> |
| <para> |
| <command>CREATE INDEX</command> and <command>REINDEX</command> |
| apply deduplication to create posting list tuples, though the |
| strategy they use is slightly different. Each group of duplicate |
| ordinary tuples encountered in the sorted input taken from the |
| table is merged into a posting list tuple |
| <emphasis>before</emphasis> being added to the current pending leaf |
| page. Individual posting list tuples are packed with as many |
| <acronym>TID</acronym>s as possible. Leaf pages are written out in |
| the usual way, without any separate deduplication pass. This |
| strategy is well-suited to <command>CREATE INDEX</command> and |
| <command>REINDEX</command> because they are once-off batch |
| operations. |
| </para> |
| <para> |
| Write-heavy workloads that don't benefit from deduplication due to |
| having few or no duplicate values in indexes will incur a small, |
| fixed performance penalty (unless deduplication is explicitly |
| disabled). The <literal>deduplicate_items</literal> storage |
| parameter can be used to disable deduplication within individual |
| indexes. There is never any performance penalty with read-only |
| workloads, since reading posting list tuples is at least as |
| efficient as reading the standard tuple representation. Disabling |
| deduplication isn't usually helpful. |
| </para> |
| <para> |
| It is sometimes possible for unique indexes (as well as unique |
| constraints) to use deduplication. This allows leaf pages to |
| temporarily <quote>absorb</quote> extra version churn duplicates. |
| Deduplication in unique indexes augments bottom-up index deletion, |
| especially in cases where a long-running transaction holds a |
| snapshot that blocks garbage collection. The goal is to buy time |
| for the bottom-up index deletion strategy to become effective |
| again. Delaying page splits until a single long-running |
| transaction naturally goes away can allow a bottom-up deletion pass |
| to succeed where an earlier deletion pass failed. |
| </para> |
| <tip> |
| <para> |
| A special heuristic is applied to determine whether a |
| deduplication pass in a unique index should take place. It can |
| often skip straight to splitting a leaf page, avoiding a |
| performance penalty from wasting cycles on unhelpful deduplication |
| passes. If you're concerned about the overhead of deduplication, |
| consider setting <literal>deduplicate_items = off</literal> |
| selectively. Leaving deduplication enabled in unique indexes has |
| little downside. |
| </para> |
| </tip> |
| <para> |
| Deduplication cannot be used in all cases due to |
| implementation-level restrictions. Deduplication safety is |
| determined when <command>CREATE INDEX</command> or |
| <command>REINDEX</command> is run. |
| </para> |
| <para> |
| Note that deduplication is deemed unsafe and cannot be used in the |
| following cases involving semantically significant differences |
| among equal datums: |
| </para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <type>text</type>, <type>varchar</type>, and <type>char</type> |
| cannot use deduplication when a |
| <emphasis>nondeterministic</emphasis> collation is used. Case |
| and accent differences must be preserved among equal datums. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <type>numeric</type> cannot use deduplication. Numeric display |
| scale must be preserved among equal datums. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <type>jsonb</type> cannot use deduplication, since the |
| <type>jsonb</type> B-Tree operator class uses |
| <type>numeric</type> internally. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <type>float4</type> and <type>float8</type> cannot use |
| deduplication. These types have distinct representations for |
| <literal>-0</literal> and <literal>0</literal>, which are |
| nevertheless considered equal. This difference must be |
| preserved. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| There is one further implementation-level restriction that may be |
| lifted in a future version of |
| <productname>PostgreSQL</productname>: |
| </para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Container types (such as composite types, arrays, or range |
| types) cannot use deduplication. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| There is one further implementation-level restriction that applies |
| regardless of the operator class or collation used: |
| </para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <literal>INCLUDE</literal> indexes can never use deduplication. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| </sect2> |
| </sect1> |
| |
| </chapter> |