| <!-- doc/src/sgml/indices.sgml --> |
| |
| <chapter id="indexes"> |
| <title>Indexes</title> |
| |
| <indexterm zone="indexes"> |
| <primary>index</primary> |
| </indexterm> |
| |
| <para> |
| Indexes are a common way to enhance database performance. An index |
| allows the database server to find and retrieve specific rows much |
| faster than it could do without an index. But indexes also add |
| overhead to the database system as a whole, so they should be used |
| sensibly. |
| </para> |
| |
| |
| <sect1 id="indexes-intro"> |
| <title>Introduction</title> |
| |
| <para> |
| Suppose we have a table similar to this: |
| <programlisting> |
| CREATE TABLE test1 ( |
| id integer, |
| content varchar |
| ); |
| </programlisting> |
| and the application issues many queries of the form: |
| <programlisting> |
| SELECT content FROM test1 WHERE id = <replaceable>constant</replaceable>; |
| </programlisting> |
| With no advance preparation, the system would have to scan the entire |
| <structname>test1</structname> table, row by row, to find all |
| matching entries. If there are many rows in |
| <structname>test1</structname> and only a few rows (perhaps zero |
| or one) that would be returned by such a query, this is clearly an |
| inefficient method. But if the system has been instructed to maintain an |
| index on the <structfield>id</structfield> column, it can use a more |
| efficient method for locating matching rows. For instance, it |
| might only have to walk a few levels deep into a search tree. |
| </para> |
| |
| <para> |
| A similar approach is used in most non-fiction books: terms and |
| concepts that are frequently looked up by readers are collected in |
| an alphabetic index at the end of the book. The interested reader |
| can scan the index relatively quickly and flip to the appropriate |
| page(s), rather than having to read the entire book to find the |
| material of interest. Just as it is the task of the author to |
| anticipate the items that readers are likely to look up, |
| it is the task of the database programmer to foresee which indexes |
| will be useful. |
| </para> |
| |
| <para> |
| The following command can be used to create an index on the |
| <structfield>id</structfield> column, as discussed: |
| <programlisting> |
| CREATE INDEX test1_id_index ON test1 (id); |
| </programlisting> |
| The name <structname>test1_id_index</structname> can be chosen |
| freely, but you should pick something that enables you to remember |
| later what the index was for. |
| </para> |
| |
| <para> |
| To remove an index, use the <command>DROP INDEX</command> command. |
| Indexes can be added to and removed from tables at any time. |
| </para> |
| |
| <para> |
| Once an index is created, no further intervention is required: the |
| system will update the index when the table is modified, and it will |
| use the index in queries when it thinks doing so would be more efficient |
| than a sequential table scan. But you might have to run the |
| <command>ANALYZE</command> command regularly to update |
| statistics to allow the query planner to make educated decisions. |
| See <xref linkend="performance-tips"/> for information about |
| how to find out whether an index is used and when and why the |
| planner might choose <emphasis>not</emphasis> to use an index. |
| </para> |
| |
| <para> |
| Indexes can also benefit <command>UPDATE</command> and |
| <command>DELETE</command> commands with search conditions. |
| Indexes can moreover be used in join searches. Thus, |
| an index defined on a column that is part of a join condition can |
| also significantly speed up queries with joins. |
| </para> |
| |
| <para> |
| Creating an index on a large table can take a long time. By default, |
| <productname>PostgreSQL</productname> allows reads (<command>SELECT</command> statements) to occur |
| on the table in parallel with index creation, but writes (<command>INSERT</command>, |
| <command>UPDATE</command>, <command>DELETE</command>) are blocked until the index build is finished. |
| In production environments this is often unacceptable. |
| It is possible to allow writes to occur in parallel with index |
| creation, but there are several caveats to be aware of — |
| for more information see <xref linkend="sql-createindex-concurrently"/>. |
| </para> |
| |
| <para> |
| After an index is created, the system has to keep it synchronized with the |
| table. This adds overhead to data manipulation operations. |
| Therefore indexes that are seldom or never used in queries |
| should be removed. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-types"> |
| <title>Index Types</title> |
| |
| <para> |
| <productname>PostgreSQL</productname> provides several index types: |
| B-tree, Hash, GiST, SP-GiST, GIN and BRIN. |
| Each index type uses a different |
| algorithm that is best suited to different types of queries. |
| By default, the <link linkend="sql-createindex"><command>CREATE |
| INDEX</command></link> command creates |
| B-tree indexes, which fit the most common situations. |
| The other index types are selected by writing the keyword |
| <literal>USING</literal> followed by the index type name. |
| For example, to create a Hash index: |
| <programlisting> |
| CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING HASH (<replaceable>column</replaceable>); |
| </programlisting> |
| </para> |
| |
| <sect2 id="indexes-types-btree"> |
| <title>B-Tree</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>B-Tree</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>B-Tree</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| B-trees can handle equality and range queries on data that can be sorted |
| into some ordering. |
| In particular, the <productname>PostgreSQL</productname> query planner |
| will consider using a B-tree index whenever an indexed column is |
| involved in a comparison using one of these operators: |
| |
| <synopsis> |
| < <= = >= > |
| </synopsis> |
| |
| Constructs equivalent to combinations of these operators, such as |
| <literal>BETWEEN</literal> and <literal>IN</literal>, can also be implemented with |
| a B-tree index search. Also, an <literal>IS NULL</literal> or <literal>IS NOT |
| NULL</literal> condition on an index column can be used with a B-tree index. |
| </para> |
| |
| <para> |
| The optimizer can also use a B-tree index for queries involving the |
| pattern matching operators <literal>LIKE</literal> and <literal>~</literal> |
| <emphasis>if</emphasis> the pattern is a constant and is anchored to |
| the beginning of the string — for example, <literal>col LIKE |
| 'foo%'</literal> or <literal>col ~ '^foo'</literal>, but not |
| <literal>col LIKE '%bar'</literal>. However, if your database does not |
| use the C locale you will need to create the index with a special |
| operator class to support indexing of pattern-matching queries; see |
| <xref linkend="indexes-opclass"/> below. It is also possible to use |
| B-tree indexes for <literal>ILIKE</literal> and |
| <literal>~*</literal>, but only if the pattern starts with |
| non-alphabetic characters, i.e., characters that are not affected by |
| upper/lower case conversion. |
| </para> |
| |
| <para> |
| B-tree indexes can also be used to retrieve data in sorted order. |
| This is not always faster than a simple scan and sort, but it is |
| often helpful. |
| </para> |
| </sect2> |
| |
| <sect2 id="indexes-types-hash"> |
| <title>Hash</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>hash</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>hash</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| Hash indexes store a 32-bit hash code derived from the |
| value of the indexed column. Hence, |
| such indexes can only handle simple equality comparisons. |
| The query planner will consider using a hash index whenever an |
| indexed column is involved in a comparison using the |
| equal operator: |
| |
| <synopsis> |
| = |
| </synopsis> |
| </para> |
| </sect2> |
| |
| <sect2 id="indexes-type-gist"> |
| <title>GiST</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>GiST</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>GiST</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| GiST indexes are not a single kind of index, but rather an infrastructure |
| within which many different indexing strategies can be implemented. |
| Accordingly, the particular operators with which a GiST index can be |
| used vary depending on the indexing strategy (the <firstterm>operator |
| class</firstterm>). As an example, the standard distribution of |
| <productname>PostgreSQL</productname> includes GiST operator classes |
| for several two-dimensional geometric data types, which support indexed |
| queries using these operators: |
| |
| <synopsis> |
| << &< &> >> <<| &<| |&> |>> @> <@ ~= && |
| </synopsis> |
| |
| (See <xref linkend="functions-geometry"/> for the meaning of |
| these operators.) |
| The GiST operator classes included in the standard distribution are |
| documented in <xref linkend="gist-builtin-opclasses-table"/>. |
| Many other GiST operator |
| classes are available in the <literal>contrib</literal> collection or as separate |
| projects. For more information see <xref linkend="gist"/>. |
| </para> |
| |
| <para> |
| GiST indexes are also capable of optimizing <quote>nearest-neighbor</quote> |
| searches, such as |
| <programlisting><![CDATA[ |
| SELECT * FROM places ORDER BY location <-> point '(101,456)' LIMIT 10; |
| ]]> |
| </programlisting> |
| which finds the ten places closest to a given target point. The ability |
| to do this is again dependent on the particular operator class being used. |
| In <xref linkend="gist-builtin-opclasses-table"/>, operators that can be |
| used in this way are listed in the column <quote>Ordering Operators</quote>. |
| </para> |
| </sect2> |
| |
| <sect2 id="indexes-type-spgist"> |
| <title>SP-GiST</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>SP-GiST</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>SP-GiST</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| SP-GiST indexes, like GiST indexes, offer an infrastructure that supports |
| various kinds of searches. SP-GiST permits implementation of a wide range |
| of different non-balanced disk-based data structures, such as quadtrees, |
| k-d trees, and radix trees (tries). As an example, the standard distribution of |
| <productname>PostgreSQL</productname> includes SP-GiST operator classes |
| for two-dimensional points, which support indexed |
| queries using these operators: |
| |
| <synopsis> |
| << >> ~= <@ <<| |>> |
| </synopsis> |
| |
| (See <xref linkend="functions-geometry"/> for the meaning of |
| these operators.) |
| The SP-GiST operator classes included in the standard distribution are |
| documented in <xref linkend="spgist-builtin-opclasses-table"/>. |
| For more information see <xref linkend="spgist"/>. |
| </para> |
| |
| <para> |
| Like GiST, SP-GiST supports <quote>nearest-neighbor</quote> searches. |
| For SP-GiST operator classes that support distance ordering, the |
| corresponding operator is listed in the <quote>Ordering Operators</quote> |
| column in <xref linkend="spgist-builtin-opclasses-table"/>. |
| </para> |
| </sect2> |
| |
| <sect2 id="indexes-types-gin"> |
| <title>GIN</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>GIN</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>GIN</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| GIN indexes are <quote>inverted indexes</quote> which are appropriate for |
| data values that contain multiple component values, such as arrays. An |
| inverted index contains a separate entry for each component value, and |
| can efficiently handle queries that test for the presence of specific |
| component values. |
| </para> |
| |
| <para> |
| Like GiST and SP-GiST, GIN can support |
| many different user-defined indexing strategies, and the particular |
| operators with which a GIN index can be used vary depending on the |
| indexing strategy. |
| As an example, the standard distribution of |
| <productname>PostgreSQL</productname> includes a GIN operator class |
| for arrays, which supports indexed queries using these operators: |
| |
| <synopsis> |
| <@ @> = && |
| </synopsis> |
| |
| (See <xref linkend="functions-array"/> for the meaning of |
| these operators.) |
| The GIN operator classes included in the standard distribution are |
| documented in <xref linkend="gin-builtin-opclasses-table"/>. |
| Many other GIN operator |
| classes are available in the <literal>contrib</literal> collection or as separate |
| projects. For more information see <xref linkend="gin"/>. |
| </para> |
| </sect2> |
| |
| <sect2 id="indexes-types-brin"> |
| <title>BRIN</title> |
| |
| <indexterm> |
| <primary>index</primary> |
| <secondary>BRIN</secondary> |
| </indexterm> |
| <indexterm> |
| <primary>BRIN</primary> |
| <see>index</see> |
| </indexterm> |
| |
| <para> |
| BRIN indexes (a shorthand for Block Range INdexes) store summaries about |
| the values stored in consecutive physical block ranges of a table. |
| Thus, they are most effective for columns whose values are well-correlated |
| with the physical order of the table rows. |
| Like GiST, SP-GiST and GIN, |
| BRIN can support many different indexing strategies, |
| and the particular operators with which a BRIN index can be used |
| vary depending on the indexing strategy. |
| For data types that have a linear sort order, the indexed data |
| corresponds to the minimum and maximum values of the |
| values in the column for each block range. This supports indexed queries |
| using these operators: |
| |
| <synopsis> |
| < <= = >= > |
| </synopsis> |
| |
| The BRIN operator classes included in the standard distribution are |
| documented in <xref linkend="brin-builtin-opclasses-table"/>. |
| For more information see <xref linkend="brin"/>. |
| </para> |
| </sect2> |
| </sect1> |
| |
| |
| <sect1 id="indexes-multicolumn"> |
| <title>Multicolumn Indexes</title> |
| |
| <indexterm zone="indexes-multicolumn"> |
| <primary>index</primary> |
| <secondary>multicolumn</secondary> |
| </indexterm> |
| |
| <para> |
| An index can be defined on more than one column of a table. For example, if |
| you have a table of this form: |
| <programlisting> |
| CREATE TABLE test2 ( |
| major int, |
| minor int, |
| name varchar |
| ); |
| </programlisting> |
| (say, you keep your <filename class="directory">/dev</filename> |
| directory in a database...) and you frequently issue queries like: |
| <programlisting> |
| SELECT name FROM test2 WHERE major = <replaceable>constant</replaceable> AND minor = <replaceable>constant</replaceable>; |
| </programlisting> |
| then it might be appropriate to define an index on the columns |
| <structfield>major</structfield> and |
| <structfield>minor</structfield> together, e.g.: |
| <programlisting> |
| CREATE INDEX test2_mm_idx ON test2 (major, minor); |
| </programlisting> |
| </para> |
| |
| <para> |
| Currently, only the B-tree, GiST, GIN, and BRIN index types support |
| multiple-key-column indexes. Whether there can be multiple key |
| columns is independent of whether <literal>INCLUDE</literal> columns |
| can be added to the index. Indexes can have up to 32 columns, |
| including <literal>INCLUDE</literal> columns. (This limit can be |
| altered when building <productname>PostgreSQL</productname>; see the |
| file <filename>pg_config_manual.h</filename>.) |
| </para> |
| |
| <para> |
| A multicolumn B-tree index can be used with query conditions that |
| involve any subset of the index's columns, but the index is most |
| efficient when there are constraints on the leading (leftmost) columns. |
| The exact rule is that equality constraints on leading columns, plus |
| any inequality constraints on the first column that does not have an |
| equality constraint, will be used to limit the portion of the index |
| that is scanned. Constraints on columns to the right of these columns |
| are checked in the index, so they save visits to the table proper, but |
| they do not reduce the portion of the index that has to be scanned. |
| For example, given an index on <literal>(a, b, c)</literal> and a |
| query condition <literal>WHERE a = 5 AND b >= 42 AND c < 77</literal>, |
| the index would have to be scanned from the first entry with |
| <literal>a</literal> = 5 and <literal>b</literal> = 42 up through the last entry with |
| <literal>a</literal> = 5. Index entries with <literal>c</literal> >= 77 would be |
| skipped, but they'd still have to be scanned through. |
| This index could in principle be used for queries that have constraints |
| on <literal>b</literal> and/or <literal>c</literal> with no constraint on <literal>a</literal> |
| — but the entire index would have to be scanned, so in most cases |
| the planner would prefer a sequential table scan over using the index. |
| </para> |
| |
| <para> |
| A multicolumn GiST index can be used with query conditions that |
| involve any subset of the index's columns. Conditions on additional |
| columns restrict the entries returned by the index, but the condition on |
| the first column is the most important one for determining how much of |
| the index needs to be scanned. A GiST index will be relatively |
| ineffective if its first column has only a few distinct values, even if |
| there are many distinct values in additional columns. |
| </para> |
| |
| <para> |
| A multicolumn GIN index can be used with query conditions that |
| involve any subset of the index's columns. Unlike B-tree or GiST, |
| index search effectiveness is the same regardless of which index column(s) |
| the query conditions use. |
| </para> |
| |
| <para> |
| A multicolumn BRIN index can be used with query conditions that |
| involve any subset of the index's columns. Like GIN and unlike B-tree or |
| GiST, index search effectiveness is the same regardless of which index |
| column(s) the query conditions use. The only reason to have multiple BRIN |
| indexes instead of one multicolumn BRIN index on a single table is to have |
| a different <literal>pages_per_range</literal> storage parameter. |
| </para> |
| |
| <para> |
| Of course, each column must be used with operators appropriate to the index |
| type; clauses that involve other operators will not be considered. |
| </para> |
| |
| <para> |
| Multicolumn indexes should be used sparingly. In most situations, |
| an index on a single column is sufficient and saves space and time. |
| Indexes with more than three columns are unlikely to be helpful |
| unless the usage of the table is extremely stylized. See also |
| <xref linkend="indexes-bitmap-scans"/> and |
| <xref linkend="indexes-index-only-scans"/> for some discussion of the |
| merits of different index configurations. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-ordering"> |
| <title>Indexes and <literal>ORDER BY</literal></title> |
| |
| <indexterm zone="indexes-ordering"> |
| <primary>index</primary> |
| <secondary>and <literal>ORDER BY</literal></secondary> |
| </indexterm> |
| |
| <para> |
| In addition to simply finding the rows to be returned by a query, |
| an index may be able to deliver them in a specific sorted order. |
| This allows a query's <literal>ORDER BY</literal> specification to be honored |
| without a separate sorting step. Of the index types currently |
| supported by <productname>PostgreSQL</productname>, only B-tree |
| can produce sorted output — the other index types return |
| matching rows in an unspecified, implementation-dependent order. |
| </para> |
| |
| <para> |
| The planner will consider satisfying an <literal>ORDER BY</literal> specification |
| either by scanning an available index that matches the specification, |
| or by scanning the table in physical order and doing an explicit |
| sort. For a query that requires scanning a large fraction of the |
| table, an explicit sort is likely to be faster than using an index |
| because it requires |
| less disk I/O due to following a sequential access pattern. Indexes are |
| more useful when only a few rows need be fetched. An important |
| special case is <literal>ORDER BY</literal> in combination with |
| <literal>LIMIT</literal> <replaceable>n</replaceable>: an explicit sort will have to process |
| all the data to identify the first <replaceable>n</replaceable> rows, but if there is |
| an index matching the <literal>ORDER BY</literal>, the first <replaceable>n</replaceable> |
| rows can be retrieved directly, without scanning the remainder at all. |
| </para> |
| |
| <para> |
| By default, B-tree indexes store their entries in ascending order |
| with nulls last (table TID is treated as a tiebreaker column among |
| otherwise equal entries). This means that a forward scan of an |
| index on column <literal>x</literal> produces output satisfying <literal>ORDER BY x</literal> |
| (or more verbosely, <literal>ORDER BY x ASC NULLS LAST</literal>). The |
| index can also be scanned backward, producing output satisfying |
| <literal>ORDER BY x DESC</literal> |
| (or more verbosely, <literal>ORDER BY x DESC NULLS FIRST</literal>, since |
| <literal>NULLS FIRST</literal> is the default for <literal>ORDER BY DESC</literal>). |
| </para> |
| |
| <para> |
| You can adjust the ordering of a B-tree index by including the |
| options <literal>ASC</literal>, <literal>DESC</literal>, <literal>NULLS FIRST</literal>, |
| and/or <literal>NULLS LAST</literal> when creating the index; for example: |
| <programlisting> |
| CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST); |
| CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST); |
| </programlisting> |
| An index stored in ascending order with nulls first can satisfy |
| either <literal>ORDER BY x ASC NULLS FIRST</literal> or |
| <literal>ORDER BY x DESC NULLS LAST</literal> depending on which direction |
| it is scanned in. |
| </para> |
| |
| <para> |
| You might wonder why bother providing all four options, when two |
| options together with the possibility of backward scan would cover |
| all the variants of <literal>ORDER BY</literal>. In single-column indexes |
| the options are indeed redundant, but in multicolumn indexes they can be |
| useful. Consider a two-column index on <literal>(x, y)</literal>: this can |
| satisfy <literal>ORDER BY x, y</literal> if we scan forward, or |
| <literal>ORDER BY x DESC, y DESC</literal> if we scan backward. |
| But it might be that the application frequently needs to use |
| <literal>ORDER BY x ASC, y DESC</literal>. There is no way to get that |
| ordering from a plain index, but it is possible if the index is defined |
| as <literal>(x ASC, y DESC)</literal> or <literal>(x DESC, y ASC)</literal>. |
| </para> |
| |
| <para> |
| Obviously, indexes with non-default sort orderings are a fairly |
| specialized feature, but sometimes they can produce tremendous |
| speedups for certain queries. Whether it's worth maintaining such an |
| index depends on how often you use queries that require a special |
| sort ordering. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-bitmap-scans"> |
| <title>Combining Multiple Indexes</title> |
| |
| <indexterm zone="indexes-bitmap-scans"> |
| <primary>index</primary> |
| <secondary>combining multiple indexes</secondary> |
| </indexterm> |
| |
| <indexterm zone="indexes-bitmap-scans"> |
| <primary>bitmap scan</primary> |
| </indexterm> |
| |
| <para> |
| A single index scan can only use query clauses that use the index's |
| columns with operators of its operator class and are joined with |
| <literal>AND</literal>. For example, given an index on <literal>(a, b)</literal> |
| a query condition like <literal>WHERE a = 5 AND b = 6</literal> could |
| use the index, but a query like <literal>WHERE a = 5 OR b = 6</literal> could not |
| directly use the index. |
| </para> |
| |
| <para> |
| Fortunately, |
| <productname>PostgreSQL</productname> has the ability to combine multiple indexes |
| (including multiple uses of the same index) to handle cases that cannot |
| be implemented by single index scans. The system can form <literal>AND</literal> |
| and <literal>OR</literal> conditions across several index scans. For example, |
| a query like <literal>WHERE x = 42 OR x = 47 OR x = 53 OR x = 99</literal> |
| could be broken down into four separate scans of an index on <literal>x</literal>, |
| each scan using one of the query clauses. The results of these scans are |
| then ORed together to produce the result. Another example is that if we |
| have separate indexes on <literal>x</literal> and <literal>y</literal>, one possible |
| implementation of a query like <literal>WHERE x = 5 AND y = 6</literal> is to |
| use each index with the appropriate query clause and then AND together |
| the index results to identify the result rows. |
| </para> |
| |
| <para> |
| To combine multiple indexes, the system scans each needed index and |
| prepares a <firstterm>bitmap</firstterm> in memory giving the locations of |
| table rows that are reported as matching that index's conditions. |
| The bitmaps are then ANDed and ORed together as needed by the query. |
| Finally, the actual table rows are visited and returned. The table rows |
| are visited in physical order, because that is how the bitmap is laid |
| out; this means that any ordering of the original indexes is lost, and |
| so a separate sort step will be needed if the query has an <literal>ORDER |
| BY</literal> clause. For this reason, and because each additional index scan |
| adds extra time, the planner will sometimes choose to use a simple index |
| scan even though additional indexes are available that could have been |
| used as well. |
| </para> |
| |
| <para> |
| In all but the simplest applications, there are various combinations of |
| indexes that might be useful, and the database developer must make |
| trade-offs to decide which indexes to provide. Sometimes multicolumn |
| indexes are best, but sometimes it's better to create separate indexes |
| and rely on the index-combination feature. For example, if your |
| workload includes a mix of queries that sometimes involve only column |
| <literal>x</literal>, sometimes only column <literal>y</literal>, and sometimes both |
| columns, you might choose to create two separate indexes on |
| <literal>x</literal> and <literal>y</literal>, relying on index combination to |
| process the queries that use both columns. You could also create a |
| multicolumn index on <literal>(x, y)</literal>. This index would typically be |
| more efficient than index combination for queries involving both |
| columns, but as discussed in <xref linkend="indexes-multicolumn"/>, it |
| would be almost useless for queries involving only <literal>y</literal>, so it |
| should not be the only index. A combination of the multicolumn index |
| and a separate index on <literal>y</literal> would serve reasonably well. For |
| queries involving only <literal>x</literal>, the multicolumn index could be |
| used, though it would be larger and hence slower than an index on |
| <literal>x</literal> alone. The last alternative is to create all three |
| indexes, but this is probably only reasonable if the table is searched |
| much more often than it is updated and all three types of query are |
| common. If one of the types of query is much less common than the |
| others, you'd probably settle for creating just the two indexes that |
| best match the common types. |
| </para> |
| |
| </sect1> |
| |
| |
| <sect1 id="indexes-unique"> |
| <title>Unique Indexes</title> |
| |
| <indexterm zone="indexes-unique"> |
| <primary>index</primary> |
| <secondary>unique</secondary> |
| </indexterm> |
| |
| <para> |
| Indexes can also be used to enforce uniqueness of a column's value, |
| or the uniqueness of the combined values of more than one column. |
| <synopsis> |
| CREATE UNIQUE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <optional>, ...</optional>); |
| </synopsis> |
| Currently, only B-tree indexes can be declared unique. |
| </para> |
| |
| <para> |
| When an index is declared unique, multiple table rows with equal |
| indexed values are not allowed. Null values are not considered |
| equal. A multicolumn unique index will only reject cases where all |
| indexed columns are equal in multiple rows. |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> automatically creates a unique |
| index when a unique constraint or primary key is defined for a table. |
| The index covers the columns that make up the primary key or unique |
| constraint (a multicolumn index, if appropriate), and is the mechanism |
| that enforces the constraint. |
| </para> |
| |
| <note> |
| <para> |
| There's no need to manually |
| create indexes on unique columns; doing so would just duplicate |
| the automatically-created index. |
| </para> |
| </note> |
| </sect1> |
| |
| |
| <sect1 id="indexes-expressional"> |
| <title>Indexes on Expressions</title> |
| |
| <indexterm zone="indexes-expressional"> |
| <primary>index</primary> |
| <secondary sortas="expressions">on expressions</secondary> |
| </indexterm> |
| |
| <para> |
| An index column need not be just a column of the underlying table, |
| but can be a function or scalar expression computed from one or |
| more columns of the table. This feature is useful to obtain fast |
| access to tables based on the results of computations. |
| </para> |
| |
| <para> |
| For example, a common way to do case-insensitive comparisons is to |
| use the <function>lower</function> function: |
| <programlisting> |
| SELECT * FROM test1 WHERE lower(col1) = 'value'; |
| </programlisting> |
| This query can use an index if one has been |
| defined on the result of the <literal>lower(col1)</literal> |
| function: |
| <programlisting> |
| CREATE INDEX test1_lower_col1_idx ON test1 (lower(col1)); |
| </programlisting> |
| </para> |
| |
| <para> |
| If we were to declare this index <literal>UNIQUE</literal>, it would prevent |
| creation of rows whose <literal>col1</literal> values differ only in case, |
| as well as rows whose <literal>col1</literal> values are actually identical. |
| Thus, indexes on expressions can be used to enforce constraints that |
| are not definable as simple unique constraints. |
| </para> |
| |
| <para> |
| As another example, if one often does queries like: |
| <programlisting> |
| SELECT * FROM people WHERE (first_name || ' ' || last_name) = 'John Smith'; |
| </programlisting> |
| then it might be worth creating an index like this: |
| <programlisting> |
| CREATE INDEX people_names ON people ((first_name || ' ' || last_name)); |
| </programlisting> |
| </para> |
| |
| <para> |
| The syntax of the <command>CREATE INDEX</command> command normally requires |
| writing parentheses around index expressions, as shown in the second |
| example. The parentheses can be omitted when the expression is just |
| a function call, as in the first example. |
| </para> |
| |
| <para> |
| Index expressions are relatively expensive to maintain, because the |
| derived expression(s) must be computed for each row insertion |
| and non-HOT update. However, the index expressions are |
| <emphasis>not</emphasis> recomputed during an indexed search, since they are |
| already stored in the index. In both examples above, the system |
| sees the query as just <literal>WHERE indexedcolumn = 'constant'</literal> |
| and so the speed of the search is equivalent to any other simple index |
| query. Thus, indexes on expressions are useful when retrieval speed |
| is more important than insertion and update speed. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-partial"> |
| <title>Partial Indexes</title> |
| |
| <indexterm zone="indexes-partial"> |
| <primary>index</primary> |
| <secondary>partial</secondary> |
| </indexterm> |
| |
| <para> |
| A <firstterm>partial index</firstterm> is an index built over a |
| subset of a table; the subset is defined by a conditional |
| expression (called the <firstterm>predicate</firstterm> of the |
| partial index). The index contains entries only for those table |
| rows that satisfy the predicate. Partial indexes are a specialized |
| feature, but there are several situations in which they are useful. |
| </para> |
| |
| <para> |
| One major reason for using a partial index is to avoid indexing common |
| values. Since a query searching for a common value (one that |
| accounts for more than a few percent of all the table rows) will not |
| use the index anyway, there is no point in keeping those rows in the |
| index at all. This reduces the size of the index, which will speed |
| up those queries that do use the index. It will also speed up many table |
| update operations because the index does not need to be |
| updated in all cases. <xref linkend="indexes-partial-ex1"/> shows a |
| possible application of this idea. |
| </para> |
| |
| <example id="indexes-partial-ex1"> |
| <title>Setting up a Partial Index to Exclude Common Values</title> |
| |
| <para> |
| Suppose you are storing web server access logs in a database. |
| Most accesses originate from the IP address range of your organization but |
| some are from elsewhere (say, employees on dial-up connections). |
| If your searches by IP are primarily for outside accesses, |
| you probably do not need to index the IP range that corresponds to your |
| organization's subnet. |
| </para> |
| |
| <para> |
| Assume a table like this: |
| <programlisting> |
| CREATE TABLE access_log ( |
| url varchar, |
| client_ip inet, |
| ... |
| ); |
| </programlisting> |
| </para> |
| |
| <para> |
| To create a partial index that suits our example, use a command |
| such as this: |
| <programlisting> |
| CREATE INDEX access_log_client_ip_ix ON access_log (client_ip) |
| WHERE NOT (client_ip > inet '192.168.100.0' AND |
| client_ip < inet '192.168.100.255'); |
| </programlisting> |
| </para> |
| |
| <para> |
| A typical query that can use this index would be: |
| <programlisting> |
| SELECT * |
| FROM access_log |
| WHERE url = '/index.html' AND client_ip = inet '212.78.10.32'; |
| </programlisting> |
| Here the query's IP address is covered by the partial index. The |
| following query cannot use the partial index, as it uses an IP address |
| that is excluded from the index: |
| <programlisting> |
| SELECT * |
| FROM access_log |
| WHERE url = '/index.html' AND client_ip = inet '192.168.100.23'; |
| </programlisting> |
| </para> |
| |
| <para> |
| Observe that this kind of partial index requires that the common |
| values be predetermined, so such partial indexes are best used for |
| data distributions that do not change. Such indexes can be recreated |
| occasionally to adjust for new data distributions, but this adds |
| maintenance effort. |
| </para> |
| </example> |
| |
| <para> |
| Another possible use for a partial index is to exclude values from the |
| index that the |
| typical query workload is not interested in; this is shown in <xref |
| linkend="indexes-partial-ex2"/>. This results in the same |
| advantages as listed above, but it prevents the |
| <quote>uninteresting</quote> values from being accessed via that |
| index, even if an index scan might be profitable in that |
| case. Obviously, setting up partial indexes for this kind of |
| scenario will require a lot of care and experimentation. |
| </para> |
| |
| <example id="indexes-partial-ex2"> |
| <title>Setting up a Partial Index to Exclude Uninteresting Values</title> |
| |
| <para> |
| If you have a table that contains both billed and unbilled orders, |
| where the unbilled orders take up a small fraction of the total |
| table and yet those are the most-accessed rows, you can improve |
| performance by creating an index on just the unbilled rows. The |
| command to create the index would look like this: |
| <programlisting> |
| CREATE INDEX orders_unbilled_index ON orders (order_nr) |
| WHERE billed is not true; |
| </programlisting> |
| </para> |
| |
| <para> |
| A possible query to use this index would be: |
| <programlisting> |
| SELECT * FROM orders WHERE billed is not true AND order_nr < 10000; |
| </programlisting> |
| However, the index can also be used in queries that do not involve |
| <structfield>order_nr</structfield> at all, e.g.: |
| <programlisting> |
| SELECT * FROM orders WHERE billed is not true AND amount > 5000.00; |
| </programlisting> |
| This is not as efficient as a partial index on the |
| <structfield>amount</structfield> column would be, since the system has to |
| scan the entire index. Yet, if there are relatively few unbilled |
| orders, using this partial index just to find the unbilled orders |
| could be a win. |
| </para> |
| |
| <para> |
| Note that this query cannot use this index: |
| <programlisting> |
| SELECT * FROM orders WHERE order_nr = 3501; |
| </programlisting> |
| The order 3501 might be among the billed or unbilled |
| orders. |
| </para> |
| </example> |
| |
| <para> |
| <xref linkend="indexes-partial-ex2"/> also illustrates that the |
| indexed column and the column used in the predicate do not need to |
| match. <productname>PostgreSQL</productname> supports partial |
| indexes with arbitrary predicates, so long as only columns of the |
| table being indexed are involved. However, keep in mind that the |
| predicate must match the conditions used in the queries that |
| are supposed to benefit from the index. To be precise, a partial |
| index can be used in a query only if the system can recognize that |
| the <literal>WHERE</literal> condition of the query mathematically implies |
| the predicate of the index. |
| <productname>PostgreSQL</productname> does not have a sophisticated |
| theorem prover that can recognize mathematically equivalent |
| expressions that are written in different forms. (Not |
| only is such a general theorem prover extremely difficult to |
| create, it would probably be too slow to be of any real use.) |
| The system can recognize simple inequality implications, for example |
| <quote>x < 1</quote> implies <quote>x < 2</quote>; otherwise |
| the predicate condition must exactly match part of the query's |
| <literal>WHERE</literal> condition |
| or the index will not be recognized as usable. Matching takes |
| place at query planning time, not at run time. As a result, |
| parameterized query clauses do not work with a partial index. For |
| example a prepared query with a parameter might specify |
| <quote>x < ?</quote> which will never imply |
| <quote>x < 2</quote> for all possible values of the parameter. |
| </para> |
| |
| <para> |
| A third possible use for partial indexes does not require the |
| index to be used in queries at all. The idea here is to create |
| a unique index over a subset of a table, as in <xref |
| linkend="indexes-partial-ex3"/>. This enforces uniqueness |
| among the rows that satisfy the index predicate, without constraining |
| those that do not. |
| </para> |
| |
| <example id="indexes-partial-ex3"> |
| <title>Setting up a Partial Unique Index</title> |
| |
| <para> |
| Suppose that we have a table describing test outcomes. We wish |
| to ensure that there is only one <quote>successful</quote> entry for |
| a given subject and target combination, but there might be any number of |
| <quote>unsuccessful</quote> entries. Here is one way to do it: |
| <programlisting> |
| CREATE TABLE tests ( |
| subject text, |
| target text, |
| success boolean, |
| ... |
| ); |
| |
| CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) |
| WHERE success; |
| </programlisting> |
| This is a particularly efficient approach when there are few |
| successful tests and many unsuccessful ones. It is also possible to |
| allow only one null in a column by creating a unique partial index |
| with an <literal>IS NULL</literal> restriction. |
| </para> |
| |
| </example> |
| |
| <para> |
| Finally, a partial index can also be used to override the system's |
| query plan choices. Also, data sets with peculiar |
| distributions might cause the system to use an index when it really |
| should not. In that case the index can be set up so that it is not |
| available for the offending query. Normally, |
| <productname>PostgreSQL</productname> makes reasonable choices about index |
| usage (e.g., it avoids them when retrieving common values, so the |
| earlier example really only saves index size, it is not required to |
| avoid index usage), and grossly incorrect plan choices are cause |
| for a bug report. |
| </para> |
| |
| <para> |
| Keep in mind that setting up a partial index indicates that you |
| know at least as much as the query planner knows, in particular you |
| know when an index might be profitable. Forming this knowledge |
| requires experience and understanding of how indexes in |
| <productname>PostgreSQL</productname> work. In most cases, the |
| advantage of a partial index over a regular index will be minimal. |
| There are cases where they are quite counterproductive, as in <xref |
| linkend="indexes-partial-ex4"/>. |
| </para> |
| |
| <example id="indexes-partial-ex4"> |
| <title>Do Not Use Partial Indexes as a Substitute for Partitioning</title> |
| |
| <para> |
| You might be tempted to create a large set of non-overlapping partial |
| indexes, for example |
| |
| <programlisting> |
| CREATE INDEX mytable_cat_1 ON mytable (data) WHERE category = 1; |
| CREATE INDEX mytable_cat_2 ON mytable (data) WHERE category = 2; |
| CREATE INDEX mytable_cat_3 ON mytable (data) WHERE category = 3; |
| ... |
| CREATE INDEX mytable_cat_<replaceable>N</replaceable> ON mytable (data) WHERE category = <replaceable>N</replaceable>; |
| </programlisting> |
| |
| This is a bad idea! Almost certainly, you'll be better off with a |
| single non-partial index, declared like |
| |
| <programlisting> |
| CREATE INDEX mytable_cat_data ON mytable (category, data); |
| </programlisting> |
| |
| (Put the category column first, for the reasons described in |
| <xref linkend="indexes-multicolumn"/>.) While a search in this larger |
| index might have to descend through a couple more tree levels than a |
| search in a smaller index, that's almost certainly going to be cheaper |
| than the planner effort needed to select the appropriate one of the |
| partial indexes. The core of the problem is that the system does not |
| understand the relationship among the partial indexes, and will |
| laboriously test each one to see if it's applicable to the current |
| query. |
| </para> |
| |
| <para> |
| If your table is large enough that a single index really is a bad idea, |
| you should look into using partitioning instead (see |
| <xref linkend="ddl-partitioning"/>). With that mechanism, the system |
| does understand that the tables and indexes are non-overlapping, so |
| far better performance is possible. |
| </para> |
| </example> |
| |
| <para> |
| More information about partial indexes can be found in <xref |
| linkend="ston89b"/>, <xref linkend="olson93"/>, and <xref |
| linkend="seshadri95"/>. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-index-only-scans"> |
| <title>Index-Only Scans and Covering Indexes</title> |
| |
| <indexterm zone="indexes-index-only-scans"> |
| <primary>index</primary> |
| <secondary>index-only scans</secondary> |
| </indexterm> |
| <indexterm zone="indexes-index-only-scans"> |
| <primary>index-only scan</primary> |
| </indexterm> |
| <indexterm zone="indexes-index-only-scans"> |
| <primary>index</primary> |
| <secondary>covering</secondary> |
| </indexterm> |
| <indexterm zone="indexes-index-only-scans"> |
| <primary>covering index</primary> |
| </indexterm> |
| |
| <para> |
| All indexes in <productname>PostgreSQL</productname> |
| are <firstterm>secondary</firstterm> indexes, meaning that each index is |
| stored separately from the table's main data area (which is called the |
| table's <firstterm>heap</firstterm> |
| in <productname>PostgreSQL</productname> terminology). This means that |
| in an ordinary index scan, each row retrieval requires fetching data from |
| both the index and the heap. Furthermore, while the index entries that |
| match a given indexable <literal>WHERE</literal> condition are usually |
| close together in the index, the table rows they reference might be |
| anywhere in the heap. The heap-access portion of an index scan thus |
| involves a lot of random access into the heap, which can be slow, |
| particularly on traditional rotating media. (As described in |
| <xref linkend="indexes-bitmap-scans"/>, bitmap scans try to alleviate |
| this cost by doing the heap accesses in sorted order, but that only goes |
| so far.) |
| </para> |
| |
| <para> |
| To solve this performance problem, <productname>PostgreSQL</productname> |
| supports <firstterm>index-only scans</firstterm>, which can answer |
| queries from an index alone without any heap access. The basic idea is |
| to return values directly out of each index entry instead of consulting |
| the associated heap entry. There are two fundamental restrictions on |
| when this method can be used: |
| |
| <orderedlist> |
| <listitem> |
| <para> |
| The index type must support index-only scans. B-tree indexes always |
| do. GiST and SP-GiST indexes support index-only scans for some |
| operator classes but not others. Other index types have no support. |
| The underlying requirement is that the index must physically store, or |
| else be able to reconstruct, the original data value for each index |
| entry. As a counterexample, GIN indexes cannot support index-only |
| scans because each index entry typically holds only part of the |
| original data value. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The query must reference only columns stored in the index. For |
| example, given an index on columns <literal>x</literal> |
| and <literal>y</literal> of a table that also has a |
| column <literal>z</literal>, these queries could use index-only scans: |
| <programlisting> |
| SELECT x, y FROM tab WHERE x = 'key'; |
| SELECT x FROM tab WHERE x = 'key' AND y < 42; |
| </programlisting> |
| but these queries could not: |
| <programlisting> |
| SELECT x, z FROM tab WHERE x = 'key'; |
| SELECT x FROM tab WHERE x = 'key' AND z < 42; |
| </programlisting> |
| (Expression indexes and partial indexes complicate this rule, |
| as discussed below.) |
| </para> |
| </listitem> |
| </orderedlist> |
| </para> |
| |
| <para> |
| If these two fundamental requirements are met, then all the data values |
| required by the query are available from the index, so an index-only scan |
| is physically possible. But there is an additional requirement for any |
| table scan in <productname>PostgreSQL</productname>: it must verify that |
| each retrieved row be <quote>visible</quote> to the query's MVCC |
| snapshot, as discussed in <xref linkend="mvcc"/>. Visibility information |
| is not stored in index entries, only in heap entries; so at first glance |
| it would seem that every row retrieval would require a heap access |
| anyway. And this is indeed the case, if the table row has been modified |
| recently. However, for seldom-changing data there is a way around this |
| problem. <productname>PostgreSQL</productname> tracks, for each page in |
| a table's heap, whether all rows stored in that page are old enough to be |
| visible to all current and future transactions. This information is |
| stored in a bit in the table's <firstterm>visibility map</firstterm>. An |
| index-only scan, after finding a candidate index entry, checks the |
| visibility map bit for the corresponding heap page. If it's set, the row |
| is known visible and so the data can be returned with no further work. |
| If it's not set, the heap entry must be visited to find out whether it's |
| visible, so no performance advantage is gained over a standard index |
| scan. Even in the successful case, this approach trades visibility map |
| accesses for heap accesses; but since the visibility map is four orders |
| of magnitude smaller than the heap it describes, far less physical I/O is |
| needed to access it. In most situations the visibility map remains |
| cached in memory all the time. |
| </para> |
| |
| <para> |
| In short, while an index-only scan is possible given the two fundamental |
| requirements, it will be a win only if a significant fraction of the |
| table's heap pages have their all-visible map bits set. But tables in |
| which a large fraction of the rows are unchanging are common enough to |
| make this type of scan very useful in practice. |
| </para> |
| |
| <para> |
| <indexterm> |
| <primary><literal>INCLUDE</literal></primary> |
| <secondary>in index definitions</secondary> |
| </indexterm> |
| To make effective use of the index-only scan feature, you might choose to |
| create a <firstterm>covering index</firstterm>, which is an index |
| specifically designed to include the columns needed by a particular |
| type of query that you run frequently. Since queries typically need to |
| retrieve more columns than just the ones they search |
| on, <productname>PostgreSQL</productname> allows you to create an index |
| in which some columns are just <quote>payload</quote> and are not part |
| of the search key. This is done by adding an <literal>INCLUDE</literal> |
| clause listing the extra columns. For example, if you commonly run |
| queries like |
| <programlisting> |
| SELECT y FROM tab WHERE x = 'key'; |
| </programlisting> |
| the traditional approach to speeding up such queries would be to create |
| an index on <literal>x</literal> only. However, an index defined as |
| <programlisting> |
| CREATE INDEX tab_x_y ON tab(x) INCLUDE (y); |
| </programlisting> |
| could handle these queries as index-only scans, |
| because <literal>y</literal> can be obtained from the index without |
| visiting the heap. |
| </para> |
| |
| <para> |
| Because column <literal>y</literal> is not part of the index's search |
| key, it does not have to be of a data type that the index can handle; |
| it's merely stored in the index and is not interpreted by the index |
| machinery. Also, if the index is a unique index, that is |
| <programlisting> |
| CREATE UNIQUE INDEX tab_x_y ON tab(x) INCLUDE (y); |
| </programlisting> |
| the uniqueness condition applies to just column <literal>x</literal>, |
| not to the combination of <literal>x</literal> and <literal>y</literal>. |
| (An <literal>INCLUDE</literal> clause can also be written |
| in <literal>UNIQUE</literal> and <literal>PRIMARY KEY</literal> |
| constraints, providing alternative syntax for setting up an index like |
| this.) |
| </para> |
| |
| <para> |
| It's wise to be conservative about adding non-key payload columns to an |
| index, especially wide columns. If an index tuple exceeds the |
| maximum size allowed for the index type, data insertion will fail. |
| In any case, non-key columns duplicate data from the index's table |
| and bloat the size of the index, thus potentially slowing searches. |
| And remember that there is little point in including payload columns in an |
| index unless the table changes slowly enough that an index-only scan is |
| likely to not need to access the heap. If the heap tuple must be visited |
| anyway, it costs nothing more to get the column's value from there. |
| Other restrictions are that expressions are not currently supported as |
| included columns, and that only B-tree, GiST and SP-GiST indexes currently |
| support included columns. |
| </para> |
| |
| <para> |
| Before <productname>PostgreSQL</productname> had |
| the <literal>INCLUDE</literal> feature, people sometimes made covering |
| indexes by writing the payload columns as ordinary index columns, |
| that is writing |
| <programlisting> |
| CREATE INDEX tab_x_y ON tab(x, y); |
| </programlisting> |
| even though they had no intention of ever using <literal>y</literal> as |
| part of a <literal>WHERE</literal> clause. This works fine as long as |
| the extra columns are trailing columns; making them be leading columns is |
| unwise for the reasons explained in <xref linkend="indexes-multicolumn"/>. |
| However, this method doesn't support the case where you want the index to |
| enforce uniqueness on the key column(s). |
| </para> |
| |
| <para> |
| <firstterm>Suffix truncation</firstterm> always removes non-key |
| columns from upper B-Tree levels. As payload columns, they are |
| never used to guide index scans. The truncation process also |
| removes one or more trailing key column(s) when the remaining |
| prefix of key column(s) happens to be sufficient to describe tuples |
| on the lowest B-Tree level. In practice, covering indexes without |
| an <literal>INCLUDE</literal> clause often avoid storing columns |
| that are effectively payload in the upper levels. However, |
| explicitly defining payload columns as non-key columns |
| <emphasis>reliably</emphasis> keeps the tuples in upper levels |
| small. |
| </para> |
| |
| <para> |
| In principle, index-only scans can be used with expression indexes. |
| For example, given an index on <literal>f(x)</literal> |
| where <literal>x</literal> is a table column, it should be possible to |
| execute |
| <programlisting> |
| SELECT f(x) FROM tab WHERE f(x) < 1; |
| </programlisting> |
| as an index-only scan; and this is very attractive |
| if <literal>f()</literal> is an expensive-to-compute function. |
| However, <productname>PostgreSQL</productname>'s planner is currently not |
| very smart about such cases. It considers a query to be potentially |
| executable by index-only scan only when all <emphasis>columns</emphasis> |
| needed by the query are available from the index. In this |
| example, <literal>x</literal> is not needed except in the |
| context <literal>f(x)</literal>, but the planner does not notice that and |
| concludes that an index-only scan is not possible. If an index-only scan |
| seems sufficiently worthwhile, this can be worked around by |
| adding <literal>x</literal> as an included column, for example |
| <programlisting> |
| CREATE INDEX tab_f_x ON tab (f(x)) INCLUDE (x); |
| </programlisting> |
| An additional caveat, if the goal is to avoid |
| recalculating <literal>f(x)</literal>, is that the planner won't |
| necessarily match uses of <literal>f(x)</literal> that aren't in |
| indexable <literal>WHERE</literal> clauses to the index column. It will |
| usually get this right in simple queries such as shown above, but not in |
| queries that involve joins. These deficiencies may be remedied in future |
| versions of <productname>PostgreSQL</productname>. |
| </para> |
| |
| <para> |
| Partial indexes also have interesting interactions with index-only scans. |
| Consider the partial index shown in <xref linkend="indexes-partial-ex3"/>: |
| <programlisting> |
| CREATE UNIQUE INDEX tests_success_constraint ON tests (subject, target) |
| WHERE success; |
| </programlisting> |
| In principle, we could do an index-only scan on this index to satisfy a |
| query like |
| <programlisting> |
| SELECT target FROM tests WHERE subject = 'some-subject' AND success; |
| </programlisting> |
| But there's a problem: the <literal>WHERE</literal> clause refers |
| to <literal>success</literal> which is not available as a result column |
| of the index. Nonetheless, an index-only scan is possible because the |
| plan does not need to recheck that part of the <literal>WHERE</literal> |
| clause at run time: all entries found in the index necessarily |
| have <literal>success = true</literal> so this need not be explicitly |
| checked in the plan. <productname>PostgreSQL</productname> versions 9.6 |
| and later will recognize such cases and allow index-only scans to be |
| generated, but older versions will not. |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-opclass"> |
| <title>Operator Classes and Operator Families</title> |
| |
| <indexterm zone="indexes-opclass"> |
| <primary>operator class</primary> |
| </indexterm> |
| |
| <indexterm zone="indexes-opclass"> |
| <primary>operator family</primary> |
| </indexterm> |
| |
| <para> |
| An index definition can specify an <firstterm>operator |
| class</firstterm> for each column of an index. |
| <synopsis> |
| CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> (<replaceable>column</replaceable> <replaceable>opclass</replaceable> [ ( <replaceable>opclass_options</replaceable> ) ] <optional><replaceable>sort options</replaceable></optional> <optional>, ...</optional>); |
| </synopsis> |
| The operator class identifies the operators to be used by the index |
| for that column. For example, a B-tree index on the type <type>int4</type> |
| would use the <literal>int4_ops</literal> class; this operator |
| class includes comparison functions for values of type <type>int4</type>. |
| In practice the default operator class for the column's data type is |
| usually sufficient. The main reason for having operator classes is |
| that for some data types, there could be more than one meaningful |
| index behavior. For example, we might want to sort a complex-number data |
| type either by absolute value or by real part. We could do this by |
| defining two operator classes for the data type and then selecting |
| the proper class when making an index. The operator class determines |
| the basic sort ordering (which can then be modified by adding sort options |
| <literal>COLLATE</literal>, |
| <literal>ASC</literal>/<literal>DESC</literal> and/or |
| <literal>NULLS FIRST</literal>/<literal>NULLS LAST</literal>). |
| </para> |
| |
| <para> |
| There are also some built-in operator classes besides the default ones: |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| The operator classes <literal>text_pattern_ops</literal>, |
| <literal>varchar_pattern_ops</literal>, and |
| <literal>bpchar_pattern_ops</literal> support B-tree indexes on |
| the types <type>text</type>, <type>varchar</type>, and |
| <type>char</type> respectively. The |
| difference from the default operator classes is that the values |
| are compared strictly character by character rather than |
| according to the locale-specific collation rules. This makes |
| these operator classes suitable for use by queries involving |
| pattern matching expressions (<literal>LIKE</literal> or POSIX |
| regular expressions) when the database does not use the standard |
| <quote>C</quote> locale. As an example, you might index a |
| <type>varchar</type> column like this: |
| <programlisting> |
| CREATE INDEX test_index ON test_table (col varchar_pattern_ops); |
| </programlisting> |
| Note that you should also create an index with the default operator |
| class if you want queries involving ordinary <literal><</literal>, |
| <literal><=</literal>, <literal>></literal>, or <literal>>=</literal> comparisons |
| to use an index. Such queries cannot use the |
| <literal><replaceable>xxx</replaceable>_pattern_ops</literal> |
| operator classes. (Ordinary equality comparisons can use these |
| operator classes, however.) It is possible to create multiple |
| indexes on the same column with different operator classes. |
| If you do use the C locale, you do not need the |
| <literal><replaceable>xxx</replaceable>_pattern_ops</literal> |
| operator classes, because an index with the default operator class |
| is usable for pattern-matching queries in the C locale. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| The following query shows all defined operator classes: |
| |
| <programlisting> |
| SELECT am.amname AS index_method, |
| opc.opcname AS opclass_name, |
| opc.opcintype::regtype AS indexed_type, |
| opc.opcdefault AS is_default |
| FROM pg_am am, pg_opclass opc |
| WHERE opc.opcmethod = am.oid |
| ORDER BY index_method, opclass_name; |
| </programlisting> |
| </para> |
| |
| <para> |
| An operator class is actually just a subset of a larger structure called an |
| <firstterm>operator family</firstterm>. In cases where several data types have |
| similar behaviors, it is frequently useful to define cross-data-type |
| operators and allow these to work with indexes. To do this, the operator |
| classes for each of the types must be grouped into the same operator |
| family. The cross-type operators are members of the family, but are not |
| associated with any single class within the family. |
| </para> |
| |
| <para> |
| This expanded version of the previous query shows the operator family |
| each operator class belongs to: |
| <programlisting> |
| SELECT am.amname AS index_method, |
| opc.opcname AS opclass_name, |
| opf.opfname AS opfamily_name, |
| opc.opcintype::regtype AS indexed_type, |
| opc.opcdefault AS is_default |
| FROM pg_am am, pg_opclass opc, pg_opfamily opf |
| WHERE opc.opcmethod = am.oid AND |
| opc.opcfamily = opf.oid |
| ORDER BY index_method, opclass_name; |
| </programlisting> |
| </para> |
| |
| <para> |
| This query shows all defined operator families and all |
| the operators included in each family: |
| <programlisting> |
| SELECT am.amname AS index_method, |
| opf.opfname AS opfamily_name, |
| amop.amopopr::regoperator AS opfamily_operator |
| FROM pg_am am, pg_opfamily opf, pg_amop amop |
| WHERE opf.opfmethod = am.oid AND |
| amop.amopfamily = opf.oid |
| ORDER BY index_method, opfamily_name, opfamily_operator; |
| </programlisting> |
| </para> |
| |
| <tip> |
| <para> |
| <xref linkend="app-psql"/> has |
| commands <command>\dAc</command>, <command>\dAf</command>, |
| and <command>\dAo</command>, which provide slightly more sophisticated |
| versions of these queries. |
| </para> |
| </tip> |
| </sect1> |
| |
| |
| <sect1 id="indexes-collations"> |
| <title>Indexes and Collations</title> |
| |
| <para> |
| An index can support only one collation per index column. |
| If multiple collations are of interest, multiple indexes may be needed. |
| </para> |
| |
| <para> |
| Consider these statements: |
| <programlisting> |
| CREATE TABLE test1c ( |
| id integer, |
| content varchar COLLATE "x" |
| ); |
| |
| CREATE INDEX test1c_content_index ON test1c (content); |
| </programlisting> |
| The index automatically uses the collation of the |
| underlying column. So a query of the form |
| <programlisting> |
| SELECT * FROM test1c WHERE content > <replaceable>constant</replaceable>; |
| </programlisting> |
| could use the index, because the comparison will by default use the |
| collation of the column. However, this index cannot accelerate queries |
| that involve some other collation. So if queries of the form, say, |
| <programlisting> |
| SELECT * FROM test1c WHERE content > <replaceable>constant</replaceable> COLLATE "y"; |
| </programlisting> |
| are also of interest, an additional index could be created that supports |
| the <literal>"y"</literal> collation, like this: |
| <programlisting> |
| CREATE INDEX test1c_content_y_index ON test1c (content COLLATE "y"); |
| </programlisting> |
| </para> |
| </sect1> |
| |
| |
| <sect1 id="indexes-examine"> |
| <title>Examining Index Usage</title> |
| |
| <indexterm zone="indexes-examine"> |
| <primary>index</primary> |
| <secondary>examining usage</secondary> |
| </indexterm> |
| |
| <para> |
| Although indexes in <productname>PostgreSQL</productname> do not need |
| maintenance or tuning, it is still important to check |
| which indexes are actually used by the real-life query workload. |
| Examining index usage for an individual query is done with the |
| <xref linkend="sql-explain"/> |
| command; its application for this purpose is |
| illustrated in <xref linkend="using-explain"/>. |
| It is also possible to gather overall statistics about index usage |
| in a running server, as described in <xref linkend="monitoring-stats"/>. |
| </para> |
| |
| <para> |
| It is difficult to formulate a general procedure for determining |
| which indexes to create. There are a number of typical cases that |
| have been shown in the examples throughout the previous sections. |
| A good deal of experimentation is often necessary. |
| The rest of this section gives some tips for that: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| Always run <xref linkend="sql-analyze"/> |
| first. This command |
| collects statistics about the distribution of the values in the |
| table. This information is required to estimate the number of rows |
| returned by a query, which is needed by the planner to assign |
| realistic costs to each possible query plan. In absence of any |
| real statistics, some default values are assumed, which are |
| almost certain to be inaccurate. Examining an application's |
| index usage without having run <command>ANALYZE</command> is |
| therefore a lost cause. |
| See <xref linkend="vacuum-for-statistics"/> |
| and <xref linkend="autovacuum"/> for more information. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Use real data for experimentation. Using test data for setting |
| up indexes will tell you what indexes you need for the test data, |
| but that is all. |
| </para> |
| |
| <para> |
| It is especially fatal to use very small test data sets. |
| While selecting 1000 out of 100000 rows could be a candidate for |
| an index, selecting 1 out of 100 rows will hardly be, because the |
| 100 rows probably fit within a single disk page, and there |
| is no plan that can beat sequentially fetching 1 disk page. |
| </para> |
| |
| <para> |
| Also be careful when making up test data, which is often |
| unavoidable when the application is not yet in production. |
| Values that are very similar, completely random, or inserted in |
| sorted order will skew the statistics away from the distribution |
| that real data would have. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| When indexes are not used, it can be useful for testing to force |
| their use. There are run-time parameters that can turn off |
| various plan types (see <xref linkend="runtime-config-query-enable"/>). |
| For instance, turning off sequential scans |
| (<varname>enable_seqscan</varname>) and nested-loop joins |
| (<varname>enable_nestloop</varname>), which are the most basic plans, |
| will force the system to use a different plan. If the system |
| still chooses a sequential scan or nested-loop join then there is |
| probably a more fundamental reason why the index is not being |
| used; for example, the query condition does not match the index. |
| (What kind of query can use what kind of index is explained in |
| the previous sections.) |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| If forcing index usage does use the index, then there are two |
| possibilities: Either the system is right and using the index is |
| indeed not appropriate, or the cost estimates of the query plans |
| are not reflecting reality. So you should time your query with |
| and without indexes. The <command>EXPLAIN ANALYZE</command> |
| command can be useful here. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| If it turns out that the cost estimates are wrong, there are, |
| again, two possibilities. The total cost is computed from the |
| per-row costs of each plan node times the selectivity estimate of |
| the plan node. The costs estimated for the plan nodes can be adjusted |
| via run-time parameters (described in <xref |
| linkend="runtime-config-query-constants"/>). |
| An inaccurate selectivity estimate is due to |
| insufficient statistics. It might be possible to improve this by |
| tuning the statistics-gathering parameters (see |
| <xref linkend="sql-altertable"/>). |
| </para> |
| |
| <para> |
| If you do not succeed in adjusting the costs to be more |
| appropriate, then you might have to resort to forcing index usage |
| explicitly. You might also want to contact the |
| <productname>PostgreSQL</productname> developers to examine the issue. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </sect1> |
| </chapter> |