| <!-- doc/src/sgml/citext.sgml --> |
| |
| <sect1 id="citext" xreflabel="citext"> |
| <title>citext</title> |
| |
| <indexterm zone="citext"> |
| <primary>citext</primary> |
| </indexterm> |
| |
| <para> |
| The <filename>citext</filename> module provides a case-insensitive |
| character string type, <type>citext</type>. Essentially, it internally calls |
| <function>lower</function> when comparing values. Otherwise, it behaves almost |
| exactly like <type>text</type>. |
| </para> |
| |
| <tip> |
| <para> |
| Consider using <firstterm>nondeterministic collations</firstterm> (see |
| <xref linkend="collation-nondeterministic"/>) instead of this module. They |
| can be used for case-insensitive comparisons, accent-insensitive |
| comparisons, and other combinations, and they handle more Unicode special |
| cases correctly. |
| </para> |
| </tip> |
| |
| <para> |
| This module is considered <quote>trusted</quote>, that is, it can be |
| installed by non-superusers who have <literal>CREATE</literal> privilege |
| on the current database. |
| </para> |
| |
| <sect2> |
| <title>Rationale</title> |
| |
| <para> |
| The standard approach to doing case-insensitive matches |
| in <productname>PostgreSQL</productname> has been to use the <function>lower</function> |
| function when comparing values, for example |
| |
| <programlisting> |
| SELECT * FROM tab WHERE lower(col) = LOWER(?); |
| </programlisting> |
| </para> |
| |
| <para> |
| This works reasonably well, but has a number of drawbacks: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| It makes your SQL statements verbose, and you always have to remember to |
| use <function>lower</function> on both the column and the query value. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| It won't use an index, unless you create a functional index using |
| <function>lower</function>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| If you declare a column as <literal>UNIQUE</literal> or <literal>PRIMARY |
| KEY</literal>, the implicitly generated index is case-sensitive. So it's |
| useless for case-insensitive searches, and it won't enforce |
| uniqueness case-insensitively. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| The <type>citext</type> data type allows you to eliminate calls |
| to <function>lower</function> in SQL queries, and allows a primary key to |
| be case-insensitive. <type>citext</type> is locale-aware, just |
| like <type>text</type>, which means that the matching of upper case and |
| lower case characters is dependent on the rules of |
| the database's <literal>LC_CTYPE</literal> setting. Again, this behavior is |
| identical to the use of <function>lower</function> in queries. But because it's |
| done transparently by the data type, you don't have to remember to do |
| anything special in your queries. |
| </para> |
| |
| </sect2> |
| |
| <sect2> |
| <title>How to Use It</title> |
| |
| <para> |
| Here's a simple example of usage: |
| |
| <programlisting> |
| CREATE TABLE users ( |
| nick CITEXT PRIMARY KEY, |
| pass TEXT NOT NULL |
| ); |
| |
| INSERT INTO users VALUES ( 'larry', sha256(random()::text::bytea) ); |
| INSERT INTO users VALUES ( 'Tom', sha256(random()::text::bytea) ); |
| INSERT INTO users VALUES ( 'Damian', sha256(random()::text::bytea) ); |
| INSERT INTO users VALUES ( 'NEAL', sha256(random()::text::bytea) ); |
| INSERT INTO users VALUES ( 'Bjørn', sha256(random()::text::bytea) ); |
| |
| SELECT * FROM users WHERE nick = 'Larry'; |
| </programlisting> |
| |
| The <command>SELECT</command> statement will return one tuple, even though |
| the <structfield>nick</structfield> column was set to <literal>larry</literal> and the query |
| was for <literal>Larry</literal>. |
| </para> |
| </sect2> |
| |
| <sect2> |
| <title>String Comparison Behavior</title> |
| |
| <para> |
| <type>citext</type> performs comparisons by converting each string to lower |
| case (as though <function>lower</function> were called) and then comparing the |
| results normally. Thus, for example, two strings are considered equal |
| if <function>lower</function> would produce identical results for them. |
| </para> |
| |
| <para> |
| In order to emulate a case-insensitive collation as closely as possible, |
| there are <type>citext</type>-specific versions of a number of string-processing |
| operators and functions. So, for example, the regular expression |
| operators <literal>~</literal> and <literal>~*</literal> exhibit the same behavior when |
| applied to <type>citext</type>: they both match case-insensitively. |
| The same is true |
| for <literal>!~</literal> and <literal>!~*</literal>, as well as for the |
| <literal>LIKE</literal> operators <literal>~~</literal> and <literal>~~*</literal>, and |
| <literal>!~~</literal> and <literal>!~~*</literal>. If you'd like to match |
| case-sensitively, you can cast the operator's arguments to <type>text</type>. |
| </para> |
| |
| <para> |
| Similarly, all of the following functions perform matching |
| case-insensitively if their arguments are <type>citext</type>: |
| </para> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <function>regexp_match()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>regexp_matches()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>regexp_replace()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>regexp_split_to_array()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>regexp_split_to_table()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>replace()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>split_part()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>strpos()</function> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <function>translate()</function> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| For the regexp functions, if you want to match case-sensitively, you can |
| specify the <quote>c</quote> flag to force a case-sensitive match. Otherwise, |
| you must cast to <type>text</type> before using one of these functions if |
| you want case-sensitive behavior. |
| </para> |
| |
| </sect2> |
| |
| <sect2> |
| <title>Limitations</title> |
| |
| <itemizedlist> |
| <listitem> |
| <para> |
| <type>citext</type>'s case-folding behavior depends on |
| the <literal>LC_CTYPE</literal> setting of your database. How it compares |
| values is therefore determined when the database is created. |
| It is not truly |
| case-insensitive in the terms defined by the Unicode standard. |
| Effectively, what this means is that, as long as you're happy with your |
| collation, you should be happy with <type>citext</type>'s comparisons. But |
| if you have data in different languages stored in your database, users |
| of one language may find their query results are not as expected if the |
| collation is for another language. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| As of <productname>PostgreSQL</productname> 9.1, you can attach a |
| <literal>COLLATE</literal> specification to <type>citext</type> columns or data |
| values. Currently, <type>citext</type> operators will honor a non-default |
| <literal>COLLATE</literal> specification while comparing case-folded strings, |
| but the initial folding to lower case is always done according to the |
| database's <literal>LC_CTYPE</literal> setting (that is, as though |
| <literal>COLLATE "default"</literal> were given). This may be changed in a |
| future release so that both steps follow the input <literal>COLLATE</literal> |
| specification. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <type>citext</type> is not as efficient as <type>text</type> because the |
| operator functions and the B-tree comparison functions must make copies |
| of the data and convert it to lower case for comparisons. Also, only |
| <type>text</type> can support B-Tree deduplication. However, |
| <type>citext</type> is slightly more efficient than using |
| <function>lower</function> to get case-insensitive matching. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <type>citext</type> doesn't help much if you need data to compare |
| case-sensitively in some contexts and case-insensitively in other |
| contexts. The standard answer is to use the <type>text</type> type and |
| manually use the <function>lower</function> function when you need to compare |
| case-insensitively; this works all right if case-insensitive comparison |
| is needed only infrequently. If you need case-insensitive behavior most |
| of the time and case-sensitive infrequently, consider storing the data |
| as <type>citext</type> and explicitly casting the column to <type>text</type> |
| when you want case-sensitive comparison. In either situation, you will |
| need two indexes if you want both types of searches to be fast. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The schema containing the <type>citext</type> operators must be |
| in the current <varname>search_path</varname> (typically <literal>public</literal>); |
| if it is not, the normal case-sensitive <type>text</type> operators |
| will be invoked instead. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| The approach of lower-casing strings for comparison does not handle some |
| Unicode special cases correctly, for example when one upper-case letter |
| has two lower-case letter equivalents. Unicode distinguishes between |
| <firstterm>case mapping</firstterm> and <firstterm>case |
| folding</firstterm> for this reason. Use nondeterministic collations |
| instead of <type>citext</type> to handle that correctly. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </sect2> |
| |
| <sect2> |
| <title>Author</title> |
| |
| <para> |
| David E. Wheeler <email>david@kineticode.com</email> |
| </para> |
| |
| <para> |
| Inspired by the original <type>citext</type> module by Donald Fraser. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |