| <!-- doc/src/sgml/unaccent.sgml --> |
| |
| <sect1 id="unaccent" xreflabel="unaccent"> |
| <title>unaccent</title> |
| |
| <indexterm zone="unaccent"> |
| <primary>unaccent</primary> |
| </indexterm> |
| |
| <para> |
| <filename>unaccent</filename> is a text search dictionary that removes accents |
| (diacritic signs) from lexemes. |
| It is a filtering dictionary: rather than ending the lookup when it |
| recognizes a token, it always passes its output to the next dictionary |
| in the list (if any), unlike ordinary dictionaries. This allows |
| accent-insensitive processing for full text search. |
| </para> |
| |
| <para> |
| The current implementation of <filename>unaccent</filename> cannot be used as a |
| normalizing dictionary for the <filename>thesaurus</filename> dictionary. |
| </para> |
| |
| <para> |
| This module is considered <quote>trusted</quote>, that is, it can be |
| installed by non-superusers who have <literal>CREATE</literal> privilege |
| on the current database. |
| </para> |
| |
| <sect2> |
| <title>Configuration</title> |
| |
| <para> |
| An <literal>unaccent</literal> dictionary accepts the following options: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <literal>RULES</literal> is the base name of the file containing the list of |
| translation rules. This file must be stored in |
| <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means |
| the <productname>PostgreSQL</productname> installation's shared-data directory). |
| Its name must end in <literal>.rules</literal> (which is not to be included in |
| the <literal>RULES</literal> parameter). |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
| The rules file has the following format: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Each line represents one translation rule, consisting of a character with |
| accent followed by a character without accent. The first is translated |
| into the second. For example, |
| <programlisting> |
| À A |
| Á A |
| Â A |
| Ã A |
| Ä A |
| Å A |
| Æ AE |
| </programlisting> |
| The two characters must be separated by whitespace, and any leading or |
| trailing whitespace on a line is ignored. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Alternatively, if only one character is given on a line, instances of |
| that character are deleted; this is useful in languages where accents |
| are represented by separate characters. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Each <quote>character</quote> can in fact be any string not containing |
| whitespace, so <filename>unaccent</filename> dictionaries could be used for |
| other sorts of substring substitutions besides diacritic removal. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| As with other <productname>PostgreSQL</productname> text search configuration files, |
| the rules file must be stored in UTF-8 encoding. The data is |
| automatically translated into the current database's encoding when |
| loaded. Any lines containing untranslatable characters are silently |
| ignored, so that rules files can contain rules that are not applicable in |
| the current encoding. |
| </para> |
| </listitem> |
| </itemizedlist> |
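| |
| <para> |
| As a minimal sketch of this format, a hypothetical custom rules file |
| (called <filename>my_rules.rules</filename> here purely for illustration; |
| it is not shipped with the module) might combine the kinds of lines |
| described above: |
| <programlisting> |
| é e |
| ß ss |
| ´ |
| </programlisting> |
| The first line is an ordinary one-character translation, the second |
| replaces a single character with a two-character string, and the third, |
| having only one entry, deletes every occurrence of the spacing acute |
| accent. |
| </para> |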
| |
| <para> |
| A more complete example, which is directly useful for most European |
| languages, can be found in <filename>unaccent.rules</filename>, which is installed |
| in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename> |
| module is installed. This rules file translates characters with accents |
| to the same characters without accents, and it also expands ligatures |
| into the equivalent series of simple characters (for example, Æ to |
| AE). |
| </para> |
| </sect2> |
| |
| <sect2> |
| <title>Usage</title> |
| |
| <para> |
| Installing the <literal>unaccent</literal> extension creates a text |
| search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal> |
| based on it. The <literal>unaccent</literal> dictionary has the default |
| parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately |
| usable with the standard <filename>unaccent.rules</filename> file. |
| If you wish, you can alter the parameter, for example |
| |
| <programlisting> |
| mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules'); |
| </programlisting> |
| |
| or create new dictionaries based on the template. |
| </para> |
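| |
| <para> |
| As a sketch of the second option, the following installs the extension |
| and creates a separate dictionary from the template, leaving the stock |
| <literal>unaccent</literal> dictionary untouched. The names |
| <literal>my_unaccent</literal> and <literal>my_rules</literal> are |
| illustrative assumptions; a file <filename>my_rules.rules</filename> |
| would have to exist in <filename>$SHAREDIR/tsearch_data/</filename>: |
| <programlisting> |
| CREATE EXTENSION IF NOT EXISTS unaccent; |
| |
| -- Hypothetical dictionary and rules file names. |
| CREATE TEXT SEARCH DICTIONARY my_unaccent ( |
|     TEMPLATE = unaccent, |
|     RULES = 'my_rules' |
| ); |
| </programlisting> |
| </para> |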
| |
| <para> |
| To test the dictionary, you can try: |
| <programlisting> |
| mydb=# SELECT ts_lexize('unaccent','Hôtel'); |
| ts_lexize |
| ----------- |
| {Hotel} |
| (1 row) |
| </programlisting> |
| </para> |
| |
| <para> |
| Here is an example showing how to insert the |
| <filename>unaccent</filename> dictionary into a text search configuration: |
| <programlisting> |
| mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french ); |
| mydb=# ALTER TEXT SEARCH CONFIGURATION fr |
| ALTER MAPPING FOR hword, hword_part, word |
| WITH unaccent, french_stem; |
| mydb=# SELECT to_tsvector('fr','Hôtels de la Mer'); |
| to_tsvector |
| ------------------- |
| 'hotel':1 'mer':4 |
| (1 row) |
| |
| mydb=# SELECT to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels'); |
| ?column? |
| ---------- |
| t |
| (1 row) |
| |
| mydb=# SELECT ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels')); |
| ts_headline |
| ------------------------ |
| <b>Hôtel</b> de la Mer |
| (1 row) |
| </programlisting> |
| </para> |
| </sect2> |
| |
| <sect2> |
| <title>Functions</title> |
| |
| <para> |
| The <function>unaccent()</function> function removes accents (diacritic signs) from |
| a given string. It is essentially a wrapper around |
| <filename>unaccent</filename>-type dictionaries, but it can be used outside normal |
| text search contexts. |
| </para> |
| |
| <indexterm> |
| <primary>unaccent</primary> |
| </indexterm> |
| |
| <synopsis> |
| unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type> |
| </synopsis> |
| |
| <para> |
| If the <replaceable class="parameter">dictionary</replaceable> argument is |
| omitted, the text search dictionary named <literal>unaccent</literal> and |
| appearing in the same schema as the <function>unaccent()</function> |
| function itself is used. |
| </para> |
| |
| <para> |
| For example: |
| <programlisting> |
| SELECT unaccent('unaccent', 'Hôtel'); |
| SELECT unaccent('Hôtel'); |
| </programlisting> |
| </para> |
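| |
| <para> |
| Because the function operates on plain <type>text</type> values, it can |
| also be used for accent-insensitive comparisons in ordinary queries. A |
| minimal sketch, assuming the default <literal>unaccent</literal> |
| dictionary with the standard rules file: |
| <programlisting> |
| SELECT unaccent('Hôtel') = unaccent('Hotel') AS same; |
| </programlisting> |
| With those rules both sides reduce to <literal>Hotel</literal>, so the |
| comparison yields true. |
| </para> |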
| </sect2> |
| |
| </sect1> |