| <!-- doc/src/sgml/textsearch.sgml --> |
| |
| <chapter id="textsearch"> |
| <title>Full Text Search</title> |
| |
| <indexterm zone="textsearch"> |
| <primary>full text search</primary> |
| </indexterm> |
| |
| <indexterm zone="textsearch"> |
| <primary>text search</primary> |
| </indexterm> |
| |
| <sect1 id="textsearch-intro"> |
| <title>Introduction</title> |
| |
| <para> |
| Full Text Searching (or just <firstterm>text search</firstterm>) provides |
| the capability to identify natural-language <firstterm>documents</firstterm> that |
| satisfy a <firstterm>query</firstterm>, and optionally to sort them by |
| relevance to the query. The most common type of search |
| is to find all documents containing given <firstterm>query terms</firstterm> |
| and return them in order of their <firstterm>similarity</firstterm> to the |
| query. Notions of <varname>query</varname> and |
| <varname>similarity</varname> are very flexible and depend on the specific |
| application. The simplest search considers <varname>query</varname> as a |
| set of words and <varname>similarity</varname> as the frequency of query |
| words in the document. |
| </para> |
| |
| <para> |
| Textual search operators have existed in databases for years. |
| <productname>PostgreSQL</productname> has |
| <literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and |
| <literal>ILIKE</literal> operators for textual data types, but they lack |
| many essential properties required by modern information systems: |
| </para> |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| There is no linguistic support, even for English. Regular expressions |
| are not sufficient because they cannot easily handle derived words, e.g., |
| <literal>satisfies</literal> and <literal>satisfy</literal>. You might |
| miss documents that contain <literal>satisfies</literal>, although you |
| probably would like to find them when searching for |
| <literal>satisfy</literal>. It is possible to use <literal>OR</literal> |
| to search for multiple derived forms, but this is tedious and error-prone |
| (some words can have several thousand derivatives). |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| They provide no ordering (ranking) of search results, which makes them |
| ineffective when thousands of matching documents are found. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| They tend to be slow because there is no index support, so they must |
| process all documents for every search. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Full text indexing allows documents to be <emphasis>preprocessed</emphasis> |
| and an index saved for later rapid searching. Preprocessing includes: |
| </para> |
| |
| <itemizedlist mark="none"> |
| <listitem> |
| <para> |
| <emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is |
| useful to identify various classes of tokens, e.g., numbers, words, |
| complex words, email addresses, so that they can be processed |
| differently. In principle token classes depend on the specific |
| application, but for most purposes it is adequate to use a predefined |
| set of classes. |
| <productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to |
| perform this step. A standard parser is provided, and custom parsers |
| can be created for specific needs. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>. |
| A lexeme is a string, just like a token, but it has been |
| <firstterm>normalized</firstterm> so that different forms of the same word |
| are made alike. For example, normalization almost always includes |
| folding upper-case letters to lower-case, and often involves removal |
| of suffixes (such as <literal>s</literal> or <literal>es</literal> in English). |
| This allows searches to find variant forms of the |
| same word, without tediously entering all the possible variants. |
| Also, this step typically eliminates <firstterm>stop words</firstterm>, which |
| are words that are so common that they are useless for searching. |
| (In short, then, tokens are raw fragments of the document text, while |
| lexemes are words that are believed useful for indexing and searching.) |
| <productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to |
| perform this step. Various standard dictionaries are provided, and |
| custom ones can be created for specific needs. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <emphasis>Storing preprocessed documents optimized for |
| searching</emphasis>. For example, each document can be represented |
| as a sorted array of normalized lexemes. Along with the lexemes it is |
| often desirable to store positional information to use for |
| <firstterm>proximity ranking</firstterm>, so that a document that |
| contains a more <quote>dense</quote> region of query words is |
| assigned a higher rank than one with scattered query words. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Dictionaries allow fine-grained control over how tokens are normalized. |
| With appropriate dictionaries, you can: |
| </para> |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| Define stop words that should not be indexed. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Map synonyms to a single word using <application>Ispell</application>. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Map phrases to a single word using a thesaurus. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Map different variations of a word to a canonical form using |
| an <application>Ispell</application> dictionary. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| Map different variations of a word to a canonical form using |
| <application>Snowball</application> stemmer rules. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| A data type <type>tsvector</type> is provided for storing preprocessed |
| documents, along with a type <type>tsquery</type> for representing processed |
| queries (<xref linkend="datatype-textsearch"/>). There are many |
| functions and operators available for these data types |
| (<xref linkend="functions-textsearch"/>), the most important of which is |
| the match operator <literal>@@</literal>, which we introduce in |
| <xref linkend="textsearch-matching"/>. Full text searches can be accelerated |
| using indexes (<xref linkend="textsearch-indexes"/>). |
| </para> |
| |
| |
| <sect2 id="textsearch-document"> |
| <title>What Is a Document?</title> |
| |
| <indexterm zone="textsearch-document"> |
| <primary>document</primary> |
| <secondary>text search</secondary> |
| </indexterm> |
| |
| <para> |
| A <firstterm>document</firstterm> is the unit of searching in a full text search |
| system; for example, a magazine article or email message. The text search |
| engine must be able to parse documents and store associations of lexemes |
| (key words) with their parent document. Later, these associations are |
| used to search for documents that contain query words. |
| </para> |
| |
| <para> |
| For searches within <productname>PostgreSQL</productname>, |
| a document is normally a textual field within a row of a database table, |
| or possibly a combination (concatenation) of such fields, perhaps stored |
| in several tables or obtained dynamically. In other words, a document can |
| be constructed from different parts for indexing and it might not be |
| stored anywhere as a whole. For example: |
| |
| <programlisting> |
| SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document |
| FROM messages |
| WHERE mid = 12; |
| |
| SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document |
| FROM messages m, docs d |
| WHERE m.mid = d.did AND m.mid = 12; |
| </programlisting> |
| </para> |
| |
| <note> |
| <para> |
| Actually, in these example queries, <function>coalesce</function> |
| should be used to prevent a single <literal>NULL</literal> attribute from |
| causing a <literal>NULL</literal> result for the whole document. |
| </para> |
| </note> |
| |
| <para> |
| Another possibility is to store the documents as simple text files in the |
| file system. In this case, the database can be used to store the full text |
| index and to execute searches, and some unique identifier can be used to |
| retrieve the document from the file system. However, retrieving files |
| from outside the database requires superuser permissions or special |
| function support, so this is usually less convenient than keeping all |
| the data inside <productname>PostgreSQL</productname>. Also, keeping |
| everything inside the database allows easy access |
| to document metadata to assist in indexing and display. |
| </para> |
| |
| <para> |
| For text search purposes, each document must be reduced to the |
| preprocessed <type>tsvector</type> format. Searching and ranking |
| are performed entirely on the <type>tsvector</type> representation |
| of a document — the original text need only be retrieved |
| when the document has been selected for display to a user. |
| We therefore often speak of the <type>tsvector</type> as being the |
| document, but of course it is only a compact representation of |
| the full document. |
| </para> |
| </sect2> |
| |
| <sect2 id="textsearch-matching"> |
| <title>Basic Text Matching</title> |
| |
| <para> |
| Full text searching in <productname>PostgreSQL</productname> is based on |
| the match operator <literal>@@</literal>, which returns |
| <literal>true</literal> if a <type>tsvector</type> |
| (document) matches a <type>tsquery</type> (query). |
| It doesn't matter which data type is written first: |
| |
| <programlisting> |
| SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery; |
| ?column? |
| ---------- |
| t |
| |
| SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector; |
| ?column? |
| ---------- |
| f |
| </programlisting> |
| </para> |
| |
| <para> |
| As the above example suggests, a <type>tsquery</type> is not just raw |
| text, any more than a <type>tsvector</type> is. A <type>tsquery</type> |
| contains search terms, which must be already-normalized lexemes, and |
| may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. |
| (For syntax details see <xref linkend="datatype-tsquery"/>.) There are |
| functions <function>to_tsquery</function>, <function>plainto_tsquery</function>, |
| and <function>phraseto_tsquery</function> |
| that are helpful in converting user-written text into a proper |
| <type>tsquery</type>, primarily by normalizing words appearing in |
| the text. Similarly, <function>to_tsvector</function> is used to parse and |
| normalize a document string. So in practice a text search match would |
| look more like this: |
| |
| <programlisting> |
| SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat'); |
| ?column? |
| ---------- |
| t |
| </programlisting> |
| |
| Observe that this match would not succeed if written as |
| |
| <programlisting> |
| SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat'); |
| ?column? |
| ---------- |
| f |
| </programlisting> |
| |
| since here no normalization of the word <literal>rats</literal> will occur. |
| The elements of a <type>tsvector</type> are lexemes, which are assumed |
| already normalized, so <literal>rats</literal> does not match <literal>rat</literal>. |
| </para> |
| |
| <para> |
| The <literal>@@</literal> operator also |
| supports <type>text</type> input, allowing explicit conversion of a text |
| string to <type>tsvector</type> or <type>tsquery</type> to be skipped |
| in simple cases. The variants available are: |
| |
| <programlisting> |
| tsvector @@ tsquery |
| tsquery @@ tsvector |
| text @@ tsquery |
| text @@ text |
| </programlisting> |
| </para> |
| |
| <para> |
| The first two of these we saw already. |
| The form <type>text</type> <literal>@@</literal> <type>tsquery</type> |
| is equivalent to <literal>to_tsvector(x) @@ y</literal>. |
| The form <type>text</type> <literal>@@</literal> <type>text</type> |
| is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>. |
| </para> |
| |
| <para> |
| Within a <type>tsquery</type>, the <literal>&</literal> (AND) operator |
| specifies that both its arguments must appear in the document to have a |
| match. Similarly, the <literal>|</literal> (OR) operator specifies that |
| at least one of its arguments must appear, while the <literal>!</literal> (NOT) |
| operator specifies that its argument must <emphasis>not</emphasis> appear in |
| order to have a match. |
| For example, the query <literal>fat & ! rat</literal> matches documents that |
| contain <literal>fat</literal> but not <literal>rat</literal>. |
| </para> |
| |
| <para> |
| Searching for phrases is possible with the help of |
| the <literal><-></literal> (FOLLOWED BY) <type>tsquery</type> operator, which |
| matches only if its arguments have matches that are adjacent and in the |
| given order. For example: |
| |
| <programlisting> |
| SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error'); |
| ?column? |
| ---------- |
| t |
| |
| SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error'); |
| ?column? |
| ---------- |
| f |
| </programlisting> |
| |
| There is a more general version of the FOLLOWED BY operator having the |
| form <literal><<replaceable>N</replaceable>></literal>, |
| where <replaceable>N</replaceable> is an integer standing for the difference between |
| the positions of the matching lexemes. <literal><1></literal> is |
| the same as <literal><-></literal>, while <literal><2></literal> |
| allows exactly one other lexeme to appear between the matches, and so |
| on. The <literal>phraseto_tsquery</literal> function makes use of this |
| operator to construct a <literal>tsquery</literal> that can match a multi-word |
| phrase when some of the words are stop words. For example: |
| |
| <programlisting> |
| SELECT phraseto_tsquery('cats ate rats'); |
| phraseto_tsquery |
| ------------------------------- |
| 'cat' <-> 'ate' <-> 'rat' |
| |
| SELECT phraseto_tsquery('the cats ate the rats'); |
| phraseto_tsquery |
| ------------------------------- |
| 'cat' <-> 'ate' <2> 'rat' |
| </programlisting> |
| </para> |
| |
| <para> |
| A special case that's sometimes useful is that <literal><0></literal> |
| can be used to require that two patterns match the same word. |
| </para> |
| |
| <para> |
| Parentheses can be used to control nesting of the <type>tsquery</type> |
| operators. Without parentheses, <literal>|</literal> binds least tightly, |
| then <literal>&</literal>, then <literal><-></literal>, |
| and <literal>!</literal> most tightly. |
| </para> |
| |
| <para> |
| It's worth noticing that the AND/OR/NOT operators mean something subtly |
| different when they are within the arguments of a FOLLOWED BY operator |
| than when they are not, because within FOLLOWED BY the exact position of |
| the match is significant. For example, normally <literal>!x</literal> matches |
| only documents that do not contain <literal>x</literal> anywhere. |
| But <literal>!x <-> y</literal> matches <literal>y</literal> if it is not |
| immediately after an <literal>x</literal>; an occurrence of <literal>x</literal> |
| elsewhere in the document does not prevent a match. Another example is |
| that <literal>x & y</literal> normally only requires that <literal>x</literal> |
| and <literal>y</literal> both appear somewhere in the document, but |
| <literal>(x & y) <-> z</literal> requires <literal>x</literal> |
| and <literal>y</literal> to match at the same place, immediately before |
| a <literal>z</literal>. Thus this query behaves differently from |
| <literal>x <-> z & y <-> z</literal>, which will match a |
| document containing two separate sequences <literal>x z</literal> and |
| <literal>y z</literal>. (This specific query is useless as written, |
| since <literal>x</literal> and <literal>y</literal> could not match at the same place; |
| but with more complex situations such as prefix-match patterns, a query |
| of this form could be useful.) |
| </para> |
| </sect2> |
| |
| <sect2 id="textsearch-intro-configurations"> |
| <title>Configurations</title> |
| |
| <para> |
| The above are all simple text search examples. As mentioned before, full |
| text search functionality includes the ability to do many more things: |
| skip indexing certain words (stop words), process synonyms, and use |
| sophisticated parsing, e.g., parse based on more than just white space. |
| This functionality is controlled by <firstterm>text search |
| configurations</firstterm>. <productname>PostgreSQL</productname> comes with predefined |
| configurations for many languages, and you can easily create your own |
| configurations. (<application>psql</application>'s <command>\dF</command> command |
| shows all available configurations.) |
| </para> |
| |
| <para> |
| During installation an appropriate configuration is selected and |
| <xref linkend="guc-default-text-search-config"/> is set accordingly |
| in <filename>postgresql.conf</filename>. If you are using the same text search |
| configuration for the entire cluster you can use the value in |
| <filename>postgresql.conf</filename>. To use different configurations |
| throughout the cluster but the same configuration within any one database, |
| use <command>ALTER DATABASE ... SET</command>. Otherwise, you can set |
| <varname>default_text_search_config</varname> in each session. |
| </para> |
| |
| <para> |
| Each text search function that depends on a configuration has an optional |
| <type>regconfig</type> argument, so that the configuration to use can be |
| specified explicitly. <varname>default_text_search_config</varname> |
| is used only when this argument is omitted. |
| </para> |
| |
| <para> |
| To make it easier to build custom text search configurations, a |
| configuration is built up from simpler database objects. |
| <productname>PostgreSQL</productname>'s text search facility provides |
| four types of configuration-related database objects: |
| </para> |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <firstterm>Text search parsers</firstterm> break documents into tokens |
| and classify each token (for example, as words or numbers). |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <firstterm>Text search dictionaries</firstterm> convert tokens to normalized |
| form and reject stop words. |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <firstterm>Text search templates</firstterm> provide the functions underlying |
| dictionaries. (A dictionary simply specifies a template and a set |
| of parameters for the template.) |
| </para> |
| </listitem> |
| |
| <listitem> |
| <para> |
| <firstterm>Text search configurations</firstterm> select a parser and a set |
| of dictionaries to use to normalize the tokens produced by the parser. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Text search parsers and templates are built from low-level C functions; |
| therefore it requires C programming ability to develop new ones, and |
| superuser privileges to install one into a database. (There are examples |
| of add-on parsers and templates in the <filename>contrib/</filename> area of the |
| <productname>PostgreSQL</productname> distribution.) Since dictionaries and |
| configurations just parameterize and connect together some underlying |
| parsers and templates, no special privilege is needed to create a new |
| dictionary or configuration. Examples of creating custom dictionaries and |
| configurations appear later in this chapter. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-tables"> |
| <title>Tables and Indexes</title> |
| |
| <para> |
| The examples in the previous section illustrated full text matching using |
| simple constant strings. This section shows how to search table data, |
| optionally using indexes. |
| </para> |
| |
| <sect2 id="textsearch-tables-search"> |
| <title>Searching a Table</title> |
| |
| <para> |
| It is possible to do a full text search without an index. A simple query |
| to print the <structname>title</structname> of each row that contains the word |
| <literal>friend</literal> in its <structfield>body</structfield> field is: |
| |
| <programlisting> |
| SELECT title |
| FROM pgweb |
| WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend'); |
| </programlisting> |
| |
| This will also find related words such as <literal>friends</literal> |
| and <literal>friendly</literal>, since all these are reduced to the same |
| normalized lexeme. |
| </para> |
| |
| <para> |
| The query above specifies that the <literal>english</literal> configuration |
| is to be used to parse and normalize the strings. Alternatively we |
| could omit the configuration parameters: |
| |
| <programlisting> |
| SELECT title |
| FROM pgweb |
| WHERE to_tsvector(body) @@ to_tsquery('friend'); |
| </programlisting> |
| |
| This query will use the configuration set by <xref |
| linkend="guc-default-text-search-config"/>. |
| </para> |
| |
| <para> |
| A more complex example is to |
| select the ten most recent documents that contain <literal>create</literal> and |
| <literal>table</literal> in the <structname>title</structname> or <structname>body</structname>: |
| |
| <programlisting> |
| SELECT title |
| FROM pgweb |
| WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') |
| ORDER BY last_mod_date DESC |
| LIMIT 10; |
| </programlisting> |
| |
| For clarity we omitted the <function>coalesce</function> function calls |
| which would be needed to find rows that contain <literal>NULL</literal> |
| in one of the two fields. |
| </para> |
| |
| <para> |
| Although these queries will work without an index, most applications |
| will find this approach too slow, except perhaps for occasional ad-hoc |
| searches. Practical use of text searching usually requires creating |
| an index. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-tables-index"> |
| <title>Creating Indexes</title> |
| |
| <para> |
| We can create a <acronym>GIN</acronym> index (<xref |
| linkend="textsearch-indexes"/>) to speed up text searches: |
| |
| <programlisting> |
| CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body)); |
| </programlisting> |
| |
| Notice that the 2-argument version of <function>to_tsvector</function> is |
| used. Only text search functions that specify a configuration name can |
| be used in expression indexes (<xref linkend="indexes-expressional"/>). |
| This is because the index contents must be unaffected by <xref |
| linkend="guc-default-text-search-config"/>. If they were affected, the |
| index contents might be inconsistent because different entries could |
| contain <type>tsvector</type>s that were created with different text search |
| configurations, and there would be no way to guess which was which. It |
| would be impossible to dump and restore such an index correctly. |
| </para> |
| |
| <para> |
| Because the two-argument version of <function>to_tsvector</function> was |
| used in the index above, only a query reference that uses the 2-argument |
| version of <function>to_tsvector</function> with the same configuration |
| name will use that index. That is, <literal>WHERE |
| to_tsvector('english', body) @@ 'a & b'</literal> can use the index, |
| but <literal>WHERE to_tsvector(body) @@ 'a & b'</literal> cannot. |
| This ensures that an index will be used only with the same configuration |
| used to create the index entries. |
| </para> |
| |
| <para> |
| It is possible to set up more complex expression indexes wherein the |
| configuration name is specified by another column, e.g.: |
| |
| <programlisting> |
| CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body)); |
| </programlisting> |
| |
| where <literal>config_name</literal> is a column in the <literal>pgweb</literal> |
| table. This allows mixed configurations in the same index while |
| recording which configuration was used for each index entry. This |
| would be useful, for example, if the document collection contained |
| documents in different languages. Again, |
| queries that are meant to use the index must be phrased to match, e.g., |
| <literal>WHERE to_tsvector(config_name, body) @@ 'a & b'</literal>. |
| </para> |
| |
| <para> |
| Indexes can even concatenate columns: |
| |
| <programlisting> |
| CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body)); |
| </programlisting> |
| </para> |
| |
| <para> |
| Another approach is to create a separate <type>tsvector</type> column |
| to hold the output of <function>to_tsvector</function>. To keep this |
| column automatically up to date with its source data, use a stored |
| generated column. This example is a |
| concatenation of <literal>title</literal> and <literal>body</literal>, |
| using <function>coalesce</function> to ensure that one field will still be |
| indexed when the other is <literal>NULL</literal>: |
| |
| <programlisting> |
| ALTER TABLE pgweb |
| ADD COLUMN textsearchable_index_col tsvector |
| GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED; |
| </programlisting> |
| |
| Then we create a <acronym>GIN</acronym> index to speed up the search: |
| |
| <programlisting> |
| CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col); |
| </programlisting> |
| |
| Now we are ready to perform a fast full text search: |
| |
| <programlisting> |
| SELECT title |
| FROM pgweb |
| WHERE textsearchable_index_col @@ to_tsquery('create & table') |
| ORDER BY last_mod_date DESC |
| LIMIT 10; |
| </programlisting> |
| </para> |
| |
| <para> |
| One advantage of the separate-column approach over an expression index |
| is that it is not necessary to explicitly specify the text search |
| configuration in queries in order to make use of the index. As shown |
| in the example above, the query can depend on |
| <varname>default_text_search_config</varname>. Another advantage is that |
| searches will be faster, since it will not be necessary to redo the |
| <function>to_tsvector</function> calls to verify index matches. (This is more |
| important when using a GiST index than a GIN index; see <xref |
| linkend="textsearch-indexes"/>.) The expression-index approach is |
| simpler to set up, however, and it requires less disk space since the |
| <type>tsvector</type> representation is not stored explicitly. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-controls"> |
| <title>Controlling Text Search</title> |
| |
| <para> |
| To implement full text searching there must be a function to create a |
| <type>tsvector</type> from a document and a <type>tsquery</type> from a |
| user query. Also, we need to return results in a useful order, so we need |
| a function that compares documents with respect to their relevance to |
| the query. It's also important to be able to display the results nicely. |
| <productname>PostgreSQL</productname> provides support for all of these |
| functions. |
| </para> |
| |
| <sect2 id="textsearch-parsing-documents"> |
| <title>Parsing Documents</title> |
| |
| <para> |
| <productname>PostgreSQL</productname> provides the |
| function <function>to_tsvector</function> for converting a document to |
| the <type>tsvector</type> data type. |
| </para> |
| |
| <indexterm> |
| <primary>to_tsvector</primary> |
| </indexterm> |
| |
| <synopsis> |
| to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type> |
| </synopsis> |
| |
| <para> |
| <function>to_tsvector</function> parses a textual document into tokens, |
| reduces the tokens to lexemes, and returns a <type>tsvector</type> which |
| lists the lexemes together with their positions in the document. |
| The document is processed according to the specified or default |
| text search configuration. |
| Here is a simple example: |
| |
| <screen> |
| SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats'); |
| to_tsvector |
| ----------------------------------------------------- |
| 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4 |
| </screen> |
| </para> |
| |
| <para> |
| In the example above we see that the resulting <type>tsvector</type> does not |
| contain the words <literal>a</literal>, <literal>on</literal>, or |
| <literal>it</literal>, the word <literal>rats</literal> became |
| <literal>rat</literal>, and the punctuation sign <literal>-</literal> was |
| ignored. |
| </para> |
| |
| <para> |
| The <function>to_tsvector</function> function internally calls a parser |
| which breaks the document text into tokens and assigns a type to |
| each token. For each token, a list of |
| dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted, |
| where the list can vary depending on the token type. The first dictionary |
| that <firstterm>recognizes</firstterm> the token emits one or more normalized |
| <firstterm>lexemes</firstterm> to represent the token. For example, |
| <literal>rats</literal> became <literal>rat</literal> because one of the |
| dictionaries recognized that the word <literal>rats</literal> is a plural |
| form of <literal>rat</literal>. Some words are recognized as |
| <firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which |
| causes them to be ignored since they occur too frequently to be useful in |
| searching. In our example these are |
| <literal>a</literal>, <literal>on</literal>, and <literal>it</literal>. |
| If no dictionary in the list recognizes the token then it is also ignored. |
| In this example that happened to the punctuation sign <literal>-</literal> |
| because there are in fact no dictionaries assigned for its token type |
| (<literal>Space symbols</literal>), meaning space tokens will never be |
| indexed. The choices of parser, dictionaries and which types of tokens to |
| index are determined by the selected text search configuration (<xref |
| linkend="textsearch-configuration"/>). It is possible to have |
| many different configurations in the same database, and predefined |
| configurations are available for various languages. In our example |
| we used the default configuration <literal>english</literal> for the |
| English language. |
| </para> |
| |
| <para> |
| The function <function>setweight</function> can be used to label the |
| entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>, |
| where a weight is one of the letters <literal>A</literal>, <literal>B</literal>, |
| <literal>C</literal>, or <literal>D</literal>. |
| This is typically used to mark entries coming from |
| different parts of a document, such as title versus body. Later, this |
| information can be used for ranking of search results. |
| </para> |
| |
| <para> |
| Because <function>to_tsvector</function>(<literal>NULL</literal>) will |
| return <literal>NULL</literal>, it is recommended to use |
| <function>coalesce</function> whenever a field might be null. |
| Here is the recommended method for creating |
| a <type>tsvector</type> from a structured document: |
| |
| <programlisting> |
| UPDATE tt SET ti = |
| setweight(to_tsvector(coalesce(title,'')), 'A') || |
| setweight(to_tsvector(coalesce(keyword,'')), 'B') || |
| setweight(to_tsvector(coalesce(abstract,'')), 'C') || |
| setweight(to_tsvector(coalesce(body,'')), 'D'); |
| </programlisting> |
| |
| Here we have used <function>setweight</function> to label the source |
| of each lexeme in the finished <type>tsvector</type>, and then merged |
| the labeled <type>tsvector</type> values using the <type>tsvector</type> |
| concatenation operator <literal>||</literal>. (<xref |
| linkend="textsearch-manipulate-tsvector"/> gives details about these |
| operations.) |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-parsing-queries"> |
| <title>Parsing Queries</title> |
| |
| <para> |
| <productname>PostgreSQL</productname> provides the |
| functions <function>to_tsquery</function>, |
| <function>plainto_tsquery</function>, |
| <function>phraseto_tsquery</function> and |
| <function>websearch_to_tsquery</function> |
| for converting a query to the <type>tsquery</type> data type. |
| <function>to_tsquery</function> offers access to more features |
| than either <function>plainto_tsquery</function> or |
| <function>phraseto_tsquery</function>, but it is less forgiving about its |
| input. <function>websearch_to_tsquery</function> is a simplified version |
| of <function>to_tsquery</function> with an alternative syntax, similar |
| to the one used by web search engines. |
| </para> |
| |
| <indexterm> |
| <primary>to_tsquery</primary> |
| </indexterm> |
| |
| <synopsis> |
| to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> |
| </synopsis> |
| |
| <para> |
| <function>to_tsquery</function> creates a <type>tsquery</type> value from |
| <replaceable>querytext</replaceable>, which must consist of single tokens |
| separated by the <type>tsquery</type> operators <literal>&</literal> (AND), |
| <literal>|</literal> (OR), <literal>!</literal> (NOT), and |
| <literal><-></literal> (FOLLOWED BY), possibly grouped |
| using parentheses. In other words, the input to |
| <function>to_tsquery</function> must already follow the general rules for |
| <type>tsquery</type> input, as described in <xref |
| linkend="datatype-tsquery"/>. The difference is that while basic |
| <type>tsquery</type> input takes the tokens at face value, |
| <function>to_tsquery</function> normalizes each token into a lexeme using |
| the specified or default configuration, and discards any tokens that are |
| stop words according to the configuration. For example: |
| |
| <screen> |
| SELECT to_tsquery('english', 'The & Fat & Rats'); |
| to_tsquery |
| --------------- |
| 'fat' & 'rat' |
| </screen> |
| |
| As in basic <type>tsquery</type> input, weight(s) can be attached to each |
| lexeme to restrict it to match only <type>tsvector</type> lexemes of those |
| weight(s). For example: |
| |
| <screen> |
| SELECT to_tsquery('english', 'Fat | Rats:AB'); |
| to_tsquery |
| ------------------ |
| 'fat' | 'rat':AB |
| </screen> |
| |
| Also, <literal>*</literal> can be attached to a lexeme to specify prefix matching: |
| |
| <screen> |
| SELECT to_tsquery('supern:*A & star:A*B'); |
| to_tsquery |
| -------------------------- |
| 'supern':*A & 'star':*AB |
| </screen> |
| |
| Such a lexeme will match any word in a <type>tsvector</type> that begins |
| with the given string. |
| </para> |
| |
| <para> |
| <function>to_tsquery</function> can also accept single-quoted |
| phrases. This is primarily useful when the configuration includes a |
| thesaurus dictionary that may trigger on such phrases. |
| In the example below, a thesaurus contains the rule <literal>supernovae |
| stars : sn</literal>: |
| |
| <screen> |
| SELECT to_tsquery('''supernovae stars'' & !crab'); |
| to_tsquery |
| --------------- |
| 'sn' & !'crab' |
| </screen> |
| |
| Without quotes, <function>to_tsquery</function> will generate a syntax |
| error for tokens that are not separated by an AND, OR, or FOLLOWED BY |
| operator. |
| </para> |
| |
| <indexterm> |
| <primary>plainto_tsquery</primary> |
| </indexterm> |
| |
| <synopsis> |
| plainto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> |
| </synopsis> |
| |
| <para> |
| <function>plainto_tsquery</function> transforms the unformatted text |
| <replaceable>querytext</replaceable> to a <type>tsquery</type> value. |
| The text is parsed and normalized much as for <function>to_tsvector</function>, |
| then the <literal>&</literal> (AND) <type>tsquery</type> operator is |
| inserted between surviving words. |
| </para> |
| |
| <para> |
| Example: |
| |
| <screen> |
| SELECT plainto_tsquery('english', 'The Fat Rats'); |
| plainto_tsquery |
| ----------------- |
| 'fat' & 'rat' |
| </screen> |
| |
| Note that <function>plainto_tsquery</function> will not |
| recognize <type>tsquery</type> operators, weight labels, |
| or prefix-match labels in its input: |
| |
| <screen> |
| SELECT plainto_tsquery('english', 'The Fat & Rats:C'); |
| plainto_tsquery |
| --------------------- |
| 'fat' & 'rat' & 'c' |
| </screen> |
| |
| Here, all the input punctuation was discarded. |
| </para> |
| |
| <indexterm> |
| <primary>phraseto_tsquery</primary> |
| </indexterm> |
| |
| <synopsis> |
| phraseto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> |
| </synopsis> |
| |
| <para> |
| <function>phraseto_tsquery</function> behaves much like |
| <function>plainto_tsquery</function>, except that it inserts |
| the <literal><-></literal> (FOLLOWED BY) operator between |
| surviving words instead of the <literal>&</literal> (AND) operator. |
| Also, stop words are not simply discarded, but are accounted for by |
| inserting <literal><<replaceable>N</replaceable>></literal> operators rather |
| than <literal><-></literal> operators. This function is useful |
| when searching for exact lexeme sequences, since the FOLLOWED BY |
| operators check lexeme order not just the presence of all the lexemes. |
| </para> |
| |
| <para> |
| Example: |
| |
| <screen> |
| SELECT phraseto_tsquery('english', 'The Fat Rats'); |
| phraseto_tsquery |
| ------------------ |
| 'fat' <-> 'rat' |
| </screen> |
| |
| Like <function>plainto_tsquery</function>, the |
| <function>phraseto_tsquery</function> function will not |
| recognize <type>tsquery</type> operators, weight labels, |
| or prefix-match labels in its input: |
| |
| <screen> |
| SELECT phraseto_tsquery('english', 'The Fat & Rats:C'); |
| phraseto_tsquery |
| ----------------------------- |
| 'fat' <-> 'rat' <-> 'c' |
| </screen> |
| </para> |
| |
| <synopsis> |
| websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type> |
| </synopsis> |
| |
| <para> |
| <function>websearch_to_tsquery</function> creates a <type>tsquery</type> |
| value from <replaceable>querytext</replaceable> using an alternative |
| syntax in which simple unformatted text is a valid query. |
| Unlike <function>plainto_tsquery</function> |
| and <function>phraseto_tsquery</function>, it also recognizes certain |
| operators. Moreover, this function will never raise syntax errors, |
| which makes it possible to use raw user-supplied input for search. |
| The following syntax is supported: |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <literal>unquoted text</literal>: text not inside quote marks will be |
| converted to terms separated by <literal>&</literal> operators, as |
| if processed by <function>plainto_tsquery</function>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>"quoted text"</literal>: text inside quote marks will be |
| converted to terms separated by <literal><-></literal> |
| operators, as if processed by <function>phraseto_tsquery</function>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>OR</literal>: the word <quote>or</quote> will be converted to |
| the <literal>|</literal> operator. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>-</literal>: a dash will be converted to |
| the <literal>!</literal> operator. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| Other punctuation is ignored. So |
| like <function>plainto_tsquery</function> |
| and <function>phraseto_tsquery</function>, |
| the <function>websearch_to_tsquery</function> function will not |
| recognize <type>tsquery</type> operators, weight labels, or prefix-match |
| labels in its input. |
| </para> |
| |
| <para> |
| Examples: |
| <screen> |
| SELECT websearch_to_tsquery('english', 'The fat rats'); |
| websearch_to_tsquery |
| ---------------------- |
| 'fat' & 'rat' |
| (1 row) |
| |
| SELECT websearch_to_tsquery('english', '"supernovae stars" -crab'); |
| websearch_to_tsquery |
| ---------------------------------- |
| 'supernova' <-> 'star' & !'crab' |
| (1 row) |
| |
| SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"'); |
| websearch_to_tsquery |
| ----------------------------------- |
| 'sad' <-> 'cat' | 'fat' <-> 'rat' |
| (1 row) |
| |
| SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"'); |
| websearch_to_tsquery |
| --------------------------------------- |
| 'signal' & !( 'segment' <-> 'fault' ) |
| (1 row) |
| |
| SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->'); |
| websearch_to_tsquery |
| ---------------------- |
| 'dummi' & 'queri' |
| (1 row) |
| </screen> |
| </para> |
| </sect2> |
| |
| <sect2 id="textsearch-ranking"> |
| <title>Ranking Search Results</title> |
| |
| <para> |
| Ranking attempts to measure how relevant documents are to a particular |
| query, so that when there are many matches the most relevant ones can be |
| shown first. <productname>PostgreSQL</productname> provides two |
| predefined ranking functions, which take into account lexical, proximity, |
| and structural information; that is, they consider how often the query |
| terms appear in the document, how close together the terms are in the |
| document, and how important is the part of the document where they occur. |
| However, the concept of relevancy is vague and very application-specific. |
| Different applications might require additional information for ranking, |
| e.g., document modification time. The built-in ranking functions are only |
| examples. You can write your own ranking functions and/or combine their |
| results with additional factors to fit your specific needs. |
| </para> |
| |
| <para> |
| The two ranking functions currently available are: |
| |
| <variablelist> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>ts_rank</primary> |
| </indexterm> |
| |
| <literal>ts_rank(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Ranks vectors based on the frequency of their matching lexemes. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>ts_rank_cd</primary> |
| </indexterm> |
| |
| <literal>ts_rank_cd(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| This function computes the <firstterm>cover density</firstterm> |
| ranking for the given document vector and query, as described in |
| Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three |
| Term Queries" in the journal "Information Processing and Management", |
| 1999. Cover density is similar to <function>ts_rank</function> ranking |
| except that the proximity of matching lexemes to each other is |
| taken into consideration. |
| </para> |
| |
| <para> |
| This function requires lexeme positional information to perform |
| its calculation. Therefore, it ignores any <quote>stripped</quote> |
| lexemes in the <type>tsvector</type>. If there are no unstripped |
| lexemes in the input, the result will be zero. (See <xref |
| linkend="textsearch-manipulate-tsvector"/> for more information |
| about the <function>strip</function> function and positional information |
| in <type>tsvector</type>s.) |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| </para> |
| |
| <para> |
| For both these functions, |
| the optional <replaceable class="parameter">weights</replaceable> |
| argument offers the ability to weigh word instances more or less |
| heavily depending on how they are labeled. The weight arrays specify |
| how heavily to weigh each category of word, in the order: |
| |
| <synopsis> |
| {D-weight, C-weight, B-weight, A-weight} |
| </synopsis> |
| |
| If no <replaceable class="parameter">weights</replaceable> are provided, |
| then these defaults are used: |
| |
| <programlisting> |
| {0.1, 0.2, 0.4, 1.0} |
| </programlisting> |
| |
| Typically weights are used to mark words from special areas of the |
| document, like the title or an initial abstract, so they can be |
| treated with more or less importance than words in the document body. |
| </para> |
| |
| <para> |
| Since a longer document has a greater chance of containing a query term |
| it is reasonable to take into account document size, e.g., a hundred-word |
| document with five instances of a search word is probably more relevant |
| than a thousand-word document with five instances. Both ranking functions |
| take an integer <replaceable>normalization</replaceable> option that |
| specifies whether and how a document's length should impact its rank. |
| The integer option controls several behaviors, so it is a bit mask: |
| you can specify one or more behaviors using |
| <literal>|</literal> (for example, <literal>2|4</literal>). |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| 0 (the default) ignores the document length |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 1 divides the rank by 1 + the logarithm of the document length |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 2 divides the rank by the document length |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 4 divides the rank by the mean harmonic distance between extents |
| (this is implemented only by <function>ts_rank_cd</function>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 8 divides the rank by the number of unique words in document |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 16 divides the rank by 1 + the logarithm of the number |
| of unique words in document |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| 32 divides the rank by itself + 1 |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| If more than one flag bit is specified, the transformations are |
| applied in the order listed. |
| </para> |
| |
| <para> |
| It is important to note that the ranking functions do not use any global |
| information, so it is impossible to produce a fair normalization to 1% or |
| 100% as sometimes desired. Normalization option 32 |
| (<literal>rank/(rank+1)</literal>) can be applied to scale all ranks |
| into the range zero to one, but of course this is just a cosmetic change; |
| it will not affect the ordering of the search results. |
| </para> |
| |
| <para> |
| Here is an example that selects only the ten highest-ranked matches: |
| |
| <screen> |
| SELECT title, ts_rank_cd(textsearch, query) AS rank |
| FROM apod, to_tsquery('neutrino|(dark & matter)') query |
| WHERE query @@ textsearch |
| ORDER BY rank DESC |
| LIMIT 10; |
| title | rank |
| -----------------------------------------------+---------- |
| Neutrinos in the Sun | 3.1 |
| The Sudbury Neutrino Detector | 2.4 |
| A MACHO View of Galactic Dark Matter | 2.01317 |
| Hot Gas and Dark Matter | 1.91171 |
| The Virgo Cluster: Hot Plasma and Dark Matter | 1.90953 |
| Rafting for Solar Neutrinos | 1.9 |
| NGC 4650A: Strange Galaxy and Dark Matter | 1.85774 |
| Hot Gas and Dark Matter | 1.6123 |
| Ice Fishing for Cosmic Neutrinos | 1.6 |
| Weak Lensing Distorts the Universe | 0.818218 |
| </screen> |
| |
| This is the same example using normalized ranking: |
| |
| <screen> |
| SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank |
| FROM apod, to_tsquery('neutrino|(dark & matter)') query |
| WHERE query @@ textsearch |
| ORDER BY rank DESC |
| LIMIT 10; |
| title | rank |
| -----------------------------------------------+------------------- |
| Neutrinos in the Sun | 0.756097569485493 |
| The Sudbury Neutrino Detector | 0.705882361190954 |
| A MACHO View of Galactic Dark Matter | 0.668123210574724 |
| Hot Gas and Dark Matter | 0.65655958650282 |
| The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973 |
| Rafting for Solar Neutrinos | 0.655172410958162 |
| NGC 4650A: Strange Galaxy and Dark Matter | 0.650072921219637 |
| Hot Gas and Dark Matter | 0.617195790024749 |
| Ice Fishing for Cosmic Neutrinos | 0.615384618911517 |
| Weak Lensing Distorts the Universe | 0.450010798361481 |
| </screen> |
| </para> |
| |
| <para> |
| Ranking can be expensive since it requires consulting the |
| <type>tsvector</type> of each matching document, which can be I/O bound and |
| therefore slow. Unfortunately, it is almost impossible to avoid since |
| practical queries often result in large numbers of matches. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-headline"> |
| <title>Highlighting Results</title> |
| |
| <para> |
| To present search results it is ideal to show a part of each document and |
| how it is related to the query. Usually, search engines show fragments of |
| the document with marked search terms. <productname>PostgreSQL</productname> |
| provides a function <function>ts_headline</function> that |
| implements this functionality. |
| </para> |
| |
| <indexterm> |
| <primary>ts_headline</primary> |
| </indexterm> |
| |
| <synopsis> |
| ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">options</replaceable> <type>text</type> </optional>) returns <type>text</type> |
| </synopsis> |
| |
| <para> |
| <function>ts_headline</function> accepts a document along |
| with a query, and returns an excerpt from |
| the document in which terms from the query are highlighted. The |
| configuration to be used to parse the document can be specified by |
| <replaceable>config</replaceable>; if <replaceable>config</replaceable> |
| is omitted, the |
| <varname>default_text_search_config</varname> configuration is used. |
| </para> |
| |
| <para> |
| If an <replaceable>options</replaceable> string is specified it must |
| consist of a comma-separated list of one or more |
| <replaceable>option</replaceable><literal>=</literal><replaceable>value</replaceable> pairs. |
| The available options are: |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <literal>MaxWords</literal>, <literal>MinWords</literal> (integers): |
| these numbers determine the longest and shortest headlines to output. |
| The default values are 35 and 15. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>ShortWord</literal> (integer): words of this length or less |
| will be dropped at the start and end of a headline, unless they are |
| query terms. The default value of three eliminates common English |
| articles. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>HighlightAll</literal> (boolean): if |
| <literal>true</literal> the whole document will be used as the |
| headline, ignoring the preceding three parameters. The default |
| is <literal>false</literal>. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>MaxFragments</literal> (integer): maximum number of text |
| fragments to display. The default value of zero selects a |
| non-fragment-based headline generation method. A value greater |
| than zero selects fragment-based headline generation (see below). |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>StartSel</literal>, <literal>StopSel</literal> (strings): |
| the strings with which to delimit query words appearing in the |
| document, to distinguish them from other excerpted words. The |
| default values are <quote><literal><b></literal></quote> and |
| <quote><literal></b></literal></quote>, which can be suitable |
| for HTML output. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>FragmentDelimiter</literal> (string): When more than one |
| fragment is displayed, the fragments will be separated by this string. |
| The default is <quote><literal> ... </literal></quote>. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| These option names are recognized case-insensitively. |
| You must double-quote string values if they contain spaces or commas. |
| </para> |
| |
| <para> |
| In non-fragment-based headline |
| generation, <function>ts_headline</function> locates matches for the |
| given <replaceable class="parameter">query</replaceable> and chooses a |
| single one to display, preferring matches that have more query words |
| within the allowed headline length. |
| In fragment-based headline generation, <function>ts_headline</function> |
| locates the query matches and splits each match |
| into <quote>fragments</quote> of no more than <literal>MaxWords</literal> |
| words each, preferring fragments with more query words, and when |
| possible <quote>stretching</quote> fragments to include surrounding |
| words. The fragment-based mode is thus more useful when the query |
| matches span large sections of the document, or when it's desirable to |
| display multiple matches. |
| In either mode, if no query matches can be identified, then a single |
| fragment of the first <literal>MinWords</literal> words in the document |
| will be displayed. |
| </para> |
| |
| <para> |
| For example: |
| |
| <screen> |
| SELECT ts_headline('english', |
| 'The most common type of search |
| is to find all documents containing given query terms |
| and return them in order of their similarity to the |
| query.', |
| to_tsquery('english', 'query & similarity')); |
| ts_headline |
| ------------------------------------------------------------ |
| containing given <b>query</b> terms + |
| and return them in order of their <b>similarity</b> to the+ |
| <b>query</b>. |
| |
| SELECT ts_headline('english', |
| 'Search terms may occur |
| many times in a document, |
| requiring ranking of the search matches to decide which |
| occurrences to display in the result.', |
| to_tsquery('english', 'search & term'), |
| 'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>'); |
| ts_headline |
| ------------------------------------------------------------ |
| <<Search>> <<terms>> may occur + |
| many times ... ranking of the <<search>> matches to decide |
| </screen> |
| </para> |
| |
| <para> |
| <function>ts_headline</function> uses the original document, not a |
| <type>tsvector</type> summary, so it can be slow and should be used with |
| care. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-features"> |
| <title>Additional Features</title> |
| |
| <para> |
| This section describes additional functions and operators that are |
| useful in connection with text search. |
| </para> |
| |
| <sect2 id="textsearch-manipulate-tsvector"> |
| <title>Manipulating Documents</title> |
| |
| <para> |
| <xref linkend="textsearch-parsing-documents"/> showed how raw textual |
| documents can be converted into <type>tsvector</type> values. |
| <productname>PostgreSQL</productname> also provides functions and |
| operators that can be used to manipulate documents that are already |
| in <type>tsvector</type> form. |
| </para> |
| |
| <variablelist> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>tsvector concatenation</primary> |
| </indexterm> |
| |
| <literal><type>tsvector</type> || <type>tsvector</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| The <type>tsvector</type> concatenation operator |
| returns a vector which combines the lexemes and positional information |
| of the two vectors given as arguments. Positions and weight labels |
| are retained during the concatenation. |
| Positions appearing in the right-hand vector are offset by the largest |
| position mentioned in the left-hand vector, so that the result is |
| nearly equivalent to the result of performing <function>to_tsvector</function> |
| on the concatenation of the two original document strings. (The |
| equivalence is not exact, because any stop-words removed from the |
| end of the left-hand argument will not affect the result, whereas |
| they would have affected the positions of the lexemes in the |
| right-hand argument if textual concatenation were used.) |
| </para> |
| |
| <para> |
| One advantage of using concatenation in the vector form, rather than |
| concatenating text before applying <function>to_tsvector</function>, is that |
| you can use different configurations to parse different sections |
| of the document. Also, because the <function>setweight</function> function |
| marks all lexemes of the given vector the same way, it is necessary |
| to parse the text and do <function>setweight</function> before concatenating |
| if you want to label different parts of the document with different |
| weights. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>setweight</primary> |
| </indexterm> |
| |
| <literal>setweight(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">weight</replaceable> <type>"char"</type>) returns <type>tsvector</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| <function>setweight</function> returns a copy of the input vector in which every |
| position has been labeled with the given <replaceable>weight</replaceable>, either |
| <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or |
| <literal>D</literal>. (<literal>D</literal> is the default for new |
| vectors and as such is not displayed on output.) These labels are |
| retained when vectors are concatenated, allowing words from different |
| parts of a document to be weighted differently by ranking functions. |
| </para> |
| |
| <para> |
| Note that weight labels apply to <emphasis>positions</emphasis>, not |
| <emphasis>lexemes</emphasis>. If the input vector has been stripped of |
| positions then <function>setweight</function> does nothing. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term> |
| <indexterm> |
| <primary>length(tsvector)</primary> |
| </indexterm> |
| |
| <literal>length(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>integer</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the number of lexemes stored in the vector. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>strip</primary> |
| </indexterm> |
| |
| <literal>strip(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>tsvector</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns a vector that lists the same lexemes as the given vector, but |
| lacks any position or weight information. The result is usually much |
| smaller than an unstripped vector, but it is also less useful. |
| Relevance ranking does not work as well on stripped vectors as |
| unstripped ones. Also, |
| the <literal><-></literal> (FOLLOWED BY) <type>tsquery</type> operator |
| will never match stripped input, since it cannot determine the |
| distance between lexeme occurrences. |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| A full list of <type>tsvector</type>-related functions is available |
| in <xref linkend="textsearch-functions-table"/>. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-manipulate-tsquery"> |
| <title>Manipulating Queries</title> |
| |
| <para> |
| <xref linkend="textsearch-parsing-queries"/> showed how raw textual |
| queries can be converted into <type>tsquery</type> values. |
| <productname>PostgreSQL</productname> also provides functions and |
| operators that can be used to manipulate queries that are already |
| in <type>tsquery</type> form. |
| </para> |
| |
| <variablelist> |
| |
| <varlistentry> |
| |
| <term> |
| <literal><type>tsquery</type> && <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the AND-combination of the two given queries. |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <literal><type>tsquery</type> || <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the OR-combination of the two given queries. |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <literal>!! <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the negation (NOT) of the given query. |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <literal><type>tsquery</type> <-> <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns a query that searches for a match to the first given query |
| immediately followed by a match to the second given query, using |
| the <literal><-></literal> (FOLLOWED BY) |
| <type>tsquery</type> operator. For example: |
| |
| <screen> |
| SELECT to_tsquery('fat') <-> to_tsquery('cat | rat'); |
| ?column? |
| ---------------------------- |
| 'fat' <-> ( 'cat' | 'rat' ) |
| </screen> |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>tsquery_phrase</primary> |
| </indexterm> |
| |
| <literal>tsquery_phrase(<replaceable class="parameter">query1</replaceable> <type>tsquery</type>, <replaceable class="parameter">query2</replaceable> <type>tsquery</type> [, <replaceable class="parameter">distance</replaceable> <type>integer</type> ]) returns <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns a query that searches for a match to the first given query |
| followed by a match to the second given query at a distance of exactly |
| <replaceable>distance</replaceable> lexemes, using |
| the <literal><<replaceable>N</replaceable>></literal> |
| <type>tsquery</type> operator. For example: |
| |
| <screen> |
| SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10); |
| tsquery_phrase |
| ------------------ |
| 'fat' <10> 'cat' |
| </screen> |
| </para> |
| </listitem> |
| |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>numnode</primary> |
| </indexterm> |
| |
| <literal>numnode(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>integer</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the number of nodes (lexemes plus operators) in a |
| <type>tsquery</type>. This function is useful |
| to determine if the <replaceable>query</replaceable> is meaningful |
| (returns > 0), or contains only stop words (returns 0). |
| Examples: |
| |
| <screen> |
| SELECT numnode(plainto_tsquery('the any')); |
| NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored |
| numnode |
| --------- |
| 0 |
| |
| SELECT numnode('foo & bar'::tsquery); |
| numnode |
| --------- |
| 3 |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm> |
| <primary>querytree</primary> |
| </indexterm> |
| |
| <literal>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>text</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Returns the portion of a <type>tsquery</type> that can be used for |
| searching an index. This function is useful for detecting |
| unindexable queries, for example those containing only stop words |
| or only negated terms. For example: |
| |
| <screen> |
| SELECT querytree(to_tsquery('defined')); |
| querytree |
| ----------- |
| 'defin' |
| |
| SELECT querytree(to_tsquery('!defined')); |
| querytree |
| ----------- |
| T |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <sect3 id="textsearch-query-rewriting"> |
| <title>Query Rewriting</title> |
| |
| <indexterm zone="textsearch-query-rewriting"> |
| <primary>ts_rewrite</primary> |
| </indexterm> |
| |
| <para> |
| The <function>ts_rewrite</function> family of functions search a |
| given <type>tsquery</type> for occurrences of a target |
| subquery, and replace each occurrence with a |
| substitute subquery. In essence this operation is a |
| <type>tsquery</type>-specific version of substring replacement. |
| A target and substitute combination can be |
| thought of as a <firstterm>query rewrite rule</firstterm>. A collection |
| of such rewrite rules can be a powerful search aid. |
| For example, you can expand the search using synonyms |
| (e.g., <literal>new york</literal>, <literal>big apple</literal>, <literal>nyc</literal>, |
| <literal>gotham</literal>) or narrow the search to direct the user to some hot |
| topic. There is some overlap in functionality between this feature |
| and thesaurus dictionaries (<xref linkend="textsearch-thesaurus"/>). |
| However, you can modify a set of rewrite rules on-the-fly without |
| reindexing, whereas updating a thesaurus requires reindexing to be |
| effective. |
| </para> |
| |
| <variablelist> |
| |
| <varlistentry> |
| |
| <term> |
| <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">target</replaceable> <type>tsquery</type>, <replaceable class="parameter">substitute</replaceable> <type>tsquery</type>) returns <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| This form of <function>ts_rewrite</function> simply applies a single |
| rewrite rule: <replaceable class="parameter">target</replaceable> |
| is replaced by <replaceable class="parameter">substitute</replaceable> |
| wherever it appears in <replaceable |
| class="parameter">query</replaceable>. For example: |
| |
| <screen> |
| SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery); |
| ts_rewrite |
| ------------ |
| 'b' & 'c' |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">select</replaceable> <type>text</type>) returns <type>tsquery</type></literal> |
| </term> |
| |
| <listitem> |
| <para> |
| This form of <function>ts_rewrite</function> accepts a starting |
| <replaceable>query</replaceable> and an SQL <replaceable>select</replaceable> command, which |
| is given as a text string. The <replaceable>select</replaceable> must yield two |
| columns of <type>tsquery</type> type. For each row of the |
| <replaceable>select</replaceable> result, occurrences of the first column value |
| (the target) are replaced by the second column value (the substitute) |
| within the current <replaceable>query</replaceable> value. For example: |
| |
| <screen> |
| CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery); |
| INSERT INTO aliases VALUES('a', 'c'); |
| |
| SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases'); |
| ts_rewrite |
| ------------ |
| 'b' & 'c' |
| </screen> |
| </para> |
| |
| <para> |
| Note that when multiple rewrite rules are applied in this way, |
| the order of application can be important; so in practice you will |
| want the source query to <literal>ORDER BY</literal> some ordering key. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| |
| <para> |
| Let's consider a real-life astronomical example. We'll expand query |
| <literal>supernovae</literal> using table-driven rewriting rules: |
| |
| <screen> |
| CREATE TABLE aliases (t tsquery primary key, s tsquery); |
| INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn')); |
| |
| SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); |
| ts_rewrite |
| --------------------------------- |
| 'crab' & ( 'supernova' | 'sn' ) |
| </screen> |
| |
| We can change the rewriting rules just by updating the table: |
| |
| <screen> |
| UPDATE aliases |
| SET s = to_tsquery('supernovae|sn & !nebulae') |
| WHERE t = to_tsquery('supernovae'); |
| |
| SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases'); |
| ts_rewrite |
| --------------------------------------------- |
| 'crab' & ( 'supernova' | 'sn' & !'nebula' ) |
| </screen> |
| </para> |
| |
| <para> |
| Rewriting can be slow when there are many rewriting rules, since it |
| checks every rule for a possible match. To filter out obvious non-candidate |
| rules we can use the containment operators for the <type>tsquery</type> |
| type. In the example below, we select only those rules which might match |
| the original query: |
| |
| <screen> |
| SELECT ts_rewrite('a & b'::tsquery, |
| 'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t'); |
| ts_rewrite |
| ------------ |
| 'b' & 'c' |
| </screen> |
| </para> |
| |
| </sect3> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-update-triggers"> |
| <title>Triggers for Automatic Updates</title> |
| |
| <indexterm> |
| <primary>trigger</primary> |
| <secondary>for updating a derived tsvector column</secondary> |
| </indexterm> |
| |
| <note> |
| <para> |
| The method described in this section has been obsoleted by the use of |
| stored generated columns, as described in <xref |
| linkend="textsearch-tables-index"/>. |
| </para> |
| </note> |
| |
| <para> |
| When using a separate column to store the <type>tsvector</type> representation |
| of your documents, it is necessary to create a trigger to update the |
| <type>tsvector</type> column when the document content columns change. |
| Two built-in trigger functions are available for this, or you can write |
| your own. |
| </para> |
| |
| <synopsis> |
| tsvector_update_trigger(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>) |
| tsvector_update_trigger_column(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_column_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>) |
| </synopsis> |
| |
| <para> |
| These trigger functions automatically compute a <type>tsvector</type> |
| column from one or more textual columns, under the control of |
| parameters specified in the <command>CREATE TRIGGER</command> command. |
| An example of their use is: |
| |
| <screen> |
| CREATE TABLE messages ( |
| title text, |
| body text, |
| tsv tsvector |
| ); |
| |
| CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE |
| ON messages FOR EACH ROW EXECUTE FUNCTION |
| tsvector_update_trigger(tsv, 'pg_catalog.english', title, body); |
| |
| INSERT INTO messages VALUES('title here', 'the body text is here'); |
| |
| SELECT * FROM messages; |
| title | body | tsv |
| ------------+-----------------------+---------------------------- |
| title here | the body text is here | 'bodi':4 'text':5 'titl':1 |
| |
| SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body'); |
| title | body |
| ------------+----------------------- |
| title here | the body text is here |
| </screen> |
| |
| Having created this trigger, any change in <structfield>title</structfield> or |
| <structfield>body</structfield> will automatically be reflected into |
| <structfield>tsv</structfield>, without the application having to worry about it. |
| </para> |
| |
| <para> |
| The first trigger argument must be the name of the <type>tsvector</type> |
| column to be updated. The second argument specifies the text search |
| configuration to be used to perform the conversion. For |
| <function>tsvector_update_trigger</function>, the configuration name is simply |
| given as the second trigger argument. It must be schema-qualified as |
| shown above, so that the trigger behavior will not change with changes |
| in <varname>search_path</varname>. For |
| <function>tsvector_update_trigger_column</function>, the second trigger argument |
| is the name of another table column, which must be of type |
| <type>regconfig</type>. This allows a per-row selection of configuration |
| to be made. The remaining argument(s) are the names of textual columns |
| (of type <type>text</type>, <type>varchar</type>, or <type>char</type>). These |
| will be included in the document in the order given. NULL values will |
| be skipped (but the other columns will still be indexed). |
| </para> |
| |
| <para> |
| A limitation of these built-in triggers is that they treat all the |
| input columns alike. To process columns differently — for |
| example, to weight title differently from body — it is necessary |
| to write a custom trigger. Here is an example using |
| <application>PL/pgSQL</application> as the trigger language: |
| |
| <programlisting> |
| CREATE FUNCTION messages_trigger() RETURNS trigger AS $$ |
| begin |
| new.tsv := |
| setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') || |
| setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D'); |
| return new; |
| end |
| $$ LANGUAGE plpgsql; |
| |
| CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE |
| ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger(); |
| </programlisting> |
| </para> |
| |
| <para> |
| Keep in mind that it is important to specify the configuration name |
| explicitly when creating <type>tsvector</type> values inside triggers, |
| so that the column's contents will not be affected by changes to |
| <varname>default_text_search_config</varname>. Failure to do this is likely to |
| lead to problems such as search results changing after a dump and restore. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-statistics"> |
| <title>Gathering Document Statistics</title> |
| |
| <indexterm> |
| <primary>ts_stat</primary> |
| </indexterm> |
| |
| <para> |
| The function <function>ts_stat</function> is useful for checking your |
| configuration and for finding stop-word candidates. |
| </para> |
| |
| <synopsis> |
| ts_stat(<replaceable class="parameter">sqlquery</replaceable> <type>text</type>, <optional> <replaceable class="parameter">weights</replaceable> <type>text</type>, </optional> |
| OUT <replaceable class="parameter">word</replaceable> <type>text</type>, OUT <replaceable class="parameter">ndoc</replaceable> <type>integer</type>, |
| OUT <replaceable class="parameter">nentry</replaceable> <type>integer</type>) returns <type>setof record</type> |
| </synopsis> |
| |
| <para> |
| <replaceable>sqlquery</replaceable> is a text value containing an SQL |
| query which must return a single <type>tsvector</type> column. |
| <function>ts_stat</function> executes the query and returns statistics about |
| each distinct lexeme (word) contained in the <type>tsvector</type> |
| data. The columns returned are |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <replaceable>word</replaceable> <type>text</type> — the value of a lexeme |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>ndoc</replaceable> <type>integer</type> — number of documents |
| (<type>tsvector</type>s) the word occurred in |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>nentry</replaceable> <type>integer</type> — total number of |
| occurrences of the word |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| If <replaceable>weights</replaceable> is supplied, only occurrences |
| having one of those weights are counted. |
| </para> |
| |
| <para> |
| For example, to find the ten most frequent words in a document collection: |
| |
| <programlisting> |
| SELECT * FROM ts_stat('SELECT vector FROM apod') |
| ORDER BY nentry DESC, ndoc DESC, word |
| LIMIT 10; |
| </programlisting> |
| |
| The same, but counting only word occurrences with weight <literal>A</literal> |
| or <literal>B</literal>: |
| |
| <programlisting> |
| SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab') |
| ORDER BY nentry DESC, ndoc DESC, word |
| LIMIT 10; |
| </programlisting> |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-parsers"> |
| <title>Parsers</title> |
| |
| <para> |
| Text search parsers are responsible for splitting raw document text |
| into <firstterm>tokens</firstterm> and identifying each token's type, where |
| the set of possible types is defined by the parser itself. |
| Note that a parser does not modify the text at all — it simply |
| identifies plausible word boundaries. Because of this limited scope, |
| there is less need for application-specific custom parsers than there is |
| for custom dictionaries. At present <productname>PostgreSQL</productname> |
| provides just one built-in parser, which has been found to be useful for a |
| wide range of applications. |
| </para> |
| |
| <para> |
| The built-in parser is named <literal>pg_catalog.default</literal>. |
| It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>. |
| </para> |
| |
| <table id="textsearch-default-parser"> |
| <title>Default Parser's Token Types</title> |
| <tgroup cols="3"> |
| <colspec colname="col1" colwidth="2*"/> |
| <colspec colname="col2" colwidth="2*"/> |
| <colspec colname="col3" colwidth="3*"/> |
| <thead> |
| <row> |
| <entry>Alias</entry> |
| <entry>Description</entry> |
| <entry>Example</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry><literal>asciiword</literal></entry> |
| <entry>Word, all ASCII letters</entry> |
| <entry><literal>elephant</literal></entry> |
| </row> |
| <row> |
| <entry><literal>word</literal></entry> |
| <entry>Word, all letters</entry> |
| <entry><literal>mañana</literal></entry> |
| </row> |
| <row> |
| <entry><literal>numword</literal></entry> |
| <entry>Word, letters and digits</entry> |
| <entry><literal>beta1</literal></entry> |
| </row> |
| <row> |
| <entry><literal>asciihword</literal></entry> |
| <entry>Hyphenated word, all ASCII</entry> |
| <entry><literal>up-to-date</literal></entry> |
| </row> |
| <row> |
| <entry><literal>hword</literal></entry> |
| <entry>Hyphenated word, all letters</entry> |
| <entry><literal>lógico-matemática</literal></entry> |
| </row> |
| <row> |
| <entry><literal>numhword</literal></entry> |
| <entry>Hyphenated word, letters and digits</entry> |
| <entry><literal>postgresql-beta1</literal></entry> |
| </row> |
| <row> |
| <entry><literal>hword_asciipart</literal></entry> |
| <entry>Hyphenated word part, all ASCII</entry> |
| <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry> |
| </row> |
| <row> |
| <entry><literal>hword_part</literal></entry> |
| <entry>Hyphenated word part, all letters</entry> |
| <entry><literal>lógico</literal> or <literal>matemática</literal> |
| in the context <literal>lógico-matemática</literal></entry> |
| </row> |
| <row> |
| <entry><literal>hword_numpart</literal></entry> |
| <entry>Hyphenated word part, letters and digits</entry> |
| <entry><literal>beta1</literal> in the context |
| <literal>postgresql-beta1</literal></entry> |
| </row> |
| <row> |
| <entry><literal>email</literal></entry> |
| <entry>Email address</entry> |
| <entry><literal>foo@example.com</literal></entry> |
| </row> |
| <row> |
| <entry><literal>protocol</literal></entry> |
| <entry>Protocol head</entry> |
| <entry><literal>http://</literal></entry> |
| </row> |
| <row> |
| <entry><literal>url</literal></entry> |
| <entry>URL</entry> |
| <entry><literal>example.com/stuff/index.html</literal></entry> |
| </row> |
| <row> |
| <entry><literal>host</literal></entry> |
| <entry>Host</entry> |
| <entry><literal>example.com</literal></entry> |
| </row> |
| <row> |
| <entry><literal>url_path</literal></entry> |
| <entry>URL path</entry> |
| <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry> |
| </row> |
| <row> |
| <entry><literal>file</literal></entry> |
| <entry>File or path name</entry> |
| <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry> |
| </row> |
| <row> |
| <entry><literal>sfloat</literal></entry> |
| <entry>Scientific notation</entry> |
| <entry><literal>-1.234e56</literal></entry> |
| </row> |
| <row> |
| <entry><literal>float</literal></entry> |
| <entry>Decimal notation</entry> |
| <entry><literal>-1.234</literal></entry> |
| </row> |
| <row> |
| <entry><literal>int</literal></entry> |
| <entry>Signed integer</entry> |
| <entry><literal>-1234</literal></entry> |
| </row> |
| <row> |
| <entry><literal>uint</literal></entry> |
| <entry>Unsigned integer</entry> |
| <entry><literal>1234</literal></entry> |
| </row> |
| <row> |
| <entry><literal>version</literal></entry> |
| <entry>Version number</entry> |
| <entry><literal>8.3.0</literal></entry> |
| </row> |
| <row> |
| <entry><literal>tag</literal></entry> |
| <entry>XML tag</entry> |
| <entry><literal><a href="dictionaries.html"></literal></entry> |
| </row> |
| <row> |
| <entry><literal>entity</literal></entry> |
| <entry>XML entity</entry> |
| <entry><literal>&amp;</literal></entry> |
| </row> |
| <row> |
| <entry><literal>blank</literal></entry> |
| <entry>Space symbols</entry> |
| <entry>(any whitespace or punctuation not otherwise recognized)</entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <note> |
| <para> |
| The parser's notion of a <quote>letter</quote> is determined by the database's |
| locale setting, specifically <varname>lc_ctype</varname>. Words containing |
| only the basic ASCII letters are reported as a separate token type, |
| since it is sometimes useful to distinguish them. In most European |
| languages, token types <literal>word</literal> and <literal>asciiword</literal> |
| should be treated alike. |
| </para> |
| |
| <para> |
| <literal>email</literal> does not support all valid email characters as |
| defined by <ulink url="https://tools.ietf.org/html/rfc5322">RFC 5322</ulink>. |
| Specifically, the only non-alphanumeric characters supported for |
| email user names are period, dash, and underscore. |
| </para> |
| </note> |
| |
| <para> |
| It is possible for the parser to produce overlapping tokens from the same |
| piece of text. As an example, a hyphenated word will be reported both |
| as the entire word and as each component: |
| |
| <screen> |
| SELECT alias, description, token FROM ts_debug('foo-bar-beta1'); |
| alias | description | token |
| -----------------+------------------------------------------+--------------- |
| numhword | Hyphenated word, letters and digits | foo-bar-beta1 |
| hword_asciipart | Hyphenated word part, all ASCII | foo |
| blank | Space symbols | - |
| hword_asciipart | Hyphenated word part, all ASCII | bar |
| blank | Space symbols | - |
| hword_numpart | Hyphenated word part, letters and digits | beta1 |
| </screen> |
| |
| This behavior is desirable since it allows searches to work for both |
| the whole compound word and for components. Here is another |
| instructive example: |
| |
| <screen> |
| SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html'); |
| alias | description | token |
| ----------+---------------+------------------------------ |
| protocol | Protocol head | http:// |
| url | URL | example.com/stuff/index.html |
| host | Host | example.com |
| url_path | URL path | /stuff/index.html |
| </screen> |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-dictionaries"> |
| <title>Dictionaries</title> |
| |
| <para> |
| Dictionaries are used to eliminate words that should not be considered in a |
| search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so |
| that different derived forms of the same word will match. A successfully |
| normalized word is called a <firstterm>lexeme</firstterm>. Aside from |
| improving search quality, normalization and removal of stop words reduce the |
| size of the <type>tsvector</type> representation of a document, thereby |
| improving performance. Normalization does not always have linguistic meaning |
| and usually depends on application semantics. |
| </para> |
| |
| <para> |
| Some examples of normalization: |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| |
| <listitem> |
| <para> |
| Linguistic — Ispell dictionaries try to reduce input words to a |
| normalized form; stemmer dictionaries remove word endings |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <acronym>URL</acronym> locations can be canonicalized to make |
| equivalent URLs match: |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| http://www.pgsql.ru/db/mw/index.html |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| http://www.pgsql.ru/db/mw/ |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| http://www.pgsql.ru/db/../db/mw/index.html |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| Color names can be replaced by their hexadecimal values, e.g., |
| <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| If indexing numbers, we can |
| remove some fractional digits to reduce the range of possible |
| numbers, so for example <emphasis>3.14</emphasis>159265359, |
| <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same |
| after normalization if only two digits are kept after the decimal point. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| </para> |
| |
| <para> |
| A dictionary is a program that accepts a token as |
| input and returns: |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| an array of lexemes if the input token is known to the dictionary |
| (notice that one token can produce more than one lexeme) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace |
| the original token with a new token to be passed to subsequent |
| dictionaries (a dictionary that does this is called a |
| <firstterm>filtering dictionary</firstterm>) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| an empty array if the dictionary knows the token, but it is a stop word |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>NULL</literal> if the dictionary does not recognize the input token |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| <productname>PostgreSQL</productname> provides predefined dictionaries for |
| many languages. There are also several predefined templates that can be |
| used to create new dictionaries with custom parameters. Each predefined |
| dictionary template is described below. If no existing |
| template is suitable, it is possible to create new ones; see the |
| <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution |
| for examples. |
| </para> |
| |
| <para> |
| A text search configuration binds a parser together with a set of |
| dictionaries to process the parser's output tokens. For each token |
| type that the parser can return, a separate list of dictionaries is |
| specified by the configuration. When a token of that type is found |
| by the parser, each dictionary in the list is consulted in turn, |
| until some dictionary recognizes it as a known word. If it is identified |
| as a stop word, or if no dictionary recognizes the token, it will be |
| discarded and not indexed or searched for. |
| Normally, the first dictionary that returns a non-<literal>NULL</literal> |
| output determines the result, and any remaining dictionaries are not |
| consulted; but a filtering dictionary can replace the given word |
| with a modified word, which is then passed to subsequent dictionaries. |
| </para> |
| |
| <para> |
| The general rule for configuring a list of dictionaries |
| is to place first the most narrow, most specific dictionary, then the more |
| general dictionaries, finishing with a very general dictionary, like |
| a <application>Snowball</application> stemmer or <literal>simple</literal>, which |
| recognizes everything. For example, for an astronomy-specific search |
| (<literal>astro_en</literal> configuration) one could bind token type |
| <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical |
| terms, a general English dictionary and a <application>Snowball</application> English |
| stemmer: |
| |
| <programlisting> |
| ALTER TEXT SEARCH CONFIGURATION astro_en |
| ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem; |
| </programlisting> |
| </para> |
| |
| <para> |
| A filtering dictionary can be placed anywhere in the list, except at the |
| end where it'd be useless. Filtering dictionaries are useful to partially |
| normalize words to simplify the task of later dictionaries. For example, |
| a filtering dictionary could be used to remove accents from accented |
| letters, as is done by the <xref linkend="unaccent"/> module. |
| </para> |
| |
| <sect2 id="textsearch-stopwords"> |
| <title>Stop Words</title> |
| |
| <para> |
| Stop words are words that are very common, appear in almost every |
| document, and have no discrimination value. Therefore, they can be ignored |
| in the context of full text searching. For example, every English text |
| contains words like <literal>a</literal> and <literal>the</literal>, so it is |
| useless to store them in an index. However, stop words do affect the |
| positions in <type>tsvector</type>, which in turn affect ranking: |
| |
| <screen> |
| SELECT to_tsvector('english', 'in the list of stop words'); |
| to_tsvector |
| ---------------------------- |
| 'list':3 'stop':5 'word':6 |
| </screen> |
| |
| The missing positions 1,2,4 are because of stop words. Ranks |
| calculated for documents with and without stop words are quite different: |
| |
| <screen> |
| SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop')); |
| ts_rank_cd |
| ------------ |
| 0.05 |
| |
| SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop')); |
| ts_rank_cd |
| ------------ |
| 0.1 |
| </screen> |
| |
| </para> |
| |
| <para> |
| It is up to the specific dictionary how it treats stop words. For example, |
| <literal>ispell</literal> dictionaries first normalize words and then |
| look at the list of stop words, while <literal>Snowball</literal> stemmers |
| first check the list of stop words. The reason for the different |
| behavior is an attempt to decrease noise. |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-simple-dictionary"> |
| <title>Simple Dictionary</title> |
| |
| <para> |
| The <literal>simple</literal> dictionary template operates by converting the |
| input token to lower case and checking it against a file of stop words. |
| If it is found in the file then an empty array is returned, causing |
| the token to be discarded. If not, the lower-cased form of the word |
| is returned as the normalized lexeme. Alternatively, the dictionary |
| can be configured to report non-stop-words as unrecognized, allowing |
| them to be passed on to the next dictionary in the list. |
| </para> |
| |
| <para> |
| Here is an example of a dictionary definition using the <literal>simple</literal> |
| template: |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY public.simple_dict ( |
| TEMPLATE = pg_catalog.simple, |
| STOPWORDS = english |
| ); |
| </programlisting> |
| |
| Here, <literal>english</literal> is the base name of a file of stop words. |
| The file's full name will be |
| <filename>$SHAREDIR/tsearch_data/english.stop</filename>, |
| where <literal>$SHAREDIR</literal> means the |
| <productname>PostgreSQL</productname> installation's shared-data directory, |
| often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config |
| --sharedir</command> to determine it if you're not sure). |
| The file format is simply a list |
| of words, one per line. Blank lines and trailing spaces are ignored, |
| and upper case is folded to lower case, but no other processing is done |
| on the file contents. |
| </para> |
| |
| <para> |
| Now we can test our dictionary: |
| |
| <screen> |
| SELECT ts_lexize('public.simple_dict', 'YeS'); |
| ts_lexize |
| ----------- |
| {yes} |
| |
| SELECT ts_lexize('public.simple_dict', 'The'); |
| ts_lexize |
| ----------- |
| {} |
| </screen> |
| </para> |
| |
| <para> |
| We can also choose to return <literal>NULL</literal>, instead of the lower-cased |
| word, if it is not found in the stop words file. This behavior is |
| selected by setting the dictionary's <literal>Accept</literal> parameter to |
| <literal>false</literal>. Continuing the example: |
| |
| <screen> |
| ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false ); |
| |
| SELECT ts_lexize('public.simple_dict', 'YeS'); |
| ts_lexize |
| ----------- |
| |
| |
| SELECT ts_lexize('public.simple_dict', 'The'); |
| ts_lexize |
| ----------- |
| {} |
| </screen> |
| </para> |
| |
| <para> |
| With the default setting of <literal>Accept</literal> = <literal>true</literal>, |
| it is only useful to place a <literal>simple</literal> dictionary at the end |
| of a list of dictionaries, since it will never pass on any token to |
| a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal> |
| is only useful when there is at least one following dictionary. |
| </para> |
| |
| <caution> |
| <para> |
| Most types of dictionaries rely on configuration files, such as files of |
| stop words. These files <emphasis>must</emphasis> be stored in UTF-8 encoding. |
| They will be translated to the actual database encoding, if that is |
| different, when they are read into the server. |
| </para> |
| </caution> |
| |
| <caution> |
| <para> |
| Normally, a database session will read a dictionary configuration file |
| only once, when it is first used within the session. If you modify a |
| configuration file and want to force existing sessions to pick up the |
| new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</command> command |
| on the dictionary. This can be a <quote>dummy</quote> update that doesn't |
| actually change any parameter values. |
| </para> |
| </caution> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-synonym-dictionary"> |
| <title>Synonym Dictionary</title> |
| |
| <para> |
| This dictionary template is used to create dictionaries that replace a |
| word with a synonym. Phrases are not supported (use the thesaurus |
| template (<xref linkend="textsearch-thesaurus"/>) for that). A synonym |
| dictionary can be used to overcome linguistic problems, for example, to |
| prevent an English stemmer dictionary from reducing the word <quote>Paris</quote> to |
| <quote>pari</quote>. It is enough to have a <literal>Paris paris</literal> line in the |
| synonym dictionary and put it before the <literal>english_stem</literal> |
| dictionary. For example: |
| |
| <screen> |
| SELECT * FROM ts_debug('english', 'Paris'); |
| alias | description | token | dictionaries | dictionary | lexemes |
| -----------+-----------------+-------+----------------+--------------+--------- |
| asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari} |
| |
| CREATE TEXT SEARCH DICTIONARY my_synonym ( |
| TEMPLATE = synonym, |
| SYNONYMS = my_synonyms |
| ); |
| |
| ALTER TEXT SEARCH CONFIGURATION english |
| ALTER MAPPING FOR asciiword |
| WITH my_synonym, english_stem; |
| |
| SELECT * FROM ts_debug('english', 'Paris'); |
| alias | description | token | dictionaries | dictionary | lexemes |
| -----------+-----------------+-------+---------------------------+------------+--------- |
| asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris} |
| </screen> |
| </para> |
| |
| <para> |
| The only parameter required by the <literal>synonym</literal> template is |
| <literal>SYNONYMS</literal>, which is the base name of its configuration file |
| — <literal>my_synonyms</literal> in the above example. |
| The file's full name will be |
| <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</filename> |
| (where <literal>$SHAREDIR</literal> means the |
| <productname>PostgreSQL</productname> installation's shared-data directory). |
| The file format is just one line |
| per word to be substituted, with the word followed by its synonym, |
| separated by white space. Blank lines and trailing spaces are ignored. |
| </para> |
| |
| <para> |
| The <literal>synonym</literal> template also has an optional parameter |
| <literal>CaseSensitive</literal>, which defaults to <literal>false</literal>. When |
| <literal>CaseSensitive</literal> is <literal>false</literal>, words in the synonym file |
| are folded to lower case, as are input tokens. When it is |
| <literal>true</literal>, words and tokens are not folded to lower case, |
| but are compared as-is. |
| </para> |
| |
| <para> |
| An asterisk (<literal>*</literal>) can be placed at the end of a synonym |
| in the configuration file. This indicates that the synonym is a prefix. |
| The asterisk is ignored when the entry is used in |
| <function>to_tsvector()</function>, but when it is used in |
| <function>to_tsquery()</function>, the result will be a query item with |
| the prefix match marker (see |
| <xref linkend="textsearch-parsing-queries"/>). |
| For example, suppose we have these entries in |
| <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</filename>: |
| <programlisting> |
| postgres pgsql |
| postgresql pgsql |
| postgre pgsql |
| gogle googl |
| indices index* |
| </programlisting> |
| Then we will get these results: |
| <screen> |
| mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample'); |
| mydb=# SELECT ts_lexize('syn', 'indices'); |
| ts_lexize |
| ----------- |
| {index} |
| (1 row) |
| |
| mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple); |
| mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn; |
| mydb=# SELECT to_tsvector('tst', 'indices'); |
| to_tsvector |
| ------------- |
| 'index':1 |
| (1 row) |
| |
| mydb=# SELECT to_tsquery('tst', 'indices'); |
| to_tsquery |
| ------------ |
| 'index':* |
| (1 row) |
| |
| mydb=# SELECT 'indexes are very useful'::tsvector; |
| tsvector |
| --------------------------------- |
| 'are' 'indexes' 'useful' 'very' |
| (1 row) |
| |
| mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices'); |
| ?column? |
| ---------- |
| t |
| (1 row) |
| </screen> |
| </para> |
| </sect2> |
| |
| <sect2 id="textsearch-thesaurus"> |
| <title>Thesaurus Dictionary</title> |
| |
| <para> |
| A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is |
| a collection of words that includes information about the relationships |
| of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower |
| terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related |
| terms, etc. |
| </para> |
| |
| <para> |
| Basically a thesaurus dictionary replaces all non-preferred terms by one |
| preferred term and, optionally, preserves the original terms for indexing |
| as well. <productname>PostgreSQL</productname>'s current implementation of the |
| thesaurus dictionary is an extension of the synonym dictionary with added |
| <firstterm>phrase</firstterm> support. A thesaurus dictionary requires |
| a configuration file of the following format: |
| |
| <programlisting> |
| # this is a comment |
| sample word(s) : indexed word(s) |
| more sample word(s) : more indexed word(s) |
| ... |
| </programlisting> |
| |
| where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a |
| phrase and its replacement. |
| </para> |
| |
| <para> |
| A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which |
| is specified in the dictionary's configuration) to normalize the input |
| text before checking for phrase matches. It is only possible to select one |
| subdictionary. An error is reported if the subdictionary fails to |
| recognize a word. In that case, you should remove the use of the word or |
| teach the subdictionary about it. You can place an asterisk |
| (<symbol>*</symbol>) at the beginning of an indexed word to skip applying |
| the subdictionary to it, but all sample words <emphasis>must</emphasis> be known |
| to the subdictionary. |
| </para> |
| |
| <para> |
| The thesaurus dictionary chooses the longest match if there are multiple |
| phrases matching the input, and ties are broken by using the last |
| definition. |
| </para> |
| |
| <para> |
| Specific stop words recognized by the subdictionary cannot be |
| specified; instead use <literal>?</literal> to mark the location where any |
| stop word can appear. For example, assuming that <literal>a</literal> and |
| <literal>the</literal> are stop words according to the subdictionary: |
| |
| <programlisting> |
| ? one ? two : swsw |
| </programlisting> |
| |
| matches <literal>a one the two</literal> and <literal>the one a two</literal>; |
| both would be replaced by <literal>swsw</literal>. |
| </para> |
| |
| <para> |
| Since a thesaurus dictionary has the capability to recognize phrases it |
| must remember its state and interact with the parser. A thesaurus dictionary |
| uses these assignments to check if it should handle the next word or stop |
| accumulation. The thesaurus dictionary must be configured |
| carefully. For example, if the thesaurus dictionary is assigned to handle |
| only the <literal>asciiword</literal> token, then a thesaurus dictionary |
| definition like <literal>one 7</literal> will not work since token type |
| <literal>uint</literal> is not assigned to the thesaurus dictionary. |
| </para> |
| |
| <caution> |
| <para> |
| Thesauruses are used during indexing so any change in the thesaurus |
| dictionary's parameters <emphasis>requires</emphasis> reindexing. |
| For most other dictionary types, small changes such as adding or |
| removing stopwords does not force reindexing. |
| </para> |
| </caution> |
| |
| <sect3 id="textsearch-thesaurus-config"> |
| <title>Thesaurus Configuration</title> |
| |
| <para> |
| To define a new thesaurus dictionary, use the <literal>thesaurus</literal> |
| template. For example: |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY thesaurus_simple ( |
| TEMPLATE = thesaurus, |
| DictFile = mythesaurus, |
| Dictionary = pg_catalog.english_stem |
| ); |
| </programlisting> |
| |
| Here: |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <literal>thesaurus_simple</literal> is the new dictionary's name |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>mythesaurus</literal> is the base name of the thesaurus |
| configuration file. |
| (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>, |
| where <literal>$SHAREDIR</literal> means the installation shared-data |
| directory.) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <literal>pg_catalog.english_stem</literal> is the subdictionary (here, |
| a Snowball English stemmer) to use for thesaurus normalization. |
| Notice that the subdictionary will have its own |
| configuration (for example, stop words), which is not shown here. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal> |
| to the desired token types in a configuration, for example: |
| |
| <programlisting> |
| ALTER TEXT SEARCH CONFIGURATION russian |
| ALTER MAPPING FOR asciiword, asciihword, hword_asciipart |
| WITH thesaurus_simple; |
| </programlisting> |
| </para> |
| |
| </sect3> |
| |
| <sect3 id="textsearch-thesaurus-examples"> |
| <title>Thesaurus Example</title> |
| |
| <para> |
| Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>, |
| which contains some astronomical word combinations: |
| |
| <programlisting> |
| supernovae stars : sn |
| crab nebulae : crab |
| </programlisting> |
| |
| Below we create a dictionary and bind some token types to |
| an astronomical thesaurus and English stemmer: |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY thesaurus_astro ( |
| TEMPLATE = thesaurus, |
| DictFile = thesaurus_astro, |
| Dictionary = english_stem |
| ); |
| |
| ALTER TEXT SEARCH CONFIGURATION russian |
| ALTER MAPPING FOR asciiword, asciihword, hword_asciipart |
| WITH thesaurus_astro, english_stem; |
| </programlisting> |
| |
| Now we can see how it works. |
| <function>ts_lexize</function> is not very useful for testing a thesaurus, |
| because it treats its input as a single token. Instead we can use |
| <function>plainto_tsquery</function> and <function>to_tsvector</function> |
| which will break their input strings into multiple tokens: |
| |
| <screen> |
| SELECT plainto_tsquery('supernova star'); |
| plainto_tsquery |
| ----------------- |
| 'sn' |
| |
| SELECT to_tsvector('supernova star'); |
| to_tsvector |
| ------------- |
| 'sn':1 |
| </screen> |
| |
| In principle, one can use <function>to_tsquery</function> if you quote |
| the argument: |
| |
| <screen> |
| SELECT to_tsquery('''supernova star'''); |
| to_tsquery |
| ------------ |
| 'sn' |
| </screen> |
| |
| Notice that <literal>supernova star</literal> matches <literal>supernovae |
| stars</literal> in <literal>thesaurus_astro</literal> because we specified |
| the <literal>english_stem</literal> stemmer in the thesaurus definition. |
| The stemmer removed the <literal>e</literal> and <literal>s</literal>. |
| </para> |
| |
| <para> |
| To index the original phrase as well as the substitute, just include it |
| in the right-hand part of the definition: |
| |
| <screen> |
| supernovae stars : sn supernovae stars |
| |
| SELECT plainto_tsquery('supernova star'); |
| plainto_tsquery |
| ----------------------------- |
| 'sn' & 'supernova' & 'star' |
| </screen> |
| </para> |
| |
| </sect3> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-ispell-dictionary"> |
| <title><application>Ispell</application> Dictionary</title> |
| |
| <para> |
| The <application>Ispell</application> dictionary template supports |
| <firstterm>morphological dictionaries</firstterm>, which can normalize many |
| different linguistic forms of a word into the same lexeme. For example, |
| an English <application>Ispell</application> dictionary can match all declensions and |
| conjugations of the search term <literal>bank</literal>, e.g., |
| <literal>banking</literal>, <literal>banked</literal>, <literal>banks</literal>, |
| <literal>banks'</literal>, and <literal>bank's</literal>. |
| </para> |
| |
| <para> |
| The standard <productname>PostgreSQL</productname> distribution does |
| not include any <application>Ispell</application> configuration files. |
| Dictionaries for a large number of languages are available from <ulink |
| url="https://www.cs.hmc.edu/~geoff/ispell.html">Ispell</ulink>. |
| Also, some more modern dictionary file formats are supported — <ulink |
| url="https://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO < 2.0.1) |
| and <ulink url="https://sourceforge.net/projects/hunspell/">Hunspell</ulink> |
| (OO >= 2.0.2). A large list of dictionaries is available on the <ulink |
| url="https://wiki.openoffice.org/wiki/Dictionaries">OpenOffice |
| Wiki</ulink>. |
| </para> |
| |
| <para> |
| To create an <application>Ispell</application> dictionary perform these steps: |
| </para> |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| download dictionary configuration files. <productname>OpenOffice</productname> |
| extension files have the <filename>.oxt</filename> extension. It is necessary |
| to extract <filename>.aff</filename> and <filename>.dic</filename> files, change |
| extensions to <filename>.affix</filename> and <filename>.dict</filename>. For some |
| dictionary files it is also needed to convert characters to the UTF-8 |
| encoding with commands (for example, for a Norwegian language dictionary): |
| <programlisting> |
| iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff |
| iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic |
| </programlisting> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| copy files to the <filename>$SHAREDIR/tsearch_data</filename> directory |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| load files into PostgreSQL with the following command: |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY english_hunspell ( |
| TEMPLATE = ispell, |
| DictFile = en_us, |
| AffFile = en_us, |
| Stopwords = english); |
| </programlisting> |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| Here, <literal>DictFile</literal>, <literal>AffFile</literal>, and <literal>StopWords</literal> |
| specify the base names of the dictionary, affixes, and stop-words files. |
| The stop-words file has the same format explained above for the |
| <literal>simple</literal> dictionary type. The format of the other files is |
| not specified here but is available from the above-mentioned web sites. |
| </para> |
| |
| <para> |
| Ispell dictionaries usually recognize a limited set of words, so they |
| should be followed by another broader dictionary; for |
| example, a Snowball dictionary, which recognizes everything. |
| </para> |
| |
| <para> |
| The <filename>.affix</filename> file of <application>Ispell</application> has the following |
| structure: |
| <programlisting> |
| prefixes |
| flag *A: |
| . > RE # As in enter > reenter |
| suffixes |
| flag T: |
| E > ST # As in late > latest |
| [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest |
| [AEIOU]Y > EST # As in gray > grayest |
| [^EY] > EST # As in small > smallest |
| </programlisting> |
| </para> |
| <para> |
| And the <filename>.dict</filename> file has the following structure: |
| <programlisting> |
| lapse/ADGRS |
| lard/DGRS |
| large/PRTY |
| lark/MRS |
| </programlisting> |
| </para> |
| |
| <para> |
| Format of the <filename>.dict</filename> file is: |
| <programlisting> |
| basic_form/affix_class_name |
| </programlisting> |
| </para> |
| |
| <para> |
| In the <filename>.affix</filename> file every affix flag is described in the |
| following format: |
| <programlisting> |
| condition > [-stripping_letters,] adding_affix |
| </programlisting> |
| </para> |
| |
| <para> |
| Here, condition has a format similar to the format of regular expressions. |
| It can use groupings <literal>[...]</literal> and <literal>[^...]</literal>. |
| For example, <literal>[AEIOU]Y</literal> means that the last letter of the word |
| is <literal>"y"</literal> and the penultimate letter is <literal>"a"</literal>, |
| <literal>"e"</literal>, <literal>"i"</literal>, <literal>"o"</literal> or <literal>"u"</literal>. |
| <literal>[^EY]</literal> means that the last letter is neither <literal>"e"</literal> |
| nor <literal>"y"</literal>. |
| </para> |
| |
| <para> |
| Ispell dictionaries support splitting compound words; |
| a useful feature. |
| Notice that the affix file should specify a special flag using the |
| <literal>compoundwords controlled</literal> statement that marks dictionary |
| words that can participate in compound formation: |
| |
| <programlisting> |
| compoundwords controlled z |
| </programlisting> |
| |
| Here are some examples for the Norwegian language: |
| |
| <programlisting> |
| SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent'); |
| {over,buljong,terning,pakk,mester,assistent} |
| SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk'); |
| {sjokoladefabrikk,sjokolade,fabrikk} |
| </programlisting> |
| </para> |
| |
| <para> |
| <application>MySpell</application> format is a subset of <application>Hunspell</application>. |
| The <filename>.affix</filename> file of <application>Hunspell</application> has the following |
| structure: |
| <programlisting> |
| PFX A Y 1 |
| PFX A 0 re . |
| SFX T N 4 |
| SFX T 0 st e |
| SFX T y iest [^aeiou]y |
| SFX T 0 est [aeiou]y |
| SFX T 0 est [^ey] |
| </programlisting> |
| </para> |
| |
| <para> |
| The first line of an affix class is the header. Fields of an affix rules are |
| listed after the header: |
| </para> |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| parameter name (PFX or SFX) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| flag (name of the affix class) |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| stripping characters from beginning (at prefix) or end (at suffix) of the |
| word |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| adding affix |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| condition that has a format similar to the format of regular expressions. |
| </para> |
| </listitem> |
| </itemizedlist> |
| |
| <para> |
| The <filename>.dict</filename> file looks like the <filename>.dict</filename> file of |
| <application>Ispell</application>: |
| <programlisting> |
| larder/M |
| lardy/RT |
| large/RSPMYT |
| largehearted |
| </programlisting> |
| </para> |
| |
| <note> |
| <para> |
| <application>MySpell</application> does not support compound words. |
| <application>Hunspell</application> has sophisticated support for compound words. At |
| present, <productname>PostgreSQL</productname> implements only the basic |
| compound word operations of Hunspell. |
| </para> |
| </note> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-snowball-dictionary"> |
| <title><application>Snowball</application> Dictionary</title> |
| |
| <para> |
| The <application>Snowball</application> dictionary template is based on a project |
| by Martin Porter, inventor of the popular Porter's stemming algorithm |
| for the English language. Snowball now provides stemming algorithms for |
| many languages (see the <ulink url="https://snowballstem.org/">Snowball |
| site</ulink> for more information). Each algorithm understands how to |
| reduce common variant forms of words to a base, or stem, spelling within |
| its language. A Snowball dictionary requires a <literal>language</literal> |
| parameter to identify which stemmer to use, and optionally can specify a |
| <literal>stopword</literal> file name that gives a list of words to eliminate. |
| (<productname>PostgreSQL</productname>'s standard stopword lists are also |
| provided by the Snowball project.) |
| For example, there is a built-in definition equivalent to |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY english_stem ( |
| TEMPLATE = snowball, |
| Language = english, |
| StopWords = english |
| ); |
| </programlisting> |
| |
| The stopword file format is the same as already explained. |
| </para> |
| |
| <para> |
| A <application>Snowball</application> dictionary recognizes everything, whether |
| or not it is able to simplify the word, so it should be placed |
| at the end of the dictionary list. It is useless to have it |
| before any other dictionary because a token will never pass through it to |
| the next dictionary. |
| </para> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-configuration"> |
| <title>Configuration Example</title> |
| |
| <para> |
| A text search configuration specifies all options necessary to transform a |
| document into a <type>tsvector</type>: the parser to use to break text |
| into tokens, and the dictionaries to use to transform each token into a |
| lexeme. Every call of |
| <function>to_tsvector</function> or <function>to_tsquery</function> |
| needs a text search configuration to perform its processing. |
| The configuration parameter |
| <xref linkend="guc-default-text-search-config"/> |
| specifies the name of the default configuration, which is the |
| one used by text search functions if an explicit configuration |
| parameter is omitted. |
| It can be set in <filename>postgresql.conf</filename>, or set for an |
| individual session using the <command>SET</command> command. |
| </para> |
| |
| <para> |
| Several predefined text search configurations are available, and |
| you can create custom configurations easily. To facilitate management |
| of text search objects, a set of <acronym>SQL</acronym> commands |
| is available, and there are several <application>psql</application> commands that display information |
| about text search objects (<xref linkend="textsearch-psql"/>). |
| </para> |
| |
| <para> |
| As an example we will create a configuration |
| <literal>pg</literal>, starting by duplicating the built-in |
| <literal>english</literal> configuration: |
| |
| <programlisting> |
| CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english ); |
| </programlisting> |
| </para> |
| |
| <para> |
| We will use a PostgreSQL-specific synonym list |
| and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>. |
| The file contents look like: |
| |
| <programlisting> |
| postgres pg |
| pgsql pg |
| postgresql pg |
| </programlisting> |
| |
| We define the synonym dictionary like this: |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY pg_dict ( |
| TEMPLATE = synonym, |
| SYNONYMS = pg_dict |
| ); |
| </programlisting> |
| |
| Next we register the <productname>Ispell</productname> dictionary |
| <literal>english_ispell</literal>, which has its own configuration files: |
| |
| <programlisting> |
| CREATE TEXT SEARCH DICTIONARY english_ispell ( |
| TEMPLATE = ispell, |
| DictFile = english, |
| AffFile = english, |
| StopWords = english |
| ); |
| </programlisting> |
| |
| Now we can set up the mappings for words in configuration |
| <literal>pg</literal>: |
| |
| <programlisting> |
| ALTER TEXT SEARCH CONFIGURATION pg |
| ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, |
| word, hword, hword_part |
| WITH pg_dict, english_ispell, english_stem; |
| </programlisting> |
| |
| We choose not to index or search some token types that the built-in |
| configuration does handle: |
| |
| <programlisting> |
| ALTER TEXT SEARCH CONFIGURATION pg |
| DROP MAPPING FOR email, url, url_path, sfloat, float; |
| </programlisting> |
| </para> |
| |
| <para> |
| Now we can test our configuration: |
| |
| <programlisting> |
| SELECT * FROM ts_debug('public.pg', ' |
| PostgreSQL, the highly scalable, SQL compliant, open source object-relational |
| database management system, is now undergoing beta testing of the next |
| version of our software. |
| '); |
| </programlisting> |
| </para> |
| |
| <para> |
| The next step is to set the session to use the new configuration, which was |
| created in the <literal>public</literal> schema: |
| |
| <screen> |
| => \dF |
| List of text search configurations |
| Schema | Name | Description |
| ---------+------+------------- |
| public | pg | |
| |
| SET default_text_search_config = 'public.pg'; |
| SET |
| |
| SHOW default_text_search_config; |
| default_text_search_config |
| ---------------------------- |
| public.pg |
| </screen> |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-debugging"> |
| <title>Testing and Debugging Text Search</title> |
| |
| <para> |
| The behavior of a custom text search configuration can easily become |
| confusing. The functions described |
| in this section are useful for testing text search objects. You can |
| test a complete configuration, or test parsers and dictionaries separately. |
| </para> |
| |
| <sect2 id="textsearch-configuration-testing"> |
| <title>Configuration Testing</title> |
| |
| <para> |
| The function <function>ts_debug</function> allows easy testing of a |
| text search configuration. |
| </para> |
| |
| <indexterm> |
| <primary>ts_debug</primary> |
| </indexterm> |
| |
| <synopsis> |
| ts_debug(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">description</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">token</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">dictionaries</replaceable> <type>regdictionary[]</type>, |
| OUT <replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, |
| OUT <replaceable class="parameter">lexemes</replaceable> <type>text[]</type>) |
| returns setof record |
| </synopsis> |
| |
| <para> |
| <function>ts_debug</function> displays information about every token of |
| <replaceable class="parameter">document</replaceable> as produced by the |
| parser and processed by the configured dictionaries. It uses the |
| configuration specified by <replaceable |
| class="parameter">config</replaceable>, |
| or <varname>default_text_search_config</varname> if that argument is |
| omitted. |
| </para> |
| |
| <para> |
| <function>ts_debug</function> returns one row for each token identified in the text |
| by the parser. The columns returned are |
| |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para> |
| <replaceable>alias</replaceable> <type>text</type> — short name of the token type |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>description</replaceable> <type>text</type> — description of the |
| token type |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>token</replaceable> <type>text</type> — text of the token |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>dictionaries</replaceable> <type>regdictionary[]</type> — the |
| dictionaries selected by the configuration for this token type |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>dictionary</replaceable> <type>regdictionary</type> — the dictionary |
| that recognized the token, or <literal>NULL</literal> if none did |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <replaceable>lexemes</replaceable> <type>text[]</type> — the lexeme(s) produced |
| by the dictionary that recognized the token, or <literal>NULL</literal> if |
| none did; an empty array (<literal>{}</literal>) means it was recognized as a |
| stop word |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| Here is a simple example: |
| |
| <screen> |
| SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats'); |
| alias | description | token | dictionaries | dictionary | lexemes |
| -----------+-----------------+-------+----------------+--------------+--------- |
| asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | on | {english_stem} | english_stem | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat} |
| blank | Space symbols | | {} | | |
| blank | Space symbols | - | {} | | |
| asciiword | Word, all ASCII | it | {english_stem} | english_stem | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | a | {english_stem} | english_stem | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat} |
| </screen> |
| </para> |
| |
| <para> |
| For a more extensive demonstration, we |
| first create a <literal>public.english</literal> configuration and |
| Ispell dictionary for the English language: |
| </para> |
| |
| <programlisting> |
| CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english ); |
| |
| CREATE TEXT SEARCH DICTIONARY english_ispell ( |
| TEMPLATE = ispell, |
| DictFile = english, |
| AffFile = english, |
| StopWords = english |
| ); |
| |
| ALTER TEXT SEARCH CONFIGURATION public.english |
| ALTER MAPPING FOR asciiword WITH english_ispell, english_stem; |
| </programlisting> |
| |
| <screen> |
| SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes'); |
| alias | description | token | dictionaries | dictionary | lexemes |
| -----------+-----------------+-------------+-------------------------------+----------------+------------- |
| asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright} |
| blank | Space symbols | | {} | | |
| asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova} |
| </screen> |
| |
| <para> |
| In this example, the word <literal>Brightest</literal> was recognized by the |
| parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>). |
| For this token type the dictionary list is |
| <literal>english_ispell</literal> and |
| <literal>english_stem</literal>. The word was recognized by |
| <literal>english_ispell</literal>, which reduced it to the noun |
| <literal>bright</literal>. The word <literal>supernovaes</literal> is |
| unknown to the <literal>english_ispell</literal> dictionary so it |
| was passed to the next dictionary, and, fortunately, was recognized (in |
| fact, <literal>english_stem</literal> is a Snowball dictionary which |
| recognizes everything; that is why it was placed at the end of the |
| dictionary list). |
| </para> |
| |
| <para> |
| The word <literal>The</literal> was recognized by the |
| <literal>english_ispell</literal> dictionary as a stop word (<xref |
| linkend="textsearch-stopwords"/>) and will not be indexed. |
| The spaces are discarded too, since the configuration provides no |
| dictionaries at all for them. |
| </para> |
| |
| <para> |
| You can reduce the width of the output by explicitly specifying which columns |
| you want to see: |
| |
| <screen> |
| SELECT alias, token, dictionary, lexemes |
| FROM ts_debug('public.english', 'The Brightest supernovaes'); |
| alias | token | dictionary | lexemes |
| -----------+-------------+----------------+------------- |
| asciiword | The | english_ispell | {} |
| blank | | | |
| asciiword | Brightest | english_ispell | {bright} |
| blank | | | |
| asciiword | supernovaes | english_stem | {supernova} |
| </screen> |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-parser-testing"> |
| <title>Parser Testing</title> |
| |
| <para> |
| The following functions allow direct testing of a text search parser. |
| </para> |
| |
| <indexterm> |
| <primary>ts_parse</primary> |
| </indexterm> |
| |
| <synopsis> |
| ts_parse(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, <replaceable class="parameter">document</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type> |
| ts_parse(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, <replaceable class="parameter">document</replaceable> <type>text</type>, |
| OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type> |
| </synopsis> |
| |
| <para> |
| <function>ts_parse</function> parses the given <replaceable>document</replaceable> |
| and returns a series of records, one for each token produced by |
| parsing. Each record includes a <varname>tokid</varname> showing the |
| assigned token type and a <varname>token</varname> which is the text of the |
| token. For example: |
| |
| <screen> |
| SELECT * FROM ts_parse('default', '123 - a number'); |
| tokid | token |
| -------+-------- |
| 22 | 123 |
| 12 | |
| 12 | - |
| 1 | a |
| 12 | |
| 1 | number |
| </screen> |
| </para> |
| |
| <indexterm> |
| <primary>ts_token_type</primary> |
| </indexterm> |
| |
| <synopsis> |
| ts_token_type(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, |
| OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type> |
| ts_token_type(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, |
| OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type> |
| </synopsis> |
| |
| <para> |
| <function>ts_token_type</function> returns a table which describes each type of |
| token the specified parser can recognize. For each token type, the table |
| gives the integer <varname>tokid</varname> that the parser uses to label a |
| token of that type, the <varname>alias</varname> that names the token type |
| in configuration commands, and a short <varname>description</varname>. For |
| example: |
| |
| <screen> |
| SELECT * FROM ts_token_type('default'); |
| tokid | alias | description |
| -------+-----------------+------------------------------------------ |
| 1 | asciiword | Word, all ASCII |
| 2 | word | Word, all letters |
| 3 | numword | Word, letters and digits |
| 4 | email | Email address |
| 5 | url | URL |
| 6 | host | Host |
| 7 | sfloat | Scientific notation |
| 8 | version | Version number |
| 9 | hword_numpart | Hyphenated word part, letters and digits |
| 10 | hword_part | Hyphenated word part, all letters |
| 11 | hword_asciipart | Hyphenated word part, all ASCII |
| 12 | blank | Space symbols |
| 13 | tag | XML tag |
| 14 | protocol | Protocol head |
| 15 | numhword | Hyphenated word, letters and digits |
| 16 | asciihword | Hyphenated word, all ASCII |
| 17 | hword | Hyphenated word, all letters |
| 18 | url_path | URL path |
| 19 | file | File or path name |
| 20 | float | Decimal notation |
| 21 | int | Signed integer |
| 22 | uint | Unsigned integer |
| 23 | entity | XML entity |
| </screen> |
| </para> |
| |
| </sect2> |
| |
| <sect2 id="textsearch-dictionary-testing"> |
| <title>Dictionary Testing</title> |
| |
| <para> |
| The <function>ts_lexize</function> function facilitates dictionary testing. |
| </para> |
| |
| <indexterm> |
| <primary>ts_lexize</primary> |
| </indexterm> |
| |
| <synopsis> |
| ts_lexize(<replaceable class="parameter">dict</replaceable> <type>regdictionary</type>, <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>text[]</type> |
| </synopsis> |
| |
| <para> |
| <function>ts_lexize</function> returns an array of lexemes if the input |
| <replaceable>token</replaceable> is known to the dictionary, |
| or an empty array if the token |
| is known to the dictionary but it is a stop word, or |
| <literal>NULL</literal> if it is an unknown word. |
| </para> |
| |
| <para> |
| Examples: |
| |
| <screen> |
| SELECT ts_lexize('english_stem', 'stars'); |
| ts_lexize |
| ----------- |
| {star} |
| |
| SELECT ts_lexize('english_stem', 'a'); |
| ts_lexize |
| ----------- |
| {} |
| </screen> |
| </para> |
| |
| <note> |
| <para> |
| The <function>ts_lexize</function> function expects a single |
| <emphasis>token</emphasis>, not text. Here is a case |
| where this can be confusing: |
| |
| <screen> |
| SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null; |
| ?column? |
| ---------- |
| t |
| </screen> |
| |
| The thesaurus dictionary <literal>thesaurus_astro</literal> does know the |
| phrase <literal>supernovae stars</literal>, but <function>ts_lexize</function> |
| fails since it does not parse the input text but treats it as a single |
| token. Use <function>plainto_tsquery</function> or <function>to_tsvector</function> to |
| test thesaurus dictionaries, for example: |
| |
| <screen> |
| SELECT plainto_tsquery('supernovae stars'); |
| plainto_tsquery |
| ----------------- |
| 'sn' |
| </screen> |
| </para> |
| </note> |
| |
| </sect2> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-indexes"> |
| <title>Preferred Index Types for Text Search</title> |
| |
| <indexterm zone="textsearch-indexes"> |
| <primary>text search</primary> |
| <secondary>indexes</secondary> |
| </indexterm> |
| |
| <para> |
| There are two kinds of indexes that can be used to speed up full text |
| searches: |
| <link linkend="gin"><acronym>GIN</acronym></link> and |
| <link linkend="gist"><acronym>GiST</acronym></link>. |
| Note that indexes are not mandatory for full text searching, but in |
| cases where a column is searched on a regular basis, an index is |
| usually desirable. |
| </para> |
| |
| <para> |
| To create such an index, do one of: |
| |
| <variablelist> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm zone="textsearch-indexes"> |
| <primary>index</primary> |
| <secondary>GIN</secondary> |
| <tertiary>text search</tertiary> |
| </indexterm> |
| |
| <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);</literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Creates a GIN (Generalized Inverted Index)-based index. |
| The <replaceable>column</replaceable> must be of <type>tsvector</type> type. |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| |
| <term> |
| <indexterm zone="textsearch-indexes"> |
| <primary>index</primary> |
| <secondary>GiST</secondary> |
| <tertiary>text search</tertiary> |
| </indexterm> |
| |
| <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable> [ { DEFAULT | tsvector_ops } (siglen = <replaceable>number</replaceable>) ] );</literal> |
| </term> |
| |
| <listitem> |
| <para> |
| Creates a GiST (Generalized Search Tree)-based index. |
| The <replaceable>column</replaceable> can be of <type>tsvector</type> or |
| <type>tsquery</type> type. |
| Optional integer parameter <literal>siglen</literal> determines |
| signature length in bytes (see below for details). |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| </variablelist> |
| </para> |
| |
| <para> |
| GIN indexes are the preferred text search index type. As inverted |
| indexes, they contain an index entry for each word (lexeme), with a |
| compressed list of matching locations. Multi-word searches can find |
| the first match, then use the index to remove rows that are lacking |
| additional words. GIN indexes store only the words (lexemes) of |
| <type>tsvector</type> values, and not their weight labels. Thus a table |
| row recheck is needed when using a query that involves weights. |
| </para> |
| |
| <para> |
| A GiST index is <firstterm>lossy</firstterm>, meaning that the index |
| might produce false matches, and it is necessary |
| to check the actual table row to eliminate such false matches. |
| (<productname>PostgreSQL</productname> does this automatically when needed.) |
| GiST indexes are lossy because each document is represented in the |
| index by a fixed-length signature. The signature length in bytes is determined |
| by the value of the optional integer parameter <literal>siglen</literal>. |
| The default signature length (when <literal>siglen</literal> is not specified) is |
| 124 bytes, the maximum signature length is 2024 bytes. The signature is generated by hashing |
| each word into a single bit in an n-bit string, with all these bits OR-ed |
| together to produce an n-bit document signature. When two words hash to |
| the same bit position there will be a false match. If all words in |
| the query have matches (real or false) then the table row must be |
| retrieved to see if the match is correct. Longer signatures lead to a more |
| precise search (scanning a smaller fraction of the index and fewer heap |
| pages), at the cost of a larger index. |
| </para> |
| |
| <para> |
| A GiST index can be covering, i.e., use the <literal>INCLUDE</literal> |
| clause. Included columns can have data types without any GiST operator |
| class. Included attributes will be stored uncompressed. |
| </para> |
| |
| <para> |
| Lossiness causes performance degradation due to unnecessary fetches of table |
| records that turn out to be false matches. Since random access to table |
| records is slow, this limits the usefulness of GiST indexes. The |
| likelihood of false matches depends on several factors, in particular the |
| number of unique words, so using dictionaries to reduce this number is |
| recommended. |
| </para> |
| |
| <para> |
| Note that <acronym>GIN</acronym> index build time can often be improved |
| by increasing <xref linkend="guc-maintenance-work-mem"/>, while |
| <acronym>GiST</acronym> index build time is not sensitive to that |
| parameter. |
| </para> |
| |
| <para> |
| Partitioning of big collections and the proper use of GIN and GiST indexes |
| allows the implementation of very fast searches with online update. |
| Partitioning can be done at the database level using table inheritance, |
| or by distributing documents over |
| servers and collecting external search results, e.g., via <link |
| linkend="ddl-foreign-data">Foreign Data</link> access. |
| The latter is possible because ranking functions use |
| only local information. |
| </para> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-psql"> |
| <title><application>psql</application> Support</title> |
| |
| <para> |
| Information about text search configuration objects can be obtained |
| in <application>psql</application> using a set of commands: |
| <synopsis> |
| \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional> |
| </synopsis> |
| An optional <literal>+</literal> produces more details. |
| </para> |
| |
| <para> |
| The optional parameter <replaceable>PATTERN</replaceable> can be the name of |
| a text search object, optionally schema-qualified. If |
| <replaceable>PATTERN</replaceable> is omitted then information about all |
| visible objects will be displayed. <replaceable>PATTERN</replaceable> can be a |
| regular expression and can provide <emphasis>separate</emphasis> patterns |
| for the schema and object names. The following examples illustrate this: |
| |
| <screen> |
| => \dF *fulltext* |
| List of text search configurations |
| Schema | Name | Description |
| --------+--------------+------------- |
| public | fulltext_cfg | |
| </screen> |
| |
| <screen> |
| => \dF *.fulltext* |
| List of text search configurations |
| Schema | Name | Description |
| ----------+---------------------------- |
| fulltext | fulltext_cfg | |
| public | fulltext_cfg | |
| </screen> |
| |
| The available commands are: |
| </para> |
| |
| <variablelist> |
| <varlistentry> |
| <term><literal>\dF<optional>+</optional> <optional>PATTERN</optional></literal></term> |
| <listitem> |
| <para> |
| List text search configurations (add <literal>+</literal> for more detail). |
| <screen> |
| => \dF russian |
| List of text search configurations |
| Schema | Name | Description |
| ------------+---------+------------------------------------ |
| pg_catalog | russian | configuration for russian language |
| |
| => \dF+ russian |
| Text search configuration "pg_catalog.russian" |
| Parser: "pg_catalog.default" |
| Token | Dictionaries |
| -----------------+-------------- |
| asciihword | english_stem |
| asciiword | english_stem |
| email | simple |
| file | simple |
| float | simple |
| host | simple |
| hword | russian_stem |
| hword_asciipart | english_stem |
| hword_numpart | simple |
| hword_part | russian_stem |
| int | simple |
| numhword | simple |
| numword | simple |
| sfloat | simple |
| uint | simple |
| url | simple |
| url_path | simple |
| version | simple |
| word | russian_stem |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term><literal>\dFd<optional>+</optional> <optional>PATTERN</optional></literal></term> |
| <listitem> |
| <para> |
| List text search dictionaries (add <literal>+</literal> for more detail). |
| <screen> |
| => \dFd |
| List of text search dictionaries |
| Schema | Name | Description |
| ------------+-----------------+----------------------------------------------------------- |
| pg_catalog | arabic_stem | snowball stemmer for arabic language |
| pg_catalog | armenian_stem | snowball stemmer for armenian language |
| pg_catalog | basque_stem | snowball stemmer for basque language |
| pg_catalog | catalan_stem | snowball stemmer for catalan language |
| pg_catalog | danish_stem | snowball stemmer for danish language |
| pg_catalog | dutch_stem | snowball stemmer for dutch language |
| pg_catalog | english_stem | snowball stemmer for english language |
| pg_catalog | finnish_stem | snowball stemmer for finnish language |
| pg_catalog | french_stem | snowball stemmer for french language |
| pg_catalog | german_stem | snowball stemmer for german language |
| pg_catalog | greek_stem | snowball stemmer for greek language |
| pg_catalog | hindi_stem | snowball stemmer for hindi language |
| pg_catalog | hungarian_stem | snowball stemmer for hungarian language |
| pg_catalog | indonesian_stem | snowball stemmer for indonesian language |
| pg_catalog | irish_stem | snowball stemmer for irish language |
| pg_catalog | italian_stem | snowball stemmer for italian language |
| pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language |
| pg_catalog | nepali_stem | snowball stemmer for nepali language |
| pg_catalog | norwegian_stem | snowball stemmer for norwegian language |
| pg_catalog | portuguese_stem | snowball stemmer for portuguese language |
| pg_catalog | romanian_stem | snowball stemmer for romanian language |
| pg_catalog | russian_stem | snowball stemmer for russian language |
| pg_catalog | serbian_stem | snowball stemmer for serbian language |
| pg_catalog | simple | simple dictionary: just lower case and check for stopword |
| pg_catalog | spanish_stem | snowball stemmer for spanish language |
| pg_catalog | swedish_stem | snowball stemmer for swedish language |
| pg_catalog | tamil_stem | snowball stemmer for tamil language |
| pg_catalog | turkish_stem | snowball stemmer for turkish language |
| pg_catalog | yiddish_stem | snowball stemmer for yiddish language |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term><literal>\dFp<optional>+</optional> <optional>PATTERN</optional></literal></term> |
| <listitem> |
| <para> |
| List text search parsers (add <literal>+</literal> for more detail). |
| <screen> |
| => \dFp |
| List of text search parsers |
| Schema | Name | Description |
| ------------+---------+--------------------- |
| pg_catalog | default | default word parser |
| => \dFp+ |
| Text search parser "pg_catalog.default" |
| Method | Function | Description |
| -----------------+----------------+------------- |
| Start parse | prsd_start | |
| Get next token | prsd_nexttoken | |
| End parse | prsd_end | |
| Get headline | prsd_headline | |
| Get token types | prsd_lextype | |
| |
| Token types for parser "pg_catalog.default" |
| Token name | Description |
| -----------------+------------------------------------------ |
| asciihword | Hyphenated word, all ASCII |
| asciiword | Word, all ASCII |
| blank | Space symbols |
| email | Email address |
| entity | XML entity |
| file | File or path name |
| float | Decimal notation |
| host | Host |
| hword | Hyphenated word, all letters |
| hword_asciipart | Hyphenated word part, all ASCII |
| hword_numpart | Hyphenated word part, letters and digits |
| hword_part | Hyphenated word part, all letters |
| int | Signed integer |
| numhword | Hyphenated word, letters and digits |
| numword | Word, letters and digits |
| protocol | Protocol head |
| sfloat | Scientific notation |
| tag | XML tag |
| uint | Unsigned integer |
| url | URL |
| url_path | URL path |
| version | Version number |
| word | Word, all letters |
| (23 rows) |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| |
| <varlistentry> |
| <term><literal>\dFt<optional>+</optional> <optional>PATTERN</optional></literal></term> |
| <listitem> |
| <para> |
| List text search templates (add <literal>+</literal> for more detail). |
| <screen> |
| => \dFt |
| List of text search templates |
| Schema | Name | Description |
| ------------+-----------+----------------------------------------------------------- |
| pg_catalog | ispell | ispell dictionary |
| pg_catalog | simple | simple dictionary: just lower case and check for stopword |
| pg_catalog | snowball | snowball stemmer |
| pg_catalog | synonym | synonym dictionary: replace word by its synonym |
| pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution |
| </screen> |
| </para> |
| </listitem> |
| </varlistentry> |
| </variablelist> |
| |
| </sect1> |
| |
| <sect1 id="textsearch-limitations"> |
| <title>Limitations</title> |
| |
| <para> |
| The current limitations of <productname>PostgreSQL</productname>'s |
| text search features are: |
| <itemizedlist spacing="compact" mark="bullet"> |
| <listitem> |
| <para>The length of each lexeme must be less than 2 kilobytes</para> |
| </listitem> |
| <listitem> |
| <para>The length of a <type>tsvector</type> (lexemes + positions) must be |
| less than 1 megabyte</para> |
| </listitem> |
| <listitem> |
| <!-- TODO: number of lexemes in what? This is unclear --> |
| <para>The number of lexemes must be less than |
| 2<superscript>64</superscript></para> |
| </listitem> |
| <listitem> |
| <para>Position values in <type>tsvector</type> must be greater than 0 and |
| no more than 16,383</para> |
| </listitem> |
| <listitem> |
| <para>The match distance in a <literal><<replaceable>N</replaceable>></literal> |
| (FOLLOWED BY) <type>tsquery</type> operator cannot be more than |
| 16,384</para> |
| </listitem> |
| <listitem> |
| <para>No more than 256 positions per lexeme</para> |
| </listitem> |
| <listitem> |
| <para>The number of nodes (lexemes + operators) in a <type>tsquery</type> |
| must be less than 32,768</para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| |
| <para> |
| For comparison, the <productname>PostgreSQL</productname> 8.1 documentation |
| contained 10,441 unique words, a total of 335,420 words, and the most |
| frequent word <quote>postgresql</quote> was mentioned 6,127 times in 655 |
| documents. |
| </para> |
| |
| <!-- TODO we need to put a date on these numbers? --> |
| <para> |
| Another example — the <productname>PostgreSQL</productname> mailing |
| list archives contained 910,989 unique words with 57,491,343 lexemes in |
| 461,020 messages. |
| </para> |
| |
| </sect1> |
| |
| </chapter> |