| = Tokenizers |
| // Licensed to the Apache Software Foundation (ASF) under one |
| // or more contributor license agreements. See the NOTICE file |
| // distributed with this work for additional information |
| // regarding copyright ownership. The ASF licenses this file |
| // to you under the Apache License, Version 2.0 (the |
| // "License"); you may not use this file except in compliance |
| // with the License. You may obtain a copy of the License at |
| // |
| // http://www.apache.org/licenses/LICENSE-2.0 |
| // |
| // Unless required by applicable law or agreed to in writing, |
| // software distributed under the License is distributed on an |
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| // KIND, either express or implied. See the License for the |
| // specific language governing permissions and limitations |
| // under the License. |
| |
| Tokenizers are responsible for breaking field data into lexical units, or _tokens_. |
| |
| You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`: |
| |
| [source,xml] |
| ---- |
| <fieldType name="text" class="solr.TextField"> |
| <analyzer type="index"> |
| <tokenizer class="solr.StandardTokenizerFactory"/> |
| <filter class="solr.LowerCaseFilterFactory"/> |
| </analyzer> |
| </fieldType> |
| ---- |
| |
The `class` attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory` interface. A `TokenizerFactory`'s `create()` method accepts a `Reader` and returns a `TokenStream`. When Solr creates the tokenizer, it passes a `Reader` object that provides the content of the text field.
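
For illustration, here is a minimal sketch of a custom factory. The class `MyTokenizerFactory` is hypothetical, and the exact base-class API varies by Lucene/Solr version; in recent releases `create()` receives an `AttributeFactory` and the `Reader` is attached to the returned `Tokenizer` afterward via `setReader()`. This sketch simply delegates to `WhitespaceTokenizer`:

[source,java]
----
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeFactory;

// Hypothetical factory; with its jar on the classpath it could be
// referenced in schema.xml as <tokenizer class="com.example.MyTokenizerFactory"/>
public class MyTokenizerFactory extends TokenizerFactory {

  // Attributes set on the <tokenizer> element arrive in the args map.
  public MyTokenizerFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public Tokenizer create(AttributeFactory factory) {
    return new WhitespaceTokenizer(factory);
  }
}
----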
| |
| Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element. |
| |
| [source,xml] |
| ---- |
| <fieldType name="semicolonDelimited" class="solr.TextField"> |
| <analyzer type="query"> |
| <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/> |
| </analyzer> |
| </fieldType> |
| ---- |
| |
| The following sections describe the tokenizer factory classes included in this release of Solr. |
| |
| == Standard Tokenizer |
| |
| This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions: |
| |
* Periods (dots) that are not followed by whitespace are kept as part of the token; this includes Internet domain names.
| * The "@" character is among the set of token-splitting punctuation, so email addresses are *not* preserved as single tokens. |
| |
| Note that words are split at hyphens. |
| |
| The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`. |
| |
| *Factory class:* `solr.StandardTokenizerFactory` |
| |
| *Arguments:* |
| |
| `maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`. |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.StandardTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." |
| |
| *Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq" |
| |
| == Classic Tokenizer |
| |
| The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. It does not use the http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions: |
| |
| * Periods (dots) that are not followed by whitespace are kept as part of the token. |
| |
| * Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved. |
| |
* Internet domain names and email addresses are recognized and preserved as single tokens.
| |
| *Factory class:* `solr.ClassicTokenizerFactory` |
| |
| *Arguments:* |
| |
| `maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`. |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.ClassicTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." |
| |
| *Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq" |
| |
| == Keyword Tokenizer |
| |
| This tokenizer treats the entire text field as a single token. |
| |
| *Factory class:* `solr.KeywordTokenizerFactory` |
| |
| *Arguments:* None |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.KeywordTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." |
| |
| *Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq." |
| |
| == Letter Tokenizer |
| |
| This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters. |
| |
| *Factory class:* `solr.LetterTokenizerFactory` |
| |
| *Arguments:* None |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.LetterTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "I can't." |
| |
| *Out:* "I", "can", "t" |
| |
| == Lower Case Tokenizer |
| |
| Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded. |
| |
| *Factory class:* `solr.LowerCaseTokenizerFactory` |
| |
| *Arguments:* None |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.LowerCaseTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "I just \*LOVE* my iPhone!" |
| |
| *Out:* "i", "just", "love", "my", "iphone" |
| |
| == N-Gram Tokenizer |
| |
| Reads the field text and generates n-gram tokens of sizes in the given range. |
| |
| *Factory class:* `solr.NGramTokenizerFactory` |
| |
| *Arguments:* |
| |
| `minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0. |
| |
| `maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`. |
| |
| *Example:* |
| |
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the n-grams it produces.
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.NGramTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "hey man" |
| |
| *Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an" |
| |
| *Example:* |
| |
| With an n-gram size range of 4 to 5: |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/> |
| </analyzer> |
| ---- |
| |
| *In:* "bicycle" |
| |
| *Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle" |
| |
| == Edge N-Gram Tokenizer |
| |
| Reads the field text and generates edge n-gram tokens of sizes in the given range. |
| |
| *Factory class:* `solr.EdgeNGramTokenizerFactory` |
| |
| *Arguments:* |
| |
| `minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0. |
| |
| `maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`. |
| |
| *Example:* |
| |
| Default behavior (min and max default to 1): |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.EdgeNGramTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "babaloo" |
| |
| *Out:* "b" |
| |
| *Example:* |
| |
Edge n-gram range of 2 to 5:
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/> |
| </analyzer> |
| ---- |
| |
| *In:* "babaloo" |
| |
*Out:* "ba", "bab", "baba", "babal"
| |
| == ICU Tokenizer |
| |
| This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute. |
| |
| You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`. |
| |
The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word-break tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (specialized handling of double and single quotation marks), syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
| |
| *Factory class:* `solr.ICUTokenizerFactory` |
| |
| *Arguments:* |
| |
`rulefiles`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <!-- no customization --> |
| <tokenizer class="solr.ICUTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.ICUTokenizerFactory" |
| rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> |
| </analyzer> |
| ---- |
| |
| [IMPORTANT] |
| ==== |
| |
To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<solr-plugins.adoc#installing-plugins,Solr Plugins>>). See `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add.
| |
| ==== |
| |
| == Path Hierarchy Tokenizer |
| |
This tokenizer creates synonyms from file path hierarchies, emitting one token for each prefix of the path.
| |
| *Factory class:* `solr.PathHierarchyTokenizerFactory` |
| |
| *Arguments:* |
| |
`delimiter`: (character, no default) Specifies the delimiter character in the incoming file path. This can be useful for working with backslash delimiters.
| |
| `replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output. |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <fieldType name="text_path" class="solr.TextField" positionIncrementGap="100"> |
| <analyzer> |
| <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/> |
| </analyzer> |
| </fieldType> |
| ---- |
| |
| *In:* "c:\usr\local\apache" |
| |
| *Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache" |
| |
| == Regular Expression Pattern Tokenizer |
| |
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the `pattern` argument can be interpreted either as a delimiter that separates tokens, or as a pattern whose matches should be extracted from the text as tokens.
| |
| See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.regex.Pattern`] for more information on Java regular expression syntax. |
| |
| *Factory class:* `solr.PatternTokenizerFactory` |
| |
| *Arguments:* |
| |
`pattern`: (Required) The regular expression, as defined by `java.util.regex.Pattern`.
| |
| `group`: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right. |
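
The two modes can be mimicked with plain `java.util.regex` (a rough sketch of the behavior, not Solr's actual implementation): `group="-1"` corresponds to `Pattern.split()`, while a non-negative `group` corresponds to repeated `Matcher.find()` calls that emit the chosen capture group:

[source,java]
----
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupModes {
  public static void main(String[] args) {
    String input = "SKU: 1234, Part Number 5678";

    // group="-1": the pattern acts as a delimiter; the tokens are the
    // pieces between matches -> "SKU: 1234" and "Part Number 5678"
    for (String token : Pattern.compile("\\s*,\\s*").split(input)) {
      System.out.println(token);
    }

    // group="3" (any group >= 0): the tokens are the text matched by
    // that capture group -> "1234" and "5678"
    Matcher m = Pattern.compile("(SKU|Part(\\sNumber)?):?\\s([0-9-]+)").matcher(input);
    while (m.find()) {
      System.out.println(m.group(3));
    }
  }
}
----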
| |
| *Example:* |
| |
A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/> |
| </analyzer> |
| ---- |
| |
| *In:* "fee,fie, foe , fum, foo" |
| |
| *Out:* "fee", "fie", "foe", "fum", "foo" |
| |
| *Example:* |
| |
| Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token. |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/> |
| </analyzer> |
| ---- |
| |
| *In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die." |
| |
| *Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare" |
| |
| *Example:* |
| |
Extract part numbers that are preceded by "SKU", "Part" or "Part Number", case-sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/> |
| </analyzer> |
| ---- |
| |
| *In:* "SKU: 1234, Part Number 5678, Part: 126-987" |
| |
| *Out:* "1234", "5678", "126-987" |
| |
| == Simplified Regular Expression Pattern Tokenizer |
| |
This tokenizer is similar to the `PatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to construct distinct tokens for the input stream. The syntax is more limited than that of `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
| |
| *Factory class:* `solr.SimplePatternTokenizerFactory` |
| |
| *Arguments:* |
| |
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens. The matching is greedy, such that the longest token matching at a given point is created. Empty tokens are never created.
| |
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total number of states in the determinized automaton computed from the regexp.
| |
| *Example:* |
| |
| To match tokens delimited by simple whitespace characters: |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/> |
| </analyzer> |
| ---- |
| |
| == Simplified Regular Expression Pattern Splitting Tokenizer |
| |
This tokenizer is similar to the `SimplePatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than that of `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
| |
| *Factory class:* `solr.SimplePatternSplitTokenizerFactory` |
| |
| *Arguments:* |
| |
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens. The matching is greedy, such that the longest token separator matching at a given point is matched. Empty tokens are never created.
| |
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total number of states in the determinized automaton computed from the regexp.
| |
| *Example:* |
| |
| To match tokens delimited by simple whitespace characters: |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/> |
| </analyzer> |
| ---- |
| |
| == UAX29 URL Email Tokenizer |
| |
| This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions: |
| |
| * Periods (dots) that are not followed by whitespace are kept as part of the token. |
| |
| * Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved. |
| |
| * Recognizes and preserves as single tokens the following: |
| ** Internet domain names containing top-level domains validated against the white list in the http://www.internic.net/zones/root.zone[IANA Root Zone Database] when the tokenizer was generated |
| ** email addresses |
| ** `file://`, `http(s)://`, and `ftp://` URLs |
| ** IPv4 and IPv6 addresses |
| |
| The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<URL>`, `<EMAIL>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`. |
| |
| *Factory class:* `solr.UAX29URLEmailTokenizerFactory` |
| |
| *Arguments:* |
| |
| `maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`. |
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/> |
| </analyzer> |
| ---- |
| |
| *In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com" |
| |
| *Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com" |
| |
| == White Space Tokenizer |
| |
| Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation _will_ be included in the tokens. |
| |
| *Factory class:* `solr.WhitespaceTokenizerFactory` |
| |
| *Arguments:* |
| |
`rule`::
Specifies how to define whitespace for the purpose of tokenization. Valid values (the sketch after this list shows a case where they differ):
+
* `java`: (Default) Uses {java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property
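
The practical difference shows up on characters such as the no-break space (U+00A0), which has the Unicode WHITESPACE property but is deliberately excluded by `Character.isWhitespace()` (per its Javadocs). A minimal check, using only core Java:

[source,java]
----
import java.util.regex.Pattern;

public class WhitespaceRules {
  public static void main(String[] args) {
    String nbsp = "\u00A0"; // NO-BREAK SPACE

    // rule="java" delegates to Character.isWhitespace(int), which
    // excludes non-breaking spaces such as U+00A0:
    System.out.println(Character.isWhitespace(nbsp.codePointAt(0))); // false

    // rule="unicode" follows the Unicode WHITESPACE property, which
    // includes U+00A0 (checked here via the regex binary property):
    System.out.println(Pattern.matches("\\p{IsWhite_Space}", nbsp)); // true
  }
}
----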
| |
| *Example:* |
| |
| [source,xml] |
| ---- |
| <analyzer> |
| <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" /> |
| </analyzer> |
| ---- |
| |
| *In:* "To be, or what?" |
| |
| *Out:* "To", "be,", "or", "what?" |
| |
| == OpenNLP Tokenizer and OpenNLP Filters |
| |
| See <<language-analysis.adoc#opennlp-integration,OpenNLP Integration>> for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters. |