= Tokenizers
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
Tokenizers are responsible for breaking field data into lexical units, or _tokens_.
You configure the tokenizer for a text field type in `schema.xml` with a `<tokenizer>` element, as a child of `<analyzer>`:
[source,xml]
----
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
----
The `class` attribute names a factory class that will instantiate a tokenizer object when needed. Tokenizer factory classes implement the `org.apache.solr.analysis.TokenizerFactory` interface. A TokenizerFactory's `create()` method accepts a Reader and returns a TokenStream. When Solr creates the tokenizer, it passes a Reader object that provides the content of the text field.
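The same tokenizers can also be driven directly through the Lucene analysis API, outside of Solr. Below is a minimal standalone sketch of the Reader-in, TokenStream-out contract, assuming a recent Lucene version on the classpath (`TokenizeDemo` is an illustrative name, not part of Solr):
[source,java]
----
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    // StandardTokenizer is the tokenizer behind solr.StandardTokenizerFactory.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("Please, email john.doe@foo.com by 03-09, re: m37-xq."));

    // Token text is read through the stream's CharTermAttribute.
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString()); // Please, email, john.doe, foo.com, by, 03, 09, re, m37, xq
    }
    tokenizer.end();
    tokenizer.close();
  }
}
----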
Arguments may be passed to tokenizer factories by setting attributes on the `<tokenizer>` element:
[source,xml]
----
<fieldType name="semicolonDelimited" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
</analyzer>
</fieldType>
----
The following sections describe the tokenizer factory classes included in this release of Solr.
== Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
* The "@" character is among the set of token-splitting punctuation, so email addresses are *not* preserved as single tokens.
Note that words are split at hyphens.
The Standard Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.
*Factory class:* `solr.StandardTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
== Classic Tokenizer
The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions 3.1 and previous. It does not use the http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token.
* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
* Recognizes Internet domain names and email addresses and preserves them as a single token.
*Factory class:* `solr.ClassicTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"
== Keyword Tokenizer
This tokenizer treats the entire text field as a single token.
*Factory class:* `solr.KeywordTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
----
*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
*Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."
== Letter Tokenizer
This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter characters.
*Factory class:* `solr.LetterTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
----
*In:* "I can't."
*Out:* "I", "can", "t"
== Lower Case Tokenizer
Tokenizes the input stream by delimiting at non-letters and then converting all letters to lowercase. Whitespace and non-letters are discarded.
*Factory class:* `solr.LowerCaseTokenizerFactory`
*Arguments:* None
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
----
*In:* "I just \*LOVE* my iPhone!"
*Out:* "i", "just", "love", "my", "iphone"
== N-Gram Tokenizer
Reads the field text and generates n-gram tokens of sizes in the given range.
*Factory class:* `solr.NGramTokenizerFactory`
*Arguments:*
`minGramSize`: (integer, default 1) The minimum n-gram size, must be > 0.
`maxGramSize`: (integer, default 2) The maximum n-gram size, must be >= `minGramSize`.
*Example:*
Default behavior. Note that this tokenizer operates over the whole field. It does not break the field at whitespace. As a result, the space character is included in the n-grams it generates.
[source,xml]
----
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
----
*In:* "hey man"
*Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"
*Example:*
With an n-gram size range of 4 to 5:
[source,xml]
----
<analyzer>
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
----
*In:* "bicycle"
*Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
== Edge N-Gram Tokenizer
Reads the field text and generates edge n-gram tokens of sizes in the given range.
*Factory class:* `solr.EdgeNGramTokenizerFactory`
*Arguments:*
`minGramSize`: (integer, default is 1) The minimum n-gram size, must be > 0.
`maxGramSize`: (integer, default is 1) The maximum n-gram size, must be >= `minGramSize`.
*Example:*
Default behavior (min and max default to 1):
[source,xml]
----
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
----
*In:* "babaloo"
*Out:* "b"
*Example:*
Edge n-gram range of 2 to 5:
[source,xml]
----
<analyzer>
<tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
----
*In:* "babaloo"
**Out:**"ba", "bab", "baba", "babal"
== ICU Tokenizer
This tokenizer processes multilingual text and tokenizes it appropriately based on its script attribute.
You can customize this tokenizer's behavior by specifying http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules[per-script rule files]. To add per-script rules, add a `rulefiles` argument, which should contain a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"), you would enter `Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi`.
The default configuration for `solr.ICUTokenizerFactory` provides UAX#29 word-break rule tokenization (like `solr.StandardTokenizer`), but also includes custom tailorings for Hebrew (special handling of double and single quotation marks), syllable tokenization for Khmer, Lao, and Myanmar, and dictionary-based word segmentation for CJK characters.
*Factory class:* `solr.ICUTokenizerFactory`
*Arguments:*
`rulefiles`: a comma-separated list of `code:rulefile` pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path.
*Example:*
[source,xml]
----
<analyzer>
<!-- no customization -->
<tokenizer class="solr.ICUTokenizerFactory"/>
</analyzer>
----
[source,xml]
----
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"
rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
----
[IMPORTANT]
====
To use this tokenizer, you must add additional .jars to Solr's classpath (as described in the section <<solr-plugins.adoc#installing-plugins,Solr Plugins>>). See the `solr/contrib/analysis-extras/README.txt` for information on which jars you need to add.
====
== Path Hierarchy Tokenizer
This tokenizer creates synonym-like tokens from file path hierarchies: one token is emitted for each level of the path.
*Factory class:* `solr.PathHierarchyTokenizerFactory`
*Arguments:*
`delimiter`: (character, no default) Specifies the path separator character in the input, which is replaced by the `replace` character in the output. This can be useful for working with backslash delimiters.
`replace`: (character, no default) Specifies the delimiter character Solr uses in the tokenized output.
*Example:*
[source,xml]
----
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
</analyzer>
</fieldType>
----
*In:* "c:\usr\local\apache"
*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
== Regular Expression Pattern Tokenizer
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the `pattern` argument can be interpreted either as a delimiter that separates tokens, or as a pattern whose matches should be extracted from the text as tokens.
See {java-javadocs}java/util/regex/Pattern.html[the Javadocs for `java.util.regex.Pattern`] for more information on Java regular expression syntax.
*Factory class:* `solr.PatternTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined by `java.util.regex.Pattern`.
`group`: (Optional, default -1) Specifies which regex group to extract as the token(s). The value -1 means the regex should be treated as a delimiter that separates tokens. Non-negative group numbers (>= 0) indicate that character sequences matching that regex group should be converted to tokens. Group zero refers to the entire regex; groups greater than zero refer to parenthesized sub-expressions of the regex, counted from left to right.
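These two modes mirror plain `java.util.regex` behavior: a delimiter pattern acts like `Pattern.split()`, while a non-negative group acts like repeated `Matcher.find()` calls. The following is a minimal standalone sketch of that distinction (illustrative Java only, not Solr code):
[source,java]
----
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupModes {
  public static void main(String[] args) {
    // group="-1": the pattern is a delimiter; tokens are the text between matches.
    for (String token : Pattern.compile("\\s*,\\s*").split("fee,fie, foe , fum, foo")) {
      System.out.println(token); // fee, fie, foe, fum, foo
    }

    // group="0" (or greater): each match of the pattern (or of a sub-group) becomes a token.
    Matcher m = Pattern.compile("[A-Z][A-Za-z]*").matcher("Hello. My name is Inigo Montoya.");
    while (m.find()) {
      System.out.println(m.group(0)); // Hello, My, Inigo, Montoya
    }
  }
}
----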
*Example:*
A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma, and zero or more spaces.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
----
*In:* "fee,fie, foe , fum, foo"
*Out:* "fee", "fie", "foe", "fum", "foo"
*Example:*
Extract simple, capitalized words. A sequence of at least one capital letter followed by zero or more letters of either case is extracted as a token.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Z][A-Za-z]*" group="0"/>
</analyzer>
----
*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."
*Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare"
*Example:*
Extract part numbers which are preceded by "SKU", "Part" or "Part Number", case sensitive, with an optional colon separator. Part numbers must be all numeric digits, with an optional hyphen. Regex capture groups are numbered by counting left parentheses from left to right. Group 3 is the subexpression "[0-9-]+", which matches one or more digits or hyphens.
[source,xml]
----
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)" group="3"/>
</analyzer>
----
*In:* "SKU: 1234, Part Number 5678, Part: 126-987"
*Out:* "1234", "5678", "126-987"
== Simplified Regular Expression Pattern Tokenizer
This tokenizer is similar to the `PatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to construct distinct tokens for the input stream. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
*Factory class:* `solr.SimplePatternTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters to include in tokens. The matching is greedy, so the longest token matching at a given point is created. Empty tokens are never created.
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total state count for the determinized automaton computed from the regexp.
*Example:*
To match tokens delimited by simple whitespace characters:
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternTokenizerFactory" pattern="[^ \t\r\n]+"/>
</analyzer>
----
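With this configuration, punctuation stays attached to the tokens, since only whitespace delimits them. For example, the input below should tokenize as shown:
*In:* "To be, or what?"
*Out:* "To", "be,", "or", "what?"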
== Simplified Regular Expression Pattern Splitting Tokenizer
This tokenizer is similar to the `SimplePatternTokenizerFactory` described above, but uses Lucene {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] pattern matching to identify sequences of characters that should be used to split tokens. The syntax is more limited than `PatternTokenizerFactory`, but the tokenization is quite a bit faster.
*Factory class:* `solr.SimplePatternSplitTokenizerFactory`
*Arguments:*
`pattern`: (Required) The regular expression, as defined in the {lucene-javadocs}/core/org/apache/lucene/util/automaton/RegExp.html[`RegExp`] javadocs, identifying the characters that should split tokens. The matching is greedy, so the longest token separator matching at a given point is matched. Empty tokens are never created.
`maxDeterminizedStates`: (Optional, default 10000) The limit on the total state count for the determinized automaton computed from the regexp.
*Example:*
To match tokens delimited by simple whitespace characters:
[source,xml]
----
<analyzer>
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
----
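Since this splitting pattern is the complement of the matching pattern above, the output should be identical:
*In:* "To be, or what?"
*Out:* "To", "be,", "or", "what?"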
== UAX29 URL Email Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
* Periods (dots) that are not followed by whitespace are kept as part of the token.
* Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
* Recognizes and preserves as single tokens the following:
** Internet domain names containing top-level domains validated against the white list in the http://www.internic.net/zones/root.zone[IANA Root Zone Database] when the tokenizer was generated
** email addresses
** `file://`, `http(s)://`, and `ftp://` URLs
** IPv4 and IPv6 addresses
The UAX29 URL Email Tokenizer supports http://unicode.org/reports/tr29/#Word_Boundaries[Unicode standard annex UAX#29] word boundaries with the following token types: `<ALPHANUM>`, `<NUM>`, `<URL>`, `<EMAIL>`, `<SOUTHEAST_ASIAN>`, `<IDEOGRAPHIC>`, and `<HIRAGANA>`.
*Factory class:* `solr.UAX29URLEmailTokenizerFactory`
*Arguments:*
`maxTokenLength`: (integer, default 255) Solr ignores tokens that exceed the number of characters specified by `maxTokenLength`.
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
----
*In:* "Visit http://accarol.com/contact.htm?from=external&a=10 or e-mail bob.cratchet@accarol.com"
*Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"
== White Space Tokenizer
Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace characters as tokens. Note that any punctuation _will_ be included in the tokens.
*Factory class:* `solr.WhitespaceTokenizerFactory`
*Arguments:*
`rule`: Specifies how to define whitespace for the purpose of tokenization. Valid values:
* `java`: (Default) Uses {java-javadocs}java/lang/Character.html#isWhitespace-int-[Character.isWhitespace(int)]
* `unicode`: Uses Unicode's WHITESPACE property
*Example:*
[source,xml]
----
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
</analyzer>
----
*In:* "To be, or what?"
*Out:* "To", "be,", "or", "what?"
== OpenNLP Tokenizer and OpenNLP Filters
See <<language-analysis.adoc#opennlp-integration,OpenNLP Integration>> for information about using the OpenNLP Tokenizer, along with information about available OpenNLP token filters.