 = Understanding Analyzers, Tokenizers, and Filters
 :page-children: analyzers, about-tokenizers, about-filters, tokenizers, filter-descriptions, charfilterfactories, language-analysis, phonetic-matching, running-your-analyzer
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 The following sections describe how Solr breaks down and works with textual data. There are three main concepts to understand: analyzers, tokenizers, and filters.

 * <<analyzers.adoc#,Field analyzers>> are used both during ingestion, when a document is indexed, and at query time. An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or they may be composed of a series of tokenizer and filter classes.
 * <<about-tokenizers.adoc#,Tokenizers>> break field data into lexical units, or _tokens_.
 * <<about-filters.adoc#,Filters>> examine a stream of tokens and keep them, transform or discard them, or create new ones. Tokenizers and filters may be combined to form pipelines, or _chains_, where the output of one is input to the next. Such a sequence of tokenizers and filters is called an _analyzer_ and the resulting output of an analyzer is used to match query results or build indices.
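
 A chain like the one described above might be declared in a schema roughly as follows. This is a minimal sketch: the field type name `text_example` is hypothetical, and the particular tokenizer and filters are chosen only for illustration, not as a recommended configuration.

 [source,xml]
 ----
 <fieldType name="text_example" class="solr.TextField">
   <analyzer>
     <!-- The tokenizer breaks the field text into tokens -->
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <!-- Each filter receives the token stream produced by the stage before it -->
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>
 ----

 Because no `type` attribute is given on the `<analyzer>` element in this sketch, the same chain would apply at both index and query time.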

 == Using Analyzers, Tokenizers, and Filters

 Although the analysis process is used for both indexing and querying, the same analysis process need not be used for both operations. For indexing, you often want to simplify, or normalize, words: for example, by setting all letters to lowercase, eliminating punctuation and accents, or mapping words to their stems. Doing so can increase recall because, for example, "ram", "Ram" and "RAM" would all match a query for "ram". To increase query-time precision, a filter could be employed to narrow the matches by, for example, ignoring all-caps acronyms if you're interested in male sheep but not Random Access Memory.
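
 Separate index-time and query-time analysis can be sketched with two `<analyzer>` elements distinguished by their `type` attribute. The field type name `text_split_example` and the file name `synonyms.txt` are assumptions made for this illustration; the synonym file would have to exist in your configuration for the example to load.

 [source,xml]
 ----
 <fieldType name="text_split_example" class="solr.TextField">
   <!-- Applied when documents are indexed: normalize case and accents -->
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
   <!-- Applied to query text: same normalization, plus synonym expansion -->
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
   </analyzer>
 </fieldType>
 ----

 Both chains in this sketch share the same normalization steps so that query terms take the same form as the indexed terms; if the two chains normalize differently, queries may fail to match documents you expect them to find.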

 The tokens output by the analysis process define the values, or _terms_, of a field. They are used either to build an index of those terms when a new document is added, or to identify which documents contain the terms you are querying for.

 === For More Information

 These sections show you how to configure field analyzers and also serve as a reference for the details of configuring each of the available tokenizer and filter classes. They also serve as a guide so that you can configure your own analysis classes if you have special needs that cannot be met with the included filters or tokenizers.

 *For Analyzers, see:*

 * <<analyzers.adoc#,Analyzers>>: Detailed conceptual information about Solr analyzers.
 * <<running-your-analyzer.adoc#,Running Your Analyzer>>: Detailed information about testing and running your Solr analyzer.

 *For Tokenizers, see:*

 * <<about-tokenizers.adoc#,About Tokenizers>>: Detailed conceptual information about Solr tokenizers.
 * <<tokenizers.adoc#,Tokenizers>>: Information about configuring tokenizers, and about the tokenizer factory classes included in this distribution of Solr.

 *For Filters, see:*

 * <<about-filters.adoc#,About Filters>>: Detailed conceptual information about Solr filters.
 * <<filter-descriptions.adoc#,Filter Descriptions>>: Information about configuring filters, and about the filter factory classes included in this distribution of Solr.
 * <<charfilterfactories.adoc#,CharFilterFactories>>: Information about filters for pre-processing input characters.

 *To find out how to use Tokenizers and Filters with various languages, see:*

 * <<language-analysis.adoc#,Language Analysis>>: Information about tokenizers and filters for character set conversion or for use with specific languages.