solr/solr-ref-guide/src/detecting-languages-during-indexing.adoc - lucene-solr - Git at Google

 = Detecting Languages During Indexing
 :page-shortname: detecting-languages-during-indexing
 :page-permalink: detecting-languages-during-indexing.html
 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 Solr can identify languages and map text to language-specific fields during indexing using the `langid` UpdateRequestProcessor.

 Solr supports two implementations of this feature:

 * Tika's language detection feature: http://tika.apache.org/0.10/detection.html
 * LangDetect language detection: https://github.com/shuyo/language-detection

 You can see a comparison between the two implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.

 For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.

 For more information about language analysis in Solr, see <<language-analysis.adoc#language-analysis,Language Analysis>>.

 [[DetectingLanguagesDuringIndexing-ConfiguringLanguageDetection]]
 == Configuring Language Detection

 You can configure the `langid` UpdateRequestProcessor in `solrconfig.xml`. Both implementations take the same parameters, which are described in the following section. At a minimum, you must specify the fields for language identification and a field for the resulting language code.

 [[DetectingLanguagesDuringIndexing-ConfiguringTikaLanguageDetection]]
 === Configuring Tika Language Detection

 Here is an example of a minimal Tika `langid` configuration in `solrconfig.xml`:

 [source,xml]
 ----
 <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </lst>
 </processor>
 ----

 [[DetectingLanguagesDuringIndexing-ConfiguringLangDetectLanguageDetection]]
 === Configuring LangDetect Language Detection

 Here is an example of a minimal LangDetect `langid` configuration in `solrconfig.xml`:

 [source,xml]
 ----
 <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,subject,text,keywords</str>
     <str name="langid.langField">language_s</str>
   </lst>
 </processor>
 ----

 [[DetectingLanguagesDuringIndexing-langidParameters]]
 == `langid` Parameters

 As previously mentioned, both implementations of the `langid` UpdateRequestProcessor take the same parameters.

 // TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

 [cols="30,10,10,10,40",options="header"]
 |===
 |Parameter |Type |Default |Required |Description
 |langid |Boolean |true |no |Enables and disables language detection.
 |langid.fl |string |none |yes |A comma- or space-delimited list of fields to be processed by `langid`.
 |langid.langField |string |none |yes |Specifies the field for the returned language code.
 |langid.langsField |multivalued string |none |no |Specifies the field for a list of returned language codes. If you use `langid.map.individual`, each detected language will be added to this field.
 |langid.overwrite |Boolean |false |no |Specifies whether the content of the `langField` and `langsField` fields will be overwritten if they already contain values.
 |langid.lcmap |string |none |false |A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common `cjk` code, and map both American and British English to a single `en` code by using `langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`. This affects both the values put into the `langField` and `langsField` fields, as well as the field suffixes when using `langid.map`, unless overridden by `langid.map.lcmap`
 |langid.threshold |float |0.5 |no |Specifies a threshold value between 0 and 1 that the language identification score must reach before `langid` accepts it. With longer text fields, a high threshold such at 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
 |langid.whitelist |string |none |no |Specifies a list of allowed language identification codes. Use this in combination with `langid.map` to ensure that you only index documents into fields that are in your schema.
 |langid.map |Boolean |false |no |Enables field name mapping. If true, Solr will map field names for all fields listed in `langid.fl`.
 |langid.map.fl |string |none |no |A comma-separated list of fields for `langid.map` that is different than the fields specified in `langid.fl`.
 |langid.map.keepOrig |Boolean |false |no |If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
 |langid.map.individual |Boolean |false |no |If true, Solr will detect and map languages for each field individually.
 |langid.map.individual.fl |string |none |no |A comma-separated list of fields for use with `langid.map.individual` that is different than the fields specified in `langid.fl`.
 |langid.fallbackFields |string |none |no |If no language is detected that meets the `langid.threshold` score, or if the detected language is not on the `langid.whitelist`, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in `langid.fallback`.
 |langid.fallback |string |none |no |Specifies a language code to use if no language is detected or specified in `langid.fallbackFields`.
 |langid.map.lcmap |string |determined by `langid.lcmap` |no |A space-separated list specifying colon delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common `*_cjk` suffix, and map both American and British English fields to a single `*_en` by using `langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`.
 |langid.map.pattern |Java regular expression |none |no |By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
 |langid.map.replace |Java replace |none |no |By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.
 |langid.enforceSchema |Boolean |true |no |If false, the `langid` processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.
 |===
	= Detecting Languages During Indexing
	:page-shortname: detecting-languages-during-indexing
	:page-permalink: detecting-languages-during-indexing.html
	// Licensed to the Apache Software Foundation (ASF) under one
	// or more contributor license agreements. See the NOTICE file
	// distributed with this work for additional information
	// regarding copyright ownership. The ASF licenses this file
	// to you under the Apache License, Version 2.0 (the
	// "License"); you may not use this file except in compliance
	// with the License. You may obtain a copy of the License at
	//
	// http://www.apache.org/licenses/LICENSE-2.0
	//
	// Unless required by applicable law or agreed to in writing,
	// software distributed under the License is distributed on an
	// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	// KIND, either express or implied. See the License for the
	// specific language governing permissions and limitations
	// under the License.

	Solr can identify languages and map text to language-specific fields during indexing using the `langid` UpdateRequestProcessor.

	Solr supports two implementations of this feature:

	* Tika's language detection feature: http://tika.apache.org/0.10/detection.html
	* LangDetect language detection: https://github.com/shuyo/language-detection

	You can see a comparison between the two implementations here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, the LangDetect implementation supports more languages with higher performance.

	For specific information on each of these language identification implementations, including a list of supported languages for each, see the relevant project websites.

	For more information about language analysis in Solr, see <<language-analysis.adoc#language-analysis,Language Analysis>>.

	[[DetectingLanguagesDuringIndexing-ConfiguringLanguageDetection]]
	== Configuring Language Detection

	You can configure the `langid` UpdateRequestProcessor in `solrconfig.xml`. Both implementations take the same parameters, which are described in the following section. At a minimum, you must specify the fields for language identification and a field for the resulting language code.

	[[DetectingLanguagesDuringIndexing-ConfiguringTikaLanguageDetection]]
	=== Configuring Tika Language Detection

	Here is an example of a minimal Tika `langid` configuration in `solrconfig.xml`:

	[source,xml]
	----
	<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
	<lst name="defaults">
	<str name="langid.fl">title,subject,text,keywords</str>
	<str name="langid.langField">language_s</str>
	</lst>
	</processor>
	----

	[[DetectingLanguagesDuringIndexing-ConfiguringLangDetectLanguageDetection]]
	=== Configuring LangDetect Language Detection

	Here is an example of a minimal LangDetect `langid` configuration in `solrconfig.xml`:

	[source,xml]
	----
	<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
	<lst name="defaults">
	<str name="langid.fl">title,subject,text,keywords</str>
	<str name="langid.langField">language_s</str>
	</lst>
	</processor>
	----

	[[DetectingLanguagesDuringIndexing-langidParameters]]
	== `langid` Parameters

	As previously mentioned, both implementations of the `langid` UpdateRequestProcessor take the same parameters.

	// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

	[cols="30,10,10,10,40",options="header"]
	\|===
	\|Parameter \|Type \|Default \|Required \|Description
	\|langid \|Boolean \|true \|no \|Enables and disables language detection.
	\|langid.fl \|string \|none \|yes \|A comma- or space-delimited list of fields to be processed by `langid`.
	\|langid.langField \|string \|none \|yes \|Specifies the field for the returned language code.
	\|langid.langsField \|multivalued string \|none \|no \|Specifies the field for a list of returned language codes. If you use `langid.map.individual`, each detected language will be added to this field.
	\|langid.overwrite \|Boolean \|false \|no \|Specifies whether the content of the `langField` and `langsField` fields will be overwritten if they already contain values.
	\|langid.lcmap \|string \|none \|false \|A space-separated list specifying colon delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common `cjk` code, and map both American and British English to a single `en` code by using `langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`. This affects both the values put into the `langField` and `langsField` fields, as well as the field suffixes when using `langid.map`, unless overridden by `langid.map.lcmap`
	\|langid.threshold \|float \|0.5 \|no \|Specifies a threshold value between 0 and 1 that the language identification score must reach before `langid` accepts it. With longer text fields, a high threshold such at 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
	\|langid.whitelist \|string \|none \|no \|Specifies a list of allowed language identification codes. Use this in combination with `langid.map` to ensure that you only index documents into fields that are in your schema.
	\|langid.map \|Boolean \|false \|no \|Enables field name mapping. If true, Solr will map field names for all fields listed in `langid.fl`.
	\|langid.map.fl \|string \|none \|no \|A comma-separated list of fields for `langid.map` that is different than the fields specified in `langid.fl`.
	\|langid.map.keepOrig \|Boolean \|false \|no \|If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
	\|langid.map.individual \|Boolean \|false \|no \|If true, Solr will detect and map languages for each field individually.
	\|langid.map.individual.fl \|string \|none \|no \|A comma-separated list of fields for use with `langid.map.individual` that is different than the fields specified in `langid.fl`.
	\|langid.fallbackFields \|string \|none \|no \|If no language is detected that meets the `langid.threshold` score, or if the detected language is not on the `langid.whitelist`, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in `langid.fallback`.
	\|langid.fallback \|string \|none \|no \|Specifies a language code to use if no language is detected or specified in `langid.fallbackFields`.
	\|langid.map.lcmap \|string \|determined by `langid.lcmap` \|no \|A space-separated list specifying colon delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common `_cjk` suffix, and map both American and British English fields to a single `_en` by using `langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en`.
	\|langid.map.pattern \|Java regular expression \|none \|no \|By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
	\|langid.map.replace \|Java replace \|none \|no \|By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.
	\|langid.enforceSchema \|Boolean \|true \|no \|If false, the `langid` processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.
	\|===