| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!-- :Post-Release-Update-Version.LUCENE_XY: - several mentions in this file --> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <title> |
| Apache Lucene ICU integration module |
| </title> |
| </head> |
| <body> |
| <p> |
| This module exposes functionality from |
| <a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java |
| library that enhances Java's internationalization support by improving |
| performance, keeping current with the Unicode Standard, and providing richer |
| APIs. |
| <p> |
| For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation. |
| <p> |
| This module exposes the following functionality: |
| </p> |
| <ul> |
| <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on |
| properties and rules defined in Unicode.</li> |
| <li><a href="#collation">Collation</a>: Compare strings according to the |
| conventions and standards of a particular language, region or country.</li> |
| <li><a href="#normalization">Normalization</a>: Converts text to a unique, |
| equivalent form.</li> |
| <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with |
| Unicode's Default Caseless Matching algorithm.</li> |
| <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions |
| (such as accent marks) between similar characters for a loose or fuzzy search.</li> |
| <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in |
| a context-sensitive fashion: e.g. mapping Traditional to Simplified Chinese</li> |
| </ul> |
| <hr/> |
| <h1><a name="segmentation">Text Segmentation</a></h1> |
| <p> |
| Text Segmentation (Tokenization) divides document and query text into index terms |
| (typically words). Unicode provides special properties and rules so that this can |
| be done in a manner that works well with most languages. |
| </p> |
| <p> |
| Text Segmentation implements the word segmentation specified in |
| <a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>. |
| Additionally the algorithm can be tailored based on writing system, for example |
| text in the Thai script is automatically delegated to a dictionary-based segmentation |
| algorithm. |
| </p> |
| <h2>Use Cases</h2> |
| <ul> |
| <li> |
| As a more thorough replacement for StandardTokenizer that works well for |
| most languages. |
| </li> |
| </ul> |
| <h2>Example Usages</h2> |
| <h3>Tokenizing multilanguage text</h3> |
| <pre class="prettyprint"> |
| /** |
| * This tokenizer will work well in general for most languages. |
| */ |
| Tokenizer tokenizer = new ICUTokenizer(reader); |
| </pre> |
| <hr/> |
| <h1><a name="collation">Collation</a></h1> |
| <p> |
| <code>ICUCollationKeyAnalyzer</code> |
| converts each token into its binary <code>CollationKey</code> using the |
| provided <code>Collator</code>, allowing it to be |
| stored as an index term. |
| </p> |
| <p> |
| <code>ICUCollationKeyAnalyzer</code> depends on ICU4J to produce the |
| <code>CollationKey</code>s. |
| </p> |
| |
| <h2>Use Cases</h2> |
| |
| <ul> |
| <li> |
| Efficient sorting of terms in languages that use non-Unicode character |
| orderings. (Lucene Sort using a Locale can be very slow.) |
| </li> |
| <li> |
| Efficient range queries over fields that contain terms in languages that |
| use non-Unicode character orderings. (Range queries using a Locale can be |
| very slow.) |
| </li> |
| <li> |
| Effective Locale-specific normalization (case differences, diacritics, etc.). |
| ({@link org.apache.lucene.analysis.core.LowerCaseFilter} and |
| {@link org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter} provide these services |
| in a generic way that doesn't take into account locale-specific needs.) |
| </li> |
| </ul> |
| |
| <h2>Example Usages</h2> |
| |
| <h3>Farsi Range Queries</h3> |
| <pre class="prettyprint"> |
| Collator collator = Collator.getInstance(new ULocale("ar")); |
| ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, collator); |
| RAMDirectory ramDir = new RAMDirectory(); |
| IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_5_0, analyzer)); |
| Document doc = new Document(); |
| doc.add(new Field("content", "\u0633\u0627\u0628", |
| Field.Store.YES, Field.Index.ANALYZED)); |
| writer.addDocument(doc); |
| writer.close(); |
| IndexSearcher is = new IndexSearcher(ramDir, true); |
| |
| QueryParser aqp = new QueryParser(Version.LUCENE_5_0, "content", analyzer); |
| aqp.setAnalyzeRangeTerms(true); |
| |
| // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi |
| // orders the U+0698 character before the U+0633 character, so the single |
| // indexed Term above should NOT be returned by a ConstantScoreRangeQuery |
| // with a Farsi Collator (or an Arabic one for the case when Farsi is not |
| // supported). |
| ScoreDoc[] result |
| = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs; |
| assertEquals("The index Term should not be included.", 0, result.length); |
| </pre> |
| |
| <h3>Danish Sorting</h3> |
| <pre class="prettyprint"> |
| Analyzer analyzer |
| = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, Collator.getInstance(new ULocale("da", "dk"))); |
| RAMDirectory indexStore = new RAMDirectory(); |
| IndexWriter writer = new IndexWriter(indexStore, new IndexWriterConfig(Version.LUCENE_5_0, analyzer)); |
| String[] tracer = new String[] { "A", "B", "C", "D", "E" }; |
| String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" }; |
| String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" }; |
| for (int i = 0 ; i < data.length ; ++i) { |
| Document doc = new Document(); |
| doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO)); |
| doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED)); |
| writer.addDocument(doc); |
| } |
| writer.close(); |
| IndexSearcher searcher = new IndexSearcher(indexStore, true); |
| Sort sort = new Sort(); |
| sort.setSort(new SortField("contents", SortField.STRING)); |
| Query query = new MatchAllDocsQuery(); |
| ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs; |
| for (int i = 0 ; i < result.length ; ++i) { |
| Document doc = searcher.doc(result[i].doc); |
| assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]); |
| } |
| </pre> |
| |
| <h3>Turkish Case Normalization</h3> |
| <pre class="prettyprint"> |
| Collator collator = Collator.getInstance(new ULocale("tr", "TR")); |
| collator.setStrength(Collator.PRIMARY); |
| Analyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, collator); |
| RAMDirectory ramDir = new RAMDirectory(); |
| IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_5_0, analyzer)); |
| Document doc = new Document(); |
| doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED)); |
| writer.addDocument(doc); |
| writer.close(); |
| IndexSearcher is = new IndexSearcher(ramDir, true); |
| QueryParser parser = new QueryParser(Version.LUCENE_5_0, "contents", analyzer); |
| Query query = parser.parse("d\u0131gy"); // U+0131: dotless i |
| ScoreDoc[] result = is.search(query, null, 1000).scoreDocs; |
| assertEquals("The index Term should be included.", 1, result.length); |
| </pre> |
| |
| <h2>Caveats and Comparisons</h2> |
| <p> |
| <strong>WARNING:</strong> Make sure you use exactly the same |
| <code>Collator</code> at index and query time -- <code>CollationKey</code>s |
| are only comparable when produced by |
| the same <code>Collator</code>. Since {@link java.text.RuleBasedCollator}s |
| are not independently versioned, it is unsafe to search against stored |
| <code>CollationKey</code>s unless the following are exactly the same (best |
| practice is to store this information with the index and check that they |
| remain the same at query time): |
| </p> |
| <ol> |
| <li>JVM vendor</li> |
| <li>JVM version, including patch version</li> |
| <li> |
| The language (and country and variant, if specified) of the Locale |
| used when constructing the collator via |
| {@link java.text.Collator#getInstance(java.util.Locale)}. |
| </li> |
| <li> |
| The collation strength used - see {@link java.text.Collator#setStrength(int)} |
| </li> |
| </ol> |
| <p> |
| <code>ICUCollationKeyAnalyzer</code> uses ICU4J's <code>Collator</code>, which |
| makes its version available, thus allowing collation to be versioned |
| independently from the JVM. <code>ICUCollationKeyAnalyzer</code> is also |
| significantly faster and generates significantly shorter keys than |
| <code>CollationKeyAnalyzer</code>. See |
| <a href="http://site.icu-project.org/charts/collation-icu4j-sun" |
| >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key |
| generation timing and key length comparisons between ICU4J and |
| <code>java.text.Collator</code> over several languages. |
| </p> |
| <p> |
| <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are |
| not compatible with those those generated by ICU Collators. Specifically, if |
| you use <code>CollationKeyAnalyzer</code> to generate index terms, do not use |
| <code>ICUCollationKeyAnalyzer</code> on the query side, or vice versa. |
| </p> |
| <hr/> |
| <h1><a name="normalization">Normalization</a></h1> |
| <p> |
| <code>ICUNormalizer2Filter</code> normalizes term text to a |
| <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so |
| that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a> |
| forms are standardized to a unique form. |
| </p> |
| <h2>Use Cases</h2> |
| <ul> |
| <li> Removing differences in width for Asian-language text. |
| </li> |
| <li> Standardizing complex text with non-spacing marks so that characters are |
| ordered consistently. |
| </li> |
| </ul> |
| <h2>Example Usages</h2> |
| <h3>Normalizing text to NFC</h3> |
| <pre class="prettyprint"> |
| /** |
| * Normalizer2 objects are unmodifiable and immutable. |
| */ |
| Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE); |
| /** |
| * This filter will normalize to NFC. |
| */ |
| TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer); |
| </pre> |
| <hr/> |
| <h1><a name="casefolding">Case Folding</a></h1> |
| <p> |
| Default caseless matching, or case-folding is more than just conversion to |
| lowercase. For example, it handles cases such as the Greek sigma, so that |
| "Μάϊος" and "ΜΆΪΟΣ" will match correctly. |
| </p> |
| <p> |
| Case-folding is still only an approximation of the language-specific rules |
| governing case. If the specific language is known, consider using |
| ICUCollationKeyFilter and indexing collation keys instead. This implementation |
| performs the "full" case-folding specified in the Unicode standard, and this |
| may change the length of the term. For example, the German ß is case-folded |
| to the string 'ss'. |
| </p> |
| <p> |
| Case folding is related to normalization, and as such is coupled with it in |
| this integration. To perform case-folding, you use normalization with the form |
| "nfkc_cf" (which is the default). |
| </p> |
| <h2>Use Cases</h2> |
| <ul> |
| <li> |
| As a more thorough replacement for LowerCaseFilter that has good behavior |
| for most languages. |
| </li> |
| </ul> |
| <h2>Example Usages</h2> |
| <h3>Lowercasing text</h3> |
| <pre class="prettyprint"> |
| /** |
| * This filter will case-fold and normalize to NFKC. |
| */ |
| TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer); |
| </pre> |
| <hr/> |
| <h1><a name="searchfolding">Search Term Folding</a></h1> |
| <p> |
| Search term folding removes distinctions (such as accent marks) between |
| similar characters. It is useful for a fuzzy or loose search. |
| </p> |
| <p> |
| Search term folding implements many of the foldings specified in |
| <a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a> |
| as a special normalization form. This folding applies NFKC, Case Folding, and |
| many character foldings recursively. |
| </p> |
| <h2>Use Cases</h2> |
| <ul> |
| <li> |
| As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter |
| that applies the same ideas to many more languages. |
| </li> |
| </ul> |
| <h2>Example Usages</h2> |
| <h3>Removing accents</h3> |
| <pre class="prettyprint"> |
| /** |
| * This filter will case-fold, remove accents and other distinctions, and |
| * normalize to NFKC. |
| */ |
| TokenStream tokenstream = new ICUFoldingFilter(tokenizer); |
| </pre> |
| <hr/> |
| <h1><a name="transform">Text Transformation</a></h1> |
| <p> |
| ICU provides text-transformation functionality via its Transliteration API. This allows |
| you to transform text in a variety of ways, taking context into account. |
| </p> |
| <p> |
| For more information, see the |
| <a href="http://userguide.icu-project.org/transforms/general">User's Guide</a> |
| and |
| <a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>. |
| </p> |
| <h2>Use Cases</h2> |
| <ul> |
| <li> |
| Convert Traditional to Simplified |
| </li> |
| <li> |
| Transliterate between different writing systems: e.g. Romanization |
| </li> |
| </ul> |
| <h2>Example Usages</h2> |
| <h3>Convert Traditional to Simplified</h3> |
| <pre class="prettyprint"> |
| /** |
| * This filter will map Traditional Chinese to Simplified Chinese |
| */ |
| TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified")); |
| </pre> |
| <h3>Transliterate Serbian Cyrillic to Serbian Latin</h3> |
| <pre class="prettyprint"> |
| /** |
| * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules |
| */ |
| TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN")); |
| </pre> |
| <hr/> |
| <h1><a name="backcompat">Backwards Compatibility</a></h1> |
| <p> |
| This module exists to provide up-to-date Unicode functionality that supports |
| the most recent version of Unicode (currently 6.3). However, some users who wish |
| for stronger backwards compatibility can restrict |
| {@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only |
| a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}. |
| </p> |
| <h2>Example Usages</h2> |
| <h3>Restricting normalization to Unicode 5.0</h3> |
| <pre class="prettyprint"> |
| /** |
| * This filter will do NFC normalization, but will ignore any characters that |
| * did not exist as of Unicode 5.0. Because of the normalization stability policy |
| * of Unicode, this is an easy way to force normalization to a specific version. |
| */ |
| Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE); |
| UnicodeSet set = new UnicodeSet("[:age=5.0:]"); |
| // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer |
| set.freeze(); |
| FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set); |
| TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50); |
| </pre> |
| </body> |
| </html> |