
 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->
 <!-- :Post-Release-Update-Version.LUCENE_XY: - several mentions in this file -->
 <html>
   <head>
     <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
     <title>
       Apache Lucene ICU integration module
     </title>
   </head>
 <body>
 <p>
 This module exposes functionality from
 <a href="http://site.icu-project.org/">ICU</a> to Apache Lucene. ICU4J is a Java
 library that enhances Java's internationalization support by improving
 performance, keeping current with the Unicode Standard, and providing richer
 APIs.
 <p>
 For an introduction to Lucene's analysis API, see the {@link org.apache.lucene.analysis} package documentation.
 <p>
 This module exposes the following functionality:
 </p>
 <ul>
   <li><a href="#segmentation">Text Segmentation</a>: Tokenizes text based on
   properties and rules defined in Unicode.</li>
   <li><a href="#collation">Collation</a>: Compares strings according to the
   conventions and standards of a particular language, region or country.</li>
   <li><a href="#normalization">Normalization</a>: Converts text to a unique,
   equivalent form.</li>
   <li><a href="#casefolding">Case Folding</a>: Removes case distinctions with
   Unicode's Default Caseless Matching algorithm.</li>
   <li><a href="#searchfolding">Search Term Folding</a>: Removes distinctions
   (such as accent marks) between similar characters for a loose or fuzzy search.</li>
   <li><a href="#transform">Text Transformation</a>: Transforms Unicode text in
   a context-sensitive fashion, for example mapping Traditional to Simplified Chinese.</li>
 </ul>
 <hr/>
 <h1><a name="segmentation">Text Segmentation</a></h1>
 <p>
 Text Segmentation (Tokenization) divides document and query text into index terms
 (typically words). Unicode provides special properties and rules so that this can
 be done in a manner that works well with most languages.
 </p>
 <p>
 Text Segmentation implements the word segmentation specified in
 <a href="http://unicode.org/reports/tr29/">Unicode Text Segmentation</a>.
 Additionally, the algorithm can be tailored by writing system: for example,
 text in the Thai script is automatically delegated to a dictionary-based segmentation
 algorithm.
 </p>
 <h2>Use Cases</h2>
 <ul>
   <li>
     As a more thorough replacement for StandardTokenizer that works well for
     most languages.
   </li>
 </ul>
 <h2>Example Usages</h2>
 <h3>Tokenizing multilanguage text</h3>
 <pre class="prettyprint">
   /**
    * This tokenizer will work well in general for most languages.
    */
   Tokenizer tokenizer = new ICUTokenizer(reader);
 </pre>
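 <p>
 To see what rule-based word segmentation produces without setting up an index,
 the following minimal sketch uses the JDK's <code>java.text.BreakIterator</code>
 as a stand-in. It is not <code>ICUTokenizer</code> and does not include the
 script-specific tailorings described above; the class name
 <code>WordBreakSketch</code> and the letter-or-digit filter are assumptions made
 for illustration.
 </p>

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative stand-in only: uses the JDK's BreakIterator, not ICUTokenizer.
class WordBreakSketch {
  /** Returns the segments between word boundaries that contain a letter or digit. */
  static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String segment = text.substring(start, end);
      // Skip segments made only of whitespace or punctuation.
      if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
        out.add(segment);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, world")); // prints [Hello, world]
  }
}
```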
 <hr/>
 <h1><a name="collation">Collation</a></h1>
 <p>
   <code>ICUCollationKeyAnalyzer</code>
   converts each token into its binary <code>CollationKey</code> using the
   provided <code>Collator</code>, allowing it to be
   stored as an index term.
 </p>
 <p>
   <code>ICUCollationKeyAnalyzer</code> depends on ICU4J to produce the
   <code>CollationKey</code>s.
 </p>
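 <p>
 The idea of a binary collation key can be tried with the JDK alone. The sketch
 below uses <code>java.text.Collator</code> rather than ICU4J's
 <code>Collator</code>, so the key bytes differ from what
 <code>ICUCollationKeyAnalyzer</code> would index; the class and method names are
 hypothetical. It shows that keys can be compared byte by byte and that the
 collation strength decides which differences survive in the key.
 </p>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Illustrative stand-in only: java.text.Collator instead of ICU4J's Collator.
class CollationKeySketch {
  /** Returns the binary sort key for text under an English collator of the given strength. */
  static byte[] sortKey(String text, int strength) {
    Collator collator = Collator.getInstance(Locale.ENGLISH);
    collator.setStrength(strength);
    return collator.getCollationKey(text).toByteArray();
  }

  public static void main(String[] args) {
    // At PRIMARY strength case differences are ignored, so the keys are identical.
    System.out.println(Arrays.equals(
        sortKey("hat", Collator.PRIMARY), sortKey("HAT", Collator.PRIMARY)));
    // Keys are ordered by unsigned byte comparison, most significant byte first.
    System.out.println(Arrays.compareUnsigned(
        sortKey("hat", Collator.PRIMARY), sortKey("hut", Collator.PRIMARY)) < 0);
  }
}
```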

 <h2>Use Cases</h2>

 <ul>
   <li>
     Efficient sorting of terms in languages whose character ordering differs
     from Unicode code point order.  (Lucene Sort using a Locale can be very slow.)
   </li>
   <li>
     Efficient range queries over fields that contain terms in languages whose
     character ordering differs from Unicode code point order.  (Range queries
     using a Locale can be very slow.)
   </li>
   <li>
     Effective Locale-specific normalization (case differences, diacritics, etc.).
     ({@link org.apache.lucene.analysis.core.LowerCaseFilter} and
     {@link org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter} provide these services
     in a generic way that doesn't take into account locale-specific needs.)
   </li>
 </ul>

 <h2>Example Usages</h2>

 <h3>Farsi Range Queries</h3>
 <pre class="prettyprint">
   Collator collator = Collator.getInstance(new ULocale("ar"));
   ICUCollationKeyAnalyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, collator);
   RAMDirectory ramDir = new RAMDirectory();
   IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_5_0, analyzer));
   Document doc = new Document();
   doc.add(new Field("content", "\u0633\u0627\u0628",
                     Field.Store.YES, Field.Index.ANALYZED));
   writer.addDocument(doc);
   writer.close();
   IndexSearcher is = new IndexSearcher(ramDir, true);

   QueryParser aqp = new QueryParser(Version.LUCENE_5_0, "content", analyzer);
   aqp.setAnalyzeRangeTerms(true);

   // Unicode order would include U+0633 in [ U+062F - U+0698 ], but Farsi
   // orders the U+0698 character before the U+0633 character, so the single
   // indexed Term above should NOT be returned by a ConstantScoreRangeQuery
   // with a Farsi Collator (or an Arabic one for the case when Farsi is not
   // supported).
   ScoreDoc[] result
     = is.search(aqp.parse("[ \u062F TO \u0698 ]"), null, 1000).scoreDocs;
   assertEquals("The index Term should not be included.", 0, result.length);
 </pre>

 <h3>Danish Sorting</h3>
 <pre class="prettyprint">
   Analyzer analyzer
     = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, Collator.getInstance(new ULocale("da", "dk")));
   RAMDirectory indexStore = new RAMDirectory();
   IndexWriter writer = new IndexWriter(indexStore, new IndexWriterConfig(Version.LUCENE_5_0, analyzer));
   String[] tracer = new String[] { "A", "B", "C", "D", "E" };
   String[] data = new String[] { "HAT", "HUT", "H\u00C5T", "H\u00D8T", "HOT" };
   String[] sortedTracerOrder = new String[] { "A", "E", "B", "D", "C" };
   for (int i = 0 ; i < data.length ; ++i) {
     Document doc = new Document();
     doc.add(new Field("tracer", tracer[i], Field.Store.YES, Field.Index.NO));
     doc.add(new Field("contents", data[i], Field.Store.NO, Field.Index.ANALYZED));
     writer.addDocument(doc);
   }
   writer.close();
   IndexSearcher searcher = new IndexSearcher(indexStore, true);
   Sort sort = new Sort();
   sort.setSort(new SortField("contents", SortField.STRING));
   Query query = new MatchAllDocsQuery();
   ScoreDoc[] result = searcher.search(query, null, 1000, sort).scoreDocs;
   for (int i = 0 ; i < result.length ; ++i) {
     Document doc = searcher.doc(result[i].doc);
     assertEquals(sortedTracerOrder[i], doc.getValues("tracer")[0]);
   }
 </pre>

 <h3>Turkish Case Normalization</h3>
 <pre class="prettyprint">
   Collator collator = Collator.getInstance(new ULocale("tr", "TR"));
   collator.setStrength(Collator.PRIMARY);
   Analyzer analyzer = new ICUCollationKeyAnalyzer(Version.LUCENE_5_0, collator);
   RAMDirectory ramDir = new RAMDirectory();
   IndexWriter writer = new IndexWriter(ramDir, new IndexWriterConfig(Version.LUCENE_5_0, analyzer));
   Document doc = new Document();
   doc.add(new Field("contents", "DIGY", Field.Store.NO, Field.Index.ANALYZED));
   writer.addDocument(doc);
   writer.close();
   IndexSearcher is = new IndexSearcher(ramDir, true);
   QueryParser parser = new QueryParser(Version.LUCENE_5_0, "contents", analyzer);
   Query query = parser.parse("d\u0131gy");   // U+0131: dotless i
   ScoreDoc[] result = is.search(query, null, 1000).scoreDocs;
   assertEquals("The index Term should be included.", 1, result.length);
 </pre>

 <h2>Caveats and Comparisons</h2>
 <p>
   <strong>WARNING:</strong> Make sure you use exactly the same
   <code>Collator</code> at index and query time -- <code>CollationKey</code>s
   are only comparable when produced by
   the same <code>Collator</code>.  Since {@link java.text.RuleBasedCollator}s
   are not independently versioned, it is unsafe to search against stored
   <code>CollationKey</code>s unless the following are exactly the same (best
   practice is to store this information with the index and check that they
   remain the same at query time):
 </p>
 <ol>
   <li>JVM vendor</li>
   <li>JVM version, including patch version</li>
   <li>
     The language (and country and variant, if specified) of the Locale
     used when constructing the collator via
     {@link java.text.Collator#getInstance(java.util.Locale)}.
   </li>
   <li>
     The collation strength used - see {@link java.text.Collator#setStrength(int)}
   </li>
 </ol>
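 <p>
 One way to follow the best practice above is to record a small fingerprint of
 these four values when the index is built and compare it before searching. The
 sketch below is an assumption about how that might look, not part of Lucene:
 the class name, the <code>|</code>-separated format, and where the string is
 stored are all hypothetical.
 </p>

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper: the name and the fingerprint format are assumptions for illustration.
class CollatorFingerprintSketch {
  /** Joins the values that java.text collation keys depend on into one string. */
  static String fingerprint(String jvmVendor, String jvmVersion, Locale locale, int strength) {
    return String.join("|", jvmVendor, jvmVersion,
        locale.toLanguageTag(), Integer.toString(strength));
  }

  /** Fingerprint for the running JVM. */
  static String current(Locale locale, int strength) {
    return fingerprint(System.getProperty("java.vendor"),
        System.getProperty("java.version"), locale, strength);
  }

  /** Fails fast if the stored fingerprint does not match the current one. */
  static void check(String stored, String current) {
    if (!stored.equals(current)) {
      throw new IllegalStateException(
          "Collation settings changed: indexed with " + stored + ", searching with " + current);
    }
  }

  public static void main(String[] args) {
    String atIndexTime = current(Locale.forLanguageTag("da-DK"), Collator.PRIMARY);
    // Store atIndexTime next to the index; at query time recompute and compare.
    check(atIndexTime, current(Locale.forLanguageTag("da-DK"), Collator.PRIMARY));
    System.out.println(atIndexTime);
  }
}
```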
 <p>
   <code>ICUCollationKeyAnalyzer</code> uses ICU4J's <code>Collator</code>, which
   makes its version available, thus allowing collation to be versioned
   independently from the JVM.  <code>ICUCollationKeyAnalyzer</code> is also
   significantly faster and generates significantly shorter keys than
   <code>CollationKeyAnalyzer</code>.  See
   <a href="http://site.icu-project.org/charts/collation-icu4j-sun"
     >http://site.icu-project.org/charts/collation-icu4j-sun</a> for key
   generation timing and key length comparisons between ICU4J and
   <code>java.text.Collator</code> over several languages.
 </p>
 <p>
   <code>CollationKey</code>s generated by <code>java.text.Collator</code>s are
   not compatible with those generated by ICU Collators.  Specifically, if
   you use <code>CollationKeyAnalyzer</code> to generate index terms, do not use
   <code>ICUCollationKeyAnalyzer</code> on the query side, or vice versa.
 </p>
 <hr/>
 <h1><a name="normalization">Normalization</a></h1>
 <p>
   <code>ICUNormalizer2Filter</code> normalizes term text to a
   <a href="http://unicode.org/reports/tr15/">Unicode Normalization Form</a>, so
   that <a href="http://en.wikipedia.org/wiki/Unicode_equivalence">equivalent</a>
   forms are standardized to a unique form.
 </p>
 <h2>Use Cases</h2>
 <ul>
   <li> Removing differences in width for Asian-language text.
   </li>
   <li> Standardizing complex text with non-spacing marks so that characters are
   ordered consistently.
   </li>
 </ul>
 <h2>Example Usages</h2>
 <h3>Normalizing text to NFC</h3>
 <pre class="prettyprint">
   /**
    * Normalizer2 instances are immutable.
    */
   Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
   /**
    * This filter will normalize to NFC.
    */
   TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, normalizer);
 </pre>
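 <p>
 The effect of the normalization forms on individual strings can be checked with
 the JDK's <code>java.text.Normalizer</code>, used here as a stand-in for ICU's
 <code>Normalizer2</code>. The class name is hypothetical. Note that width
 differences are removed by the compatibility form NFKC, not by NFC.
 </p>

```java
import java.text.Normalizer;

// Illustrative stand-in only: java.text.Normalizer instead of ICU's Normalizer2.
class NormalizationSketch {
  static String nfc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  static String nfkc(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFKC);
  }

  public static void main(String[] args) {
    // "e" + COMBINING ACUTE ACCENT composes to the single character U+00E9.
    System.out.println(nfc("e\u0301").equals("\u00E9"));
    // Fullwidth "ABC" (U+FF21..U+FF23) maps to ASCII "ABC" under NFKC.
    System.out.println(nfkc("\uFF21\uFF22\uFF23"));
    // Two orderings of the same non-spacing marks normalize to the same string.
    System.out.println(nfc("a\u0301\u0323").equals(nfc("a\u0323\u0301")));
  }
}
```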
 <hr/>
 <h1><a name="casefolding">Case Folding</a></h1>
 <p>
 Default caseless matching, or case-folding, is more than just conversion to
 lowercase. For example, it handles cases such as the Greek sigma, so that
 "Μάϊος" and "ΜΆΪΟΣ" will match correctly.
 </p>
 <p>
 Case-folding is still only an approximation of the language-specific rules
 governing case. If the specific language is known, consider using
 ICUCollationKeyFilter and indexing collation keys instead. This implementation
 performs the "full" case-folding specified in the Unicode standard, and this
 may change the length of the term. For example, the German ß is case-folded
 to the string 'ss'.
 </p>
 <p>
 Case folding is related to normalization, and as such is coupled with it in
 this integration. To perform case-folding, you use normalization with the form
 "nfkc_cf" (which is the default).
 </p>
 <h2>Use Cases</h2>
 <ul>
   <li>
     As a more thorough replacement for LowerCaseFilter that has good behavior
     for most languages.
   </li>
 </ul>
 <h2>Example Usages</h2>
 <h3>Lowercasing text</h3>
 <pre class="prettyprint">
   /**
    * This filter will case-fold and normalize to NFKC.
    */
   TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer);
 </pre>
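 <p>
 The difference between lowercasing and case-folding can be seen with plain JDK
 string methods. The JDK has no case-folding API, so the sketch below
 approximates it by uppercasing and then lowercasing; this is an assumption made
 for illustration and is not the "nfkc_cf" form that
 <code>ICUNormalizer2Filter</code> applies. The class name is hypothetical.
 </p>

```java
import java.util.Locale;

// Rough approximation only: upper-then-lower, not Unicode's NFKC_Casefold.
class CaseFoldSketch {
  /** Uppercases (which expands sharp s to "SS") and then lowercases the result. */
  static String approxFold(String s) {
    return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    String lower = "Stra\u00DFe"; // "Strasse" spelled with sharp s
    String upper = "STRASSE";
    // Plain lowercasing keeps the sharp s, so the two spellings do not match.
    System.out.println(lower.toLowerCase(Locale.ROOT).equals(upper.toLowerCase(Locale.ROOT)));
    // After the approximate fold both become "strasse"; the term length changed.
    System.out.println(approxFold(lower).equals(approxFold(upper)));
  }
}
```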
 <hr/>
 <h1><a name="searchfolding">Search Term Folding</a></h1>
 <p>
 Search term folding removes distinctions (such as accent marks) between
 similar characters. It is useful for a fuzzy or loose search.
 </p>
 <p>
 Search term folding implements many of the foldings specified in
 <a href="http://www.unicode.org/reports/tr30/tr30-4.html">Character Foldings</a>
 as a special normalization form.  This folding applies NFKC, Case Folding, and
 many character foldings recursively.
 </p>
 <h2>Use Cases</h2>
 <ul>
   <li>
     As a more thorough replacement for ASCIIFoldingFilter and LowerCaseFilter
     that applies the same ideas to many more languages.
   </li>
 </ul>
 <h2>Example Usages</h2>
 <h3>Removing accents</h3>
 <pre class="prettyprint">
   /**
    * This filter will case-fold, remove accents and other distinctions, and
    * normalize to NFKC.
    */
   TokenStream tokenstream = new ICUFoldingFilter(tokenizer);
 </pre>
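 <p>
 A much smaller version of the same idea can be written with the JDK:
 decompose, drop the combining marks, and lowercase. This sketch is only an
 approximation for illustration and covers far fewer cases than
 <code>ICUFoldingFilter</code>; for instance, letters such as ø that have no
 Unicode decomposition are left unchanged. The class name is hypothetical.
 </p>

```java
import java.text.Normalizer;
import java.util.Locale;

// Simplified approximation only: not the folding data used by ICUFoldingFilter.
class AccentFoldSketch {
  static String fold(String s) {
    // Compatibility decomposition separates base letters from their marks.
    String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
    // Remove every combining mark (Unicode general category M).
    String stripped = decomposed.replaceAll("\\p{M}+", "");
    return stripped.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(fold("R\u00E9sum\u00E9"));       // prints resume
    System.out.println(fold("\u00C5ngstr\u00F6m"));     // prints angstrom
  }
}
```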
 <hr/>
 <h1><a name="transform">Text Transformation</a></h1>
 <p>
 ICU provides text-transformation functionality via its Transliteration API. This allows
 you to transform text in a variety of ways, taking context into account.
 </p>
 <p>
 For more information, see the
 <a href="http://userguide.icu-project.org/transforms/general">User's Guide</a>
 and
 <a href="http://userguide.icu-project.org/transforms/general/rules">Rule Tutorial</a>.
 </p>
 <h2>Use Cases</h2>
 <ul>
   <li>
     Convert Traditional Chinese to Simplified Chinese
   </li>
   <li>
     Transliterate between different writing systems, for example Romanization
   </li>
 </ul>
 <h2>Example Usages</h2>
 <h3>Convert Traditional to Simplified</h3>
 <pre class="prettyprint">
   /**
    * This filter will map Traditional Chinese to Simplified Chinese
    */
   TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Traditional-Simplified"));
 </pre>
 <h3>Transliterate Serbian Cyrillic to Serbian Latin</h3>
 <pre class="prettyprint">
   /**
    * This filter will map Serbian Cyrillic to Serbian Latin according to BGN rules
    */
   TokenStream tokenstream = new ICUTransformFilter(tokenizer, Transliterator.getInstance("Serbian-Latin/BGN"));
 </pre>
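 <p>
 To make the idea of transliteration concrete without ICU on the classpath, the
 toy sketch below maps a handful of Serbian Cyrillic letters to Latin with a
 lookup table. The table is a small hand-written excerpt and the class name is
 hypothetical; it is context-free and is not the rule set behind ICU's
 <code>Transliterator</code>, which handles full alphabets, casing, and context.
 </p>

```java
import java.util.Map;

// Toy illustration only: a ten-letter excerpt, not ICU's transliteration rules.
class SerbianLatinSketch {
  private static final Map<Character, String> TABLE = Map.of(
      '\u0411', "B",        // CYRILLIC CAPITAL LETTER BE
      '\u0435', "e",        // CYRILLIC SMALL LETTER IE
      '\u043E', "o",        // CYRILLIC SMALL LETTER O
      '\u0433', "g",        // CYRILLIC SMALL LETTER GHE
      '\u0440', "r",        // CYRILLIC SMALL LETTER ER
      '\u0430', "a",        // CYRILLIC SMALL LETTER A
      '\u0434', "d",        // CYRILLIC SMALL LETTER DE
      '\u0459', "lj",       // CYRILLIC SMALL LETTER LJE, one letter becomes two
      '\u045A', "nj",       // CYRILLIC SMALL LETTER NJE
      '\u045F', "d\u017E"); // CYRILLIC SMALL LETTER DZHE

  /** Replaces each known letter; anything not in the table passes through unchanged. */
  static String toLatin(String text) {
    StringBuilder out = new StringBuilder();
    for (char c : text.toCharArray()) {
      out.append(TABLE.getOrDefault(c, String.valueOf(c)));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // U+0411 U+0435 U+043E U+0433 U+0440 U+0430 U+0434 is "Beograd" in Cyrillic.
    System.out.println(toLatin("\u0411\u0435\u043E\u0433\u0440\u0430\u0434"));
  }
}
```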
 <hr/>
 <h1><a name="backcompat">Backwards Compatibility</a></h1>
 <p>
 This module exists to provide up-to-date Unicode functionality that supports
 the most recent version of Unicode (currently 6.3). Users who need stronger
 backwards compatibility can restrict
 {@link org.apache.lucene.analysis.icu.ICUNormalizer2Filter} to operate on only
 a specific Unicode Version by using a {@link com.ibm.icu.text.FilteredNormalizer2}.
 </p>
 <h2>Example Usages</h2>
 <h3>Restricting normalization to Unicode 5.0</h3>
 <pre class="prettyprint">
   /**
    * This filter will do NFC normalization, but will ignore any characters that
    * did not exist as of Unicode 5.0. Because of the normalization stability policy
    * of Unicode, this is an easy way to force normalization to a specific version.
    */
     Normalizer2 normalizer = Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE);
     UnicodeSet set = new UnicodeSet("[:age=5.0:]");
     // see FilteredNormalizer2 docs, the set should be frozen or performance will suffer
     set.freeze();
     FilteredNormalizer2 unicode50 = new FilteredNormalizer2(normalizer, set);
     TokenStream tokenstream = new ICUNormalizer2Filter(tokenizer, unicode50);
 </pre>
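 <p>
 The filtering idea can be sketched with the JDK alone: split the text into runs
 of characters that are inside or outside a set, normalize only the runs inside
 the set, and copy the rest through untouched. The code below is an
 illustration under that assumption and is not ICU's
 <code>FilteredNormalizer2</code> implementation; the class name is hypothetical
 and an arbitrary <code>IntPredicate</code> stands in for a
 <code>UnicodeSet</code> such as <code>[:age=5.0:]</code>.
 </p>

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

// Illustrative sketch only: a predicate stands in for ICU's UnicodeSet.
class FilteredNormalizerSketch {
  /** Applies NFC to runs of code points accepted by inSet and copies all other runs as-is. */
  static String normalizeFiltered(String text, IntPredicate inSet) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < text.length()) {
      boolean member = inSet.test(text.codePointAt(i));
      int j = i;
      // Extend the run while membership stays the same.
      while (j < text.length() && inSet.test(text.codePointAt(j)) == member) {
        j += Character.charCount(text.codePointAt(j));
      }
      String run = text.substring(i, j);
      out.append(member ? Normalizer.normalize(run, Normalizer.Form.NFC) : run);
      i = j;
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // With every character in the set, "e" + U+0301 composes to U+00E9.
    System.out.println(normalizeFiltered("e\u0301", cp -> true).length());          // prints 1
    // With U+0301 excluded, the accent is skipped and the text is left as two characters.
    System.out.println(normalizeFiltered("e\u0301", cp -> cp != 0x0301).length());  // prints 2
  }
}
```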
 </body>
 </html>