| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Analysis.Compound |
| | Apache Lucene.NET 4.8.0-beta00010 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Analysis.Compound |
| | Apache Lucene.NET 4.8.0-beta00010 Documentation "> |
| <meta name="generator" content="docfx 2.56.0.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="analysis-common/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="/"> |
| <img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| </header> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Compound"> |
| |
| <h1 id="Lucene_Net_Analysis_Compound" data-uid="Lucene.Net.Analysis.Compound" class="text-break">Namespace Lucene.Net.Analysis.Compound |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>A filter that decomposes the compound words found in many Germanic |
| languages into their word parts. The following example shows what it does: |
| <table border="1"> |
| <tr> |
| <th>Input token stream</th> |
| </tr> |
| <tr> |
| <td>Rindfleischüberwachungsgesetz Drahtschere abba</td> |
| </tr> |
| </table></p> |
| <table border="1"> |
| <tr> |
| <th>Output token stream</th> |
| </tr> |
| <tr> |
| <td>(Rindfleischüberwachungsgesetz,0,29)</td> |
| </tr> |
| <tr> |
| <td>(Rind,0,4,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(fleisch,4,11,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(überwachung,11,22,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(gesetz,23,29,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(Drahtschere,30,41)</td> |
| </tr> |
| <tr> |
| <td>(Draht,30,35,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(schere,35,41,posIncr=0)</td> |
| </tr> |
| <tr> |
| <td>(abba,42,46)</td> |
| </tr> |
| </table> |
| |
| <p>The input token is always preserved and the filters do not alter the case of word parts. There are two variants of the |
| filter available:</p> |
| <ul> |
| <li><p><em>HyphenationCompoundWordTokenFilter</em>: uses a |
| hyphenation-grammar-based approach to find potential word parts of a |
| given word.</p> |
| </li> |
| <li><p><em>DictionaryCompoundWordTokenFilter</em>: uses a |
| brute-force, dictionary-only approach to find the word parts of a given |
| word.</p> |
| </li> |
| </ul> |
| <h3 id="compound-word-token-filters">Compound word token filters</h3> |
| <h4 id="hyphenationcompoundwordtokenfilter">HyphenationCompoundWordTokenFilter</h4> |
| <p>The <a class="xref" href="Lucene.Net.Analysis.Compound.HyphenationCompoundWordTokenFilter.html"> |
| HyphenationCompoundWordTokenFilter</a> uses hyphenation grammars to find |
| potential subwords that are worth checking against the dictionary. It can also be used |
| without a dictionary, but it then produces many "nonword" tokens. |
| The quality of the output tokens depends directly on the quality of the |
| grammar file you use. The grammars for languages like German are quite good.</p> |
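| <p>The idea can be illustrated with a minimal, self-contained sketch. It is not the Lucene.NET implementation: the class name <code>HyphenationPointSketch</code>, the method <code>candidates</code>, and the hard-coded hyphenation points are all hypothetical, and a real hyphenator would derive the points from a grammar file. Only substrings that start and end at hyphenation points are considered, and passing <code>null</code> as the dictionary shows why dictionary-free use yields many "nonword" tokens.</p> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only; not the Lucene.NET HyphenationCompoundWordTokenFilter. */
public class HyphenationPointSketch {

    /**
     * Returns every substring that starts and ends at a hyphenation point,
     * has a length in [minSubword, maxSubword], and is in the dictionary.
     * A null dictionary accepts every candidate.
     *
     * @param points ascending character positions, assumed to come from a hyphenator
     */
    public static List<String> candidates(String word, int[] points,
                                          Set<String> lowerCaseDict,
                                          int minSubword, int maxSubword) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                int len = points[j] - points[i];
                if (len > maxSubword) {
                    break; // points are ascending, so later ones are longer still
                }
                if (len < minSubword) {
                    continue;
                }
                String candidate = word.substring(points[i], points[j]);
                if (lowerCaseDict == null
                        || lowerCaseDict.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // Hypothetical hyphenation points for "Draht-sche-re".
        int[] points = {0, 5, 8, 11};
        System.out.println(candidates("Drahtschere", points, Set.of("draht", "schere"), 2, 15));
        System.out.println(candidates("Drahtschere", points, null, 2, 15));
    }
}
```

| <p>With the dictionary, only <em>Draht</em> and <em>schere</em> survive. Without it, fragments such as <em>sch</em> and <em>ere</em> are emitted as well.</p> |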
| <h5 id="grammar-file">Grammar file</h5> |
| <p>Unfortunately, we cannot bundle the hyphenation grammar files with Lucene |
| because they do not use an ASF-compatible license (they use the LaTeX |
| Project Public License instead). You can find the XML-based grammar |
| files at the |
| <a href="http://offo.sourceforge.net/hyphenation/index.html">Objects |
| For Formatting Objects</a> |
| (OFFO) SourceForge project (direct link to download the pattern files: |
| <a href="http://downloads.sourceforge.net/offo/offo-hyphenation.zip">http://downloads.sourceforge.net/offo/offo-hyphenation.zip</a>). |
| The files you need are in the subfolder |
| <em>offo-hyphenation/hyph/</em>.</p> |
| <p>Credits for the hyphenation code go to the |
| <a href="http://xmlgraphics.apache.org/fop/">Apache FOP project</a>.</p> |
| <h4 id="dictionarycompoundwordtokenfilter">DictionaryCompoundWordTokenFilter</h4> |
| <p>The <a class="xref" href="Lucene.Net.Analysis.Compound.DictionaryCompoundWordTokenFilter.html"> |
| DictionaryCompoundWordTokenFilter</a> uses a dictionary-only approach to |
| find subwords in a compound word. It is much slower than the filter that |
| uses hyphenation grammars. Because its design is much simpler, it is a good starting point for |
| checking whether your dictionary is good enough.</p> |
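| <p>A brute-force, dictionary-only decomposition can be sketched as follows. This is an illustration under stated assumptions, not the Lucene.NET code: the class <code>BruteForceDecompounder</code> and its <code>decompose</code> method are hypothetical names, the dictionary is assumed to hold lower-cased entries, and offsets are relative to the start of the word rather than to the whole input. Every substring within the size limits is tested, which is why this approach is slow.</p> |

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only; not the Lucene.NET DictionaryCompoundWordTokenFilter. */
public class BruteForceDecompounder {

    /**
     * Tests every substring whose length lies in [minSubword, maxSubword]
     * against the dictionary and reports matches as "(text,start,end)".
     * The original casing of each part is preserved in the output.
     */
    public static List<String> decompose(String word, Set<String> lowerCaseDict,
                                         int minSubword, int maxSubword) {
        List<String> parts = new ArrayList<>();
        for (int start = 0; start + minSubword <= word.length(); start++) {
            for (int len = minSubword;
                 len <= maxSubword && start + len <= word.length();
                 len++) {
                String candidate = word.substring(start, start + len);
                // Case-insensitive lookup; the dictionary holds lower-case entries.
                if (lowerCaseDict.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add("(" + candidate + "," + start + "," + (start + len) + ")");
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("draht", "schere");
        System.out.println(decompose("Drahtschere", dict, 2, 15));
        System.out.println(decompose("abba", dict, 2, 15));
    }
}
```

| <p>In this sketch the number of dictionary lookups grows with the word length times the range of allowed subword lengths, whereas a hyphenation-based filter only tests substrings that begin and end at hyphenation points.</p> |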
| <h3 id="dictionary">Dictionary</h3> |
| <p>The output quality of both token filters depends directly on the |
| quality of the dictionary you use. Dictionaries are language dependent, of course. |
| You should always use a dictionary |
| that fits the text you want to index. If you index medical text, for |
| example, you should use a dictionary that contains medical words. |
| A good starting point for general text is the set of dictionaries on the |
| <a href="http://wiki.services.openoffice.org/wiki/Dictionaries">OpenOffice |
| dictionaries</a> |
| Wiki.</p> |
| <h3 id="which-variant-should-i-use">Which variant should I use?</h3> |
| <p>This decision matrix should help you: |
| <table border="1"> |
| <tr> |
| <th>Token filter</th> |
| <th>Output quality</th> |
| <th>Performance</th> |
| </tr> |
| <tr> |
| <td>HyphenationCompoundWordTokenFilter</td> |
| <td>good if grammar file is good – acceptable otherwise</td> |
| <td>fast</td> |
| </tr> |
| <tr> |
| <td>DictionaryCompoundWordTokenFilter</td> |
| <td>good</td> |
| <td>slow</td> |
| </tr> |
| </table></p> |
| <h3 id="examples">Examples</h3> |
| <pre><code> public void testHyphenationCompoundWordsDE() throws Exception { |
| String[] dict = { "Rind", "Fleisch", "Draht", "Schere", "Gesetz", |
| "Aufgabe", "Überwachung" }; |
| |
| Reader reader = new FileReader("de_DR.xml"); |
| |
| HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter |
| .getHyphenationTree(reader); |
| |
| HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter( |
| new WhitespaceTokenizer(new StringReader( |
| "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator, |
| dict, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE, |
| CompoundWordTokenFilterBase.DEFAULT_MIN_SUBWORD_SIZE, |
| CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false); |
| |
| CharTermAttribute t = tf.addAttribute(CharTermAttribute.class); |
| while (tf.incrementToken()) { |
| System.out.println(t); |
| } |
| } |
| |
| public void testHyphenationCompoundWordsWithoutDictionaryDE() throws Exception { |
| Reader reader = new FileReader("de_DR.xml"); |
| |
| HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter |
| .getHyphenationTree(reader); |
| |
| HyphenationCompoundWordTokenFilter tf = new HyphenationCompoundWordTokenFilter( |
| new WhitespaceTokenizer(new StringReader( |
| "Rindfleischüberwachungsgesetz Drahtschere abba")), hyphenator); |
| |
| CharTermAttribute t = tf.addAttribute(CharTermAttribute.class); |
| while (tf.incrementToken()) { |
| System.out.println(t); |
| } |
| } |
| |
| public void testDumbCompoundWordsSE() throws Exception { |
| String[] dict = { "Bil", "Dörr", "Motor", "Tak", "Borr", "Slag", "Hammar", |
| "Pelar", "Glas", "Ögon", "Fodral", "Bas", "Fiol", "Makare", "Gesäll", |
| "Sko", "Vind", "Rute", "Torkare", "Blad" }; |
| |
| DictionaryCompoundWordTokenFilter tf = new DictionaryCompoundWordTokenFilter( |
| new WhitespaceTokenizer( |
| new StringReader( |
| "Bildörr Bilmotor Biltak Slagborr Hammarborr Pelarborr Glasögonfodral Basfiolsfodral Basfiolsfodralmakaregesäll Skomakare Vindrutetorkare Vindrutetorkarblad abba")), |
| dict); |
| CharTermAttribute t = tf.addAttribute(CharTermAttribute.class); |
| while (tf.incrementToken()) { |
| System.out.println(t); |
| } |
| } |
| </code></pre></div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.html">CompoundWordTokenFilterBase</a></h4> |
| <section><p>Base class for decomposition token filters. |
| <p> |
| You must specify the required <span class="xref">Lucene.Net.Util.LuceneVersion</span> compatibility when creating |
| <a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.html">CompoundWordTokenFilterBase</a>: |
| <ul><li>As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 |
| supplementary characters in strings and char arrays provided as compound word |
| dictionaries.</li><li>As of 4.4, <a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.html">CompoundWordTokenFilterBase</a> doesn't update offsets.</li></ul></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.CompoundToken.html">CompoundWordTokenFilterBase.CompoundToken</a></h4> |
| <section><p>Helper class to hold decompounded token information</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.DictionaryCompoundWordTokenFilter.html">DictionaryCompoundWordTokenFilter</a></h4> |
| <section><p>A <span class="xref">Lucene.Net.Analysis.TokenFilter</span> that decomposes compound words found in many Germanic languages. |
| <p> |
| "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find |
| "Donaudampfschiff" even when you only enter "schiff". |
| It uses a brute-force algorithm to achieve this. |
| </p> |
| <p> |
| You must specify the required <span class="xref">Lucene.Net.Util.LuceneVersion</span> compatibility when creating |
| <a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.html">CompoundWordTokenFilterBase</a>: |
| <ul><li>As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 |
| supplementary characters in strings and char arrays provided as compound word |
| dictionaries.</li></ul> |
| </p></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.DictionaryCompoundWordTokenFilterFactory.html">DictionaryCompoundWordTokenFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Compound.DictionaryCompoundWordTokenFilter.html">DictionaryCompoundWordTokenFilter</a>. </p> |
| <pre><code><fieldType name="text_dictcomp" class="solr.TextField" positionIncrementGap="100"> |
| <analyzer> |
| <tokenizer class="solr.WhitespaceTokenizerFactory"/> |
| <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" |
| minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="true"/> |
| </analyzer> |
| </fieldType></code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.HyphenationCompoundWordTokenFilter.html">HyphenationCompoundWordTokenFilter</a></h4> |
| <section><p>A <span class="xref">Lucene.Net.Analysis.TokenFilter</span> that decomposes compound words found in many Germanic languages. |
| <p> |
| "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find |
| "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation |
| grammar and a word dictionary to achieve this. |
| </p> |
| <p> |
| You must specify the required <span class="xref">Lucene.Net.Util.LuceneVersion</span> compatibility when creating |
| <a class="xref" href="Lucene.Net.Analysis.Compound.CompoundWordTokenFilterBase.html">CompoundWordTokenFilterBase</a>: |
| <ul><li>As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 |
| supplementary characters in strings and char arrays provided as compound word |
| dictionaries.</li></ul> |
| </p></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Compound.HyphenationCompoundWordTokenFilterFactory.html">HyphenationCompoundWordTokenFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Compound.HyphenationCompoundWordTokenFilter.html">HyphenationCompoundWordTokenFilter</a>. |
| <p> |
| This factory accepts the following parameters: |
| <ul><li><code>hyphenator</code> (mandatory): path to the FOP XML hyphenation pattern. |
| See <a href="http://offo.sourceforge.net/hyphenation/">http://offo.sourceforge.net/hyphenation/</a>.</li> |
| <li><code>encoding</code> (optional): encoding of the XML hyphenation file. Defaults to UTF-8.</li> |
| <li><code>dictionary</code> (optional): dictionary of words. Defaults to no dictionary.</li> |
| <li><code>minWordSize</code> (optional): minimum length of a word that gets decomposed. Defaults to 5.</li> |
| <li><code>minSubwordSize</code> (optional): minimum length of subwords. Defaults to 2.</li> |
| <li><code>maxSubwordSize</code> (optional): maximum length of subwords. Defaults to 15.</li> |
| <li><code>onlyLongestMatch</code> (optional): if true, adds only the longest matching subword |
| to the stream. Defaults to false.</li></ul> |
| <p> |
| <pre><code><fieldType name="text_hyphncomp" class="solr.TextField" positionIncrementGap="100"> |
| <analyzer> |
| <tokenizer class="solr.WhitespaceTokenizerFactory"/> |
| <filter class="solr.HyphenationCompoundWordTokenFilterFactory" hyphenator="hyphenator.xml" encoding="UTF-8" |
| dictionary="dictionary.txt" minWordSize="5" minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/> |
| </analyzer> |
| </fieldType></code></pre> |
| <p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |