| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Analysis.NGram |
| | Apache Lucene.NET 4.8.0-beta00014 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Analysis.NGram |
| | Apache Lucene.NET 4.8.0-beta00014 Documentation "> |
| <meta name="generator" content="docfx 2.56.2.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="analysis-common/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="/"> |
| <img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search"> |
| <ul class="level0 breadcrumb"> |
| <li> |
| <a href="https://lucenenet.apache.org/docs/4.8.0-beta00014/">API</a> |
| <span id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </span> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.NGram"> |
| |
| <h1 id="Lucene_Net_Analysis_NGram" data-uid="Lucene.Net.Analysis.NGram" class="text-break">Namespace Lucene.Net.Analysis.NGram |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Character n-gram tokenizers and filters.</p> |
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramFilterFactory.html">EdgeNGramFilterFactory</a></h4> |
| <section><p>Creates new instances of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.html">EdgeNGramTokenFilter</a>.</p> |
| <pre><code>&lt;fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt; |
|     &lt;filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.html">EdgeNGramTokenFilter</a></h4> |
| <section><p>Tokenizes the given token into n-grams of the given size(s).</p> |
| <p> |
| This <span class="xref">Lucene.Net.Analysis.TokenFilter</span> creates n-grams from the beginning edge or ending edge of an input token. |
| </p> |
| <p>As of Lucene 4.4, this filter does not support |
| <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.Side.html#Lucene_Net_Analysis_NGram_EdgeNGramTokenFilter_Side_BACK">BACK</a> (you can use <a class="xref" href="Lucene.Net.Analysis.Reverse.ReverseStringFilter.html">ReverseStringFilter</a> up-front and |
| afterward to get the same behavior), handles supplementary characters |
| correctly, and no longer updates offsets. |
| </p> |
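| <p>The reverse-then-reverse-again trick mentioned above can be illustrated with a minimal, library-independent sketch. It is written in Python purely for illustration and does not use the Lucene.NET API; the function names <code>edge_ngrams</code> and <code>back_edge_ngrams</code> are hypothetical, and dropping tokens shorter than the minimum gram size is an assumption of this sketch.</p> |

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Return the prefixes of `token` whose length is in [min_gram, max_gram]."""
    # Python strings index by code point, so a supplementary character
    # is never split in the middle.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def back_edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Emulate back-edge grams: reverse, take front grams, reverse each gram."""
    reversed_grams = edge_ngrams(token[::-1], min_gram, max_gram)
    return [gram[::-1] for gram in reversed_grams]


print(edge_ngrams("lucene", 1, 3))       # ['l', 'lu', 'luc']
print(back_edge_ngrams("lucene", 1, 3))  # ['e', 'ne', 'ene']
```

| <p>In an actual analysis chain, the two reversals would be separate filters placed before and after the edge n-gram filter.</p> |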
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a></h4> |
| <section><p>Tokenizes the input from an edge into n-grams of the given size(s).</p> |
| <p> |
| This <span class="xref">Lucene.Net.Analysis.Tokenizer</span> creates n-grams from the beginning edge or ending edge of an input token. |
| </p> |
| <p>As of Lucene 4.4, this tokenizer:</p> |
| <ul> |
| <li>can handle <code>maxGram</code> larger than 1024 chars, but beware that this will result in increased memory usage,</li> |
| <li>doesn't trim the input,</li> |
| <li>sets the position increment of every token to 1, instead of 1 for the first token and 0 for all the others,</li> |
| <li>doesn't support backward n-grams anymore,</li> |
| <li>supports <a class="xref" href="Lucene.Net.Analysis.Util.CharTokenizer.html#Lucene_Net_Analysis_Util_CharTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a> pre-tokenization,</li> |
| <li>correctly handles supplementary characters.</li> |
| </ul> |
| <p>Although <strong>highly</strong> discouraged, it is still possible |
| to use the old behavior through <a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.html">Lucene43EdgeNGramTokenizer</a>. |
| </p> |
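| <p>As a rough illustration of the behavior listed above (pre-tokenization, untrimmed input, a position increment of 1 for every gram, and code-point counting), here is a small Python sketch. It is not the Lucene.NET implementation; <code>pre_tokenize</code>, <code>edge_ngram_tokens</code> and the <code>is_token_char</code> callback are hypothetical names standing in for the tokenizer and its <code>IsTokenChar(Int32)</code> hook.</p> |

```python
def pre_tokenize(text, is_token_char):
    """Split `text` into maximal runs of characters accepted by `is_token_char`."""
    tokens, current = [], []
    for ch in text:  # iterates code points, not UTF-16 code units
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def edge_ngram_tokens(text, min_gram, max_gram, is_token_char=lambda ch: True):
    """Return (term, position_increment) pairs for the front edge of each run."""
    result = []
    for token in pre_tokenize(text, is_token_char):
        for n in range(min_gram, min(max_gram, len(token)) + 1):
            # Every emitted gram advances the position by 1.
            result.append((token[:n], 1))
    return result


# With letters as token chars, each word gets its own edge n-grams.
print(edge_ngram_tokens("ab cd", 1, 2, str.isalpha))
# Without pre-tokenization the input is not trimmed, so the space is kept.
print(edge_ngram_tokens(" ab", 1, 2))
# U+1D11E is one code point (two UTF-16 code units) and stays whole.
print(edge_ngram_tokens("\U0001D11Ex", 1, 1))
```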
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizerFactory.html">EdgeNGramTokenizerFactory</a></h4> |
| <section><p>Creates new instances of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a>.</p> |
| <pre><code>&lt;fieldType name="text_edgngrm" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="1" maxGramSize="1"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.html">Lucene43EdgeNGramTokenizer</a></h4> |
| <section><p>Old version of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a> which doesn't handle |
| supplementary characters correctly.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43NGramTokenizer.html">Lucene43NGramTokenizer</a></h4> |
| <section><p>Old broken version of <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramFilterFactory.html">NGramFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>.</p> |
| <pre><code>&lt;fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.WhitespaceTokenizerFactory"/&gt; |
|     &lt;filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="2"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a></h4> |
| <section><p>Tokenizes the input into n-grams of the given size(s).</p> |
| <p>You must specify the required <span class="xref">Lucene.Net.Util.LuceneVersion</span> compatibility when |
| creating an <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>. As of Lucene 4.4, this token filter:</p> |
| <ul> |
| <li>handles supplementary characters correctly,</li> |
| <li>emits all n-grams for the same token at the same position,</li> |
| <li>does not modify offsets,</li> |
| <li>sorts n-grams by their offset in the original token first, then by |
| increasing length (meaning that "abc" will give "a", "ab", "abc", "b", "bc", |
| "c").</li> |
| </ul> |
| <p>You can make this filter use the old behavior by providing a version earlier than |
| <a class="xref" href="https://lucenenet.apache.org/docs/4.8.0-beta00014/api/core/Lucene.Net.Util.LuceneVersion.html#Lucene_Net_Util_LuceneVersion_LUCENE_44">LUCENE_44</a> in the constructor, but this is not recommended as |
| it will lead to broken <span class="xref">Lucene.Net.Analysis.TokenStream</span>s that will cause highlighting |
| bugs. |
| </p> |
| <p>If you were using this <span class="xref">Lucene.Net.Analysis.TokenFilter</span> to perform partial highlighting, |
| this won't work anymore since this filter doesn't update offsets. You should |
| modify your analysis chain to use <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>, and potentially |
| override <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html#Lucene_Net_Analysis_NGram_NGramTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a> to perform pre-tokenization. |
| </p> |
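| <p>The ordering and offset rules described above can be sketched in a few lines of Python. This is an illustration only, not the Lucene.NET code: <code>ngram_filter</code> is a hypothetical name, and modeling "same position" as a position increment of 1 for the first gram and 0 for the rest assumes the incoming token itself has an increment of 1.</p> |

```python
def ngram_filter(token, min_gram, max_gram, token_start=0):
    """Return (term, position_increment, (start_offset, end_offset)) tuples."""
    token_end = token_start + len(token)
    rows = []
    for start in range(len(token)):                    # offset in the token first
        for length in range(min_gram, max_gram + 1):   # then increasing length
            if start + length > len(token):
                break
            # Only the first gram advances the position; the others share it.
            pos_inc = 1 if not rows else 0
            # Offsets are not modified: every gram reports the whole token.
            rows.append((token[start:start + length], pos_inc, (token_start, token_end)))
    return rows


for row in ngram_filter("abc", 1, 3):
    print(row)
# Terms come out as: a, ab, abc, b, bc, c
```

| <p>Because every gram carries the offsets of the whole token, a highlighter working from this output would mark the full token rather than the matching fragment, which is why partial highlighting no longer works with the filter.</p> |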
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a></h4> |
| <section><p>Tokenizes the input into n-grams of the given size(s).</p> |
| <p>Unlike <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>, this class sets offsets so |
| that characters between startOffset and endOffset in the original stream are |
| the same as the term chars. |
| </p> |
| <p>For example, "abcde" would be tokenized as follows (minGram=2, maxGram=3):</p> |
| <table><thead><tr><th>Term</th><th>Position increment</th><th>Position length</th><th>Offsets</th></tr></thead> |
| <tbody> |
| <tr><td>ab</td><td>1</td><td>1</td><td>[0,2[</td></tr> |
| <tr><td>abc</td><td>1</td><td>1</td><td>[0,3[</td></tr> |
| <tr><td>bc</td><td>1</td><td>1</td><td>[1,3[</td></tr> |
| <tr><td>bcd</td><td>1</td><td>1</td><td>[1,4[</td></tr> |
| <tr><td>cd</td><td>1</td><td>1</td><td>[2,4[</td></tr> |
| <tr><td>cde</td><td>1</td><td>1</td><td>[2,5[</td></tr> |
| <tr><td>de</td><td>1</td><td>1</td><td>[3,5[</td></tr> |
| </tbody></table> |
| <p>This tokenizer changed a lot in Lucene 4.4 in order to:</p> |
| <ul> |
| <li>tokenize in a streaming fashion to support streams which are larger |
| than 1024 chars (limit of the previous version),</li> |
| <li>count grams based on Unicode code points instead of UTF-16 chars (and |
| never split in the middle of surrogate pairs),</li> |
| <li>give the ability to pre-tokenize the stream (<a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html#Lucene_Net_Analysis_NGram_NGramTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a>) |
| before computing n-grams.</li> |
| </ul> |
| <p>Additionally, this class doesn't trim trailing whitespace and emits |
| tokens in a different order: tokens are now emitted by increasing start |
| offsets, whereas they used to be emitted by increasing lengths (which prevented |
| support for large input streams). |
| </p> |
| <p>Although <strong>highly</strong> discouraged, it is still possible |
| to use the old behavior through <a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43NGramTokenizer.html">Lucene43NGramTokenizer</a>. |
| </p> |
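| <p>The "abcde" example above, and the change in emission order, can be reproduced with a short Python sketch. It only mimics the documented output and is not how the tokenizer is implemented (the real class works on a stream); <code>ngram_tokenizer</code> is a hypothetical name, and the "old order" line is an approximation obtained by sorting on length.</p> |

```python
def ngram_tokenizer(text, min_gram, max_gram):
    """Return (term, position_increment, position_length, (start, end)) rows."""
    rows = []
    for start in range(len(text)):                     # increasing start offset
        for length in range(min_gram, max_gram + 1):   # then increasing length
            end = start + length
            if end > len(text):
                break
            # Offsets delimit exactly the characters of the term. Python indexes
            # code points; for text in the Basic Multilingual Plane these match
            # UTF-16 offsets.
            rows.append((text[start:end], 1, 1, (start, end)))
    return rows


rows = ngram_tokenizer("abcde", 2, 3)
for term, pos_inc, pos_len, (start, end) in rows:
    print(f"{term}\t{pos_inc}\t{pos_len}\t[{start},{end}[")

# Approximation of the pre-4.4 order: all shorter grams before longer ones.
old_order = sorted(rows, key=lambda row: (len(row[0]), row[3][0]))
print([row[0] for row in old_order])
```

| <p>Emitting by start offset only needs a window of about <code>maxGram</code> characters at a time, whereas emitting all grams of one length before the next requires the whole input to be available first.</p> |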
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizerFactory.html">NGramTokenizerFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>.</p> |
| <pre><code>&lt;fieldType name="text_ngrm" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="2"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h3 id="enums">Enums |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.Side.html">EdgeNGramTokenFilter.Side</a></h4> |
| <section><p>Specifies which side of the input the n-grams should be generated from.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.Side.html">Lucene43EdgeNGramTokenizer.Side</a></h4> |
| <section><p>Specifies which side of the input the n-grams should be generated from.</p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2021 The Apache Software Foundation, Licensed under the <a href='http://www.apache.org/licenses/LICENSE-2.0' target='_blank'>Apache License, Version 2.0</a><br> <small>Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation. <br>All other marks mentioned may be trademarks or registered trademarks of their respective owners.</small> |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |