<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Analysis.NGram
| Apache Lucene.NET 4.8.0-beta00013 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Analysis.NGram
| Apache Lucene.NET 4.8.0-beta00013 Documentation ">
<meta name="generator" content="docfx 2.56.2.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="analysis-common/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<span id="forkongithub"><a href="https://github.com/apache/lucenenet" target="_blank">Fork me on GitHub</a></span>
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.NGram">
<h1 id="Lucene_Net_Analysis_NGram" data-uid="Lucene.Net.Analysis.NGram" class="text-break">Namespace Lucene.Net.Analysis.NGram
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Character n-gram tokenizers and filters.</p>
</div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramFilterFactory.html">EdgeNGramFilterFactory</a></h4>
<section><p>Creates new instances of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.html">EdgeNGramTokenFilter</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_edgngrm&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.WhitespaceTokenizerFactory&quot;/>
&lt;filter class=&quot;solr.EdgeNGramFilterFactory&quot; minGramSize=&quot;1&quot; maxGramSize=&quot;1&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.html">EdgeNGramTokenFilter</a></h4>
<section><p>Tokenizes the given token into n-grams of given size(s).</p>
<p>
This <span class="xref">Lucene.Net.Analysis.TokenFilter</span> creates n-grams from the beginning edge or ending edge of an input token.
</p>
<p>As of Lucene 4.4, this filter does not support
<a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.Side.html#Lucene_Net_Analysis_NGram_EdgeNGramTokenFilter_Side_BACK">BACK</a> (you can use <a class="xref" href="Lucene.Net.Analysis.Reverse.ReverseStringFilter.html">ReverseStringFilter</a> up-front and
afterward to get the same behavior), handles supplementary characters
correctly, and no longer updates offsets.
</p>
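<p>To make the edge behavior concrete, here is a minimal Python sketch of the idea. It is not the Lucene.NET API: the helper names <code>edge_ngrams</code> and <code>back_edge_ngrams</code> are hypothetical, and the second one only imitates the "reverse, filter, reverse again" arrangement suggested above for back-edge grams.</p>

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Front-edge n-grams of one token, shortest first.

    Python strings index by Unicode code point, so a supplementary
    character is never cut in half by the slices below.
    """
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def back_edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Back-edge n-grams obtained by reversing before and after."""
    reversed_grams = edge_ngrams(token[::-1], min_gram, max_gram)
    return [gram[::-1] for gram in reversed_grams]


print(edge_ngrams("lucene", 1, 3))       # ['l', 'lu', 'luc']
print(back_edge_ngrams("lucene", 1, 3))  # ['e', 'ne', 'ene']
```

<p>In this sketch a token shorter than <code>min_gram</code> simply yields no grams.</p>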
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a></h4>
<section><p>Tokenizes the input from an edge into n-grams of given size(s).</p>
<p>
This <span class="xref">Lucene.Net.Analysis.Tokenizer</span> creates n-grams from the beginning edge or ending edge of an input token.
</p>
<p>As of Lucene 4.4, this tokenizer:</p>
<ul><li>can handle <code>maxGram</code> larger than 1024 chars, but beware that this will result in increased memory usage,</li>
<li>doesn&apos;t trim the input,</li>
<li>sets the position increment of every token to 1, where it previously used 1 for the first token and 0 for all other ones,</li>
<li>doesn&apos;t support backward n-grams anymore,</li>
<li>supports <a class="xref" href="Lucene.Net.Analysis.Util.CharTokenizer.html#Lucene_Net_Analysis_Util_CharTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a> pre-tokenization,</li>
<li>correctly handles supplementary characters.</li></ul>
<p>Although <strong>highly</strong> discouraged, it is still possible
to use the old behavior through <a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.html">Lucene43EdgeNGramTokenizer</a>.
</p>
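<p>The points about pre-tokenization, untrimmed input, and position increments can be illustrated with a small Python sketch. It is an approximation under stated assumptions rather than the library's implementation: <code>edge_ngram_tokenize</code> is a hypothetical name, the <code>is_token_char</code> callback stands in for <code>IsTokenChar(Int32)</code> but receives a one-character string instead of an integer code point, and edge grams are assumed to restart at the beginning of each run of token characters.</p>

```python
def edge_ngram_tokenize(text, min_gram, max_gram, is_token_char=lambda ch: True):
    """Return (term, position_increment) pairs for front-edge n-grams.

    The input is split into runs of characters accepted by is_token_char,
    and each run contributes its own edge n-grams.
    """
    tokens = []
    run = ""
    # The trailing None sentinel flushes the final run without a special case.
    for ch in list(text) + [None]:
        if ch is not None and is_token_char(ch):
            run += ch
            continue
        longest = min(max_gram, len(run))
        for n in range(min_gram, longest + 1):
            tokens.append((run[:n], 1))  # every gram gets increment 1
        run = ""
    return tokens


# Splitting on non-alphanumeric characters before computing grams.
print(edge_ngram_tokenize("ab cd", 1, 2, str.isalnum))
# [('a', 1), ('ab', 1), ('c', 1), ('cd', 1)]

# With the default callback nothing is trimmed, so the space is kept.
print(edge_ngram_tokenize(" ab", 1, 2))
# [(' ', 1), (' a', 1)]
```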
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizerFactory.html">EdgeNGramTokenizerFactory</a></h4>
<section><p>Creates new instances of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_edgngrm&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.EdgeNGramTokenizerFactory&quot; minGramSize=&quot;1&quot; maxGramSize=&quot;1&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.html">Lucene43EdgeNGramTokenizer</a></h4>
<section><p>Old version of <a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenizer.html">EdgeNGramTokenizer</a> which doesn&apos;t handle
supplementary characters correctly.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43NGramTokenizer.html">Lucene43NGramTokenizer</a></h4>
<section><p>Old broken version of <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramFilterFactory.html">NGramFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_ngrm&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.WhitespaceTokenizerFactory&quot;/>
&lt;filter class=&quot;solr.NGramFilterFactory&quot; minGramSize=&quot;1&quot; maxGramSize=&quot;2&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a></h4>
<section><p>Tokenizes the input into n-grams of the given size(s).</p>
<p>You must specify the required <span class="xref">Lucene.Net.Util.LuceneVersion</span> compatibility when
creating an <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>. As of Lucene 4.4, this token filter:</p>
<ul><li>handles supplementary characters correctly,</li>
<li>emits all n-grams for the same token at the same position,</li>
<li>does not modify offsets,</li>
<li>sorts n-grams by their offset in the original token first, then by
increasing length (meaning that &quot;abc&quot; will give &quot;a&quot;, &quot;ab&quot;, &quot;abc&quot;, &quot;b&quot;, &quot;bc&quot;,
&quot;c&quot;).</li></ul>
<p>You can make this filter use the old behavior by providing a version &lt;
<a class="xref" href="https://lucenenet.apache.org/docs/4.8.0-beta00013/api/core/Lucene.Net.Util.LuceneVersion.html#Lucene_Net_Util_LuceneVersion_LUCENE_44">LUCENE_44</a> in the constructor, but this is not recommended as
it will lead to broken <span class="xref">Lucene.Net.Analysis.TokenStream</span>s that will cause highlighting
bugs.
</p>
<p>If you were using this <span class="xref">Lucene.Net.Analysis.TokenFilter</span> to perform partial highlighting,
this won&apos;t work anymore since this filter doesn&apos;t update offsets. You should
modify your analysis chain to use <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>, and potentially
override <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html#Lucene_Net_Analysis_NGram_NGramTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a> to perform pre-tokenization.
</p>
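<p>The ordering rule and the unchanged offsets are easier to see in a short Python sketch. This is an illustration only, not the Lucene.NET API: <code>ngram_filter</code> is a hypothetical helper, and it measures lengths in Python code points rather than UTF-16 chars.</p>

```python
def ngram_filter(token, token_start, min_gram, max_gram):
    """Return (gram, start_offset, end_offset) triples for one token.

    Grams are ordered by their start inside the token, then by length.
    Every gram reports the offsets of the whole original token, which is
    why this kind of filter cannot drive partial highlighting.
    """
    token_end = token_start + len(token)
    grams = []
    for start in range(len(token)):
        longest = min(max_gram, len(token) - start)
        for n in range(min_gram, longest + 1):
            grams.append((token[start:start + n], token_start, token_end))
    return grams


for gram, start, end in ngram_filter("abc", 10, 1, 3):
    print(gram, start, end)
# a 10 13
# ab 10 13
# abc 10 13
# b 10 13
# bc 10 13
# c 10 13
```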
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a></h4>
<section><p>Tokenizes the input into n-grams of the given size(s).</p>
<p>Unlike <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenFilter.html">NGramTokenFilter</a>, this class sets offsets so
that characters between startOffset and endOffset in the original stream are
the same as the term chars.
</p>
<p>For example, &quot;abcde&quot; would be tokenized as follows (minGram=2, maxGram=3):</p>
<table><thead><tr><th>Term</th><th>Position increment</th><th>Position length</th><th>Offsets</th></tr></thead><tbody>
<tr><td>ab</td><td>1</td><td>1</td><td>[0,2[</td></tr>
<tr><td>abc</td><td>1</td><td>1</td><td>[0,3[</td></tr>
<tr><td>bc</td><td>1</td><td>1</td><td>[1,3[</td></tr>
<tr><td>bcd</td><td>1</td><td>1</td><td>[1,4[</td></tr>
<tr><td>cd</td><td>1</td><td>1</td><td>[2,4[</td></tr>
<tr><td>cde</td><td>1</td><td>1</td><td>[2,5[</td></tr>
<tr><td>de</td><td>1</td><td>1</td><td>[3,5[</td></tr>
</tbody></table>
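<p>The rows of this example can be reproduced with a few lines of Python. The sketch below is an illustration of the offset rule, not the library code; <code>ngram_tokenize</code> is a hypothetical name, and each offset pair is half-open, matching the <code>[start,end[</code> notation.</p>

```python
def ngram_tokenize(text, min_gram, max_gram):
    """Return (term, position_increment, position_length, (start, end)) rows.

    Offsets are set per gram, so text[start:end] always equals the term.
    """
    rows = []
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for n in range(min_gram, longest + 1):
            end = start + n
            rows.append((text[start:end], 1, 1, (start, end)))
    return rows


for term, pos_inc, pos_len, offsets in ngram_tokenize("abcde", 2, 3):
    print(term, pos_inc, pos_len, offsets)
# ab 1 1 (0, 2)
# abc 1 1 (0, 3)
# bc 1 1 (1, 3)
# bcd 1 1 (1, 4)
# cd 1 1 (2, 4)
# cde 1 1 (2, 5)
# de 1 1 (3, 5)
```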
<p>This tokenizer changed a lot in Lucene 4.4 in order to:</p>
<ul><li>tokenize in a streaming fashion to support streams which are larger
than 1024 chars (limit of the previous version),</li>
<li>count grams based on Unicode code points instead of UTF-16 code units (<code>char</code> values), and
never split in the middle of surrogate pairs,</li>
<li>give the ability to pre-tokenize the stream (<a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html#Lucene_Net_Analysis_NGram_NGramTokenizer_IsTokenChar_System_Int32_">IsTokenChar(Int32)</a>)
before computing n-grams.</li></ul>
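<p>The difference between counting code points and counting UTF-16 code units matters as soon as the input contains a supplementary character. The Python sketch below illustrates it with hypothetical helpers (<code>utf16_units</code> and <code>code_point_ngrams</code>); Python strings already index by code point, which is the behavior the list above describes.</p>

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units, as a Java or .NET string length reports."""
    return len(s.encode("utf-16-le")) // 2


def code_point_ngrams(text: str, n: int) -> list[str]:
    """All n-grams of exactly n code points, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# U+1D11E (musical symbol G clef) needs a surrogate pair in UTF-16.
text = "a\U0001D11Eb"
grams = code_point_ngrams(text, 2)
print(len(grams))             # 2
print(len(grams[0]))          # 2 code points
print(utf16_units(grams[0]))  # 3 UTF-16 code units
```

<p>Each 2-gram holds two code points even though one of them occupies two UTF-16 code units, so no gram ever ends in the middle of the surrogate pair.</p>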
<p>Additionally, this class doesn&apos;t trim trailing whitespace and emits
tokens in a different order: tokens are now emitted by increasing start
offsets, while they used to be emitted by increasing lengths (which prevented
support for large input streams).
</p>
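<p>The two emission orders can be compared side by side in a short Python sketch. Both generators are hypothetical illustrations, not the library code: ordering by start offset only ever needs to look <code>max_gram</code> characters ahead, while ordering by length has to walk the whole input once per gram size, which is why it cannot be streamed.</p>

```python
def by_start_offset(text, min_gram, max_gram):
    """Current order: all grams starting at offset 0, then offset 1, ..."""
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for n in range(min_gram, longest + 1):
            yield text[start:start + n]


def by_length(text, min_gram, max_gram):
    """Old order: every gram of the shortest size first, then the next size."""
    for n in range(min_gram, max_gram + 1):
        for start in range(len(text) - n + 1):
            yield text[start:start + n]


print(list(by_start_offset("abc", 1, 2)))  # ['a', 'ab', 'b', 'bc', 'c']
print(list(by_length("abc", 1, 2)))        # ['a', 'b', 'c', 'ab', 'bc']
```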
<p>Although <strong>highly</strong> discouraged, it is still possible
to use the old behavior through <a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43NGramTokenizer.html">Lucene43NGramTokenizer</a>.
</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizerFactory.html">NGramTokenizerFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.NGram.NGramTokenizer.html">NGramTokenizer</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_ngrm&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.NGramTokenizerFactory&quot; minGramSize=&quot;1&quot; maxGramSize=&quot;2&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h3 id="enums">Enums
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.EdgeNGramTokenFilter.Side.html">EdgeNGramTokenFilter.Side</a></h4>
<section><p>Specifies which side of the input the n-gram should be generated from.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.NGram.Lucene43EdgeNGramTokenizer.Side.html">Lucene43EdgeNGramTokenizer.Side</a></h4>
<section><p>Specifies which side of the input the n-gram should be generated from.</p>
</section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 The Apache Software Foundation, Licensed under the <a href='http://www.apache.org/licenses/LICENSE-2.0' target='_blank'>Apache License, Version 2.0</a><br> <small>Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation. <br>All other marks mentioned may be trademarks or registered trademarks of their respective owners.</small>
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>