| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Analysis |
| | Apache Lucene.NET 4.8.0 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Analysis |
| | Apache Lucene.NET 4.8.0 Documentation "> |
| <meta name="generator" content="docfx 2.47.0.0"> |
| |
| <link rel="shortcut icon" href="../../logo/favicon.ico"> |
| <link rel="stylesheet" href="../../styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="../../styles/docfx.css"> |
| <link rel="stylesheet" href="../../styles/main.css"> |
| <meta property="docfx:navrel" content="../../toc.html"> |
| <meta property="docfx:tocrel" content="../toc.html"> |
| |
| <meta property="docfx:rel" content="../../"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="../../index.html"> |
| <img id="logo" class="svg" src="../../logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search" id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis"> |
| |
| <h1 id="Lucene_Net_Analysis" data-uid="Lucene.Net.Analysis" class="text-break">Namespace Lucene.Net.Analysis |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Support for testing analysis components.</p> |
| <p> The main classes of interest are: * <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.BaseTokenStreamTestCase.html">BaseTokenStreamTestCase</a>: Highly recommended to use its helper methods, (especially in conjunction with <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.MockAnalyzer.html">MockAnalyzer</a> or <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.MockTokenizer.html">MockTokenizer</a>), as it contains many assertions and checks to catch bugs. * <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.MockTokenizer.html">MockTokenizer</a>: Tokenizer for testing. Tokenizer that serves as a replacement for WHITESPACE, SIMPLE, and KEYWORD tokenizers. If you are writing a component such as a TokenFilter, its a great idea to test it wrapping this tokenizer instead for extra checks. * <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.MockAnalyzer.html">MockAnalyzer</a>: Analyzer for testing. Analyzer that uses MockTokenizer for additional verification. If you are testing a custom component such as a queryparser or analyzer-wrapper that consumes analysis streams, its a great idea to test it with this analyzer instead. </p> |
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a></h4> |
| <section><p>An <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a> builds <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>s, which analyze text. It thus represents a |
| policy for extracting index terms from text. |
| <p> |
| In order to define what analysis is done, subclasses must define their |
| <a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a> in <a class="xref" href="Lucene.Net.Analysis.Analyzer.html#Lucene_Net_Analysis_Analyzer_CreateComponents_System_String_System_IO_TextReader_">CreateComponents(String, TextReader)</a>. |
| The components are then reused in each call to <a class="xref" href="Lucene.Net.Analysis.Analyzer.html#Lucene_Net_Analysis_Analyzer_GetTokenStream_System_String_System_IO_TextReader_">GetTokenStream(String, TextReader)</a>. |
| <p> |
| Simple example:</p> |
| <pre><code>Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName, reader) => |
| { |
| Tokenizer source = new FooTokenizer(reader); |
| TokenStream filter = new FooFilter(source); |
| filter = new BarFilter(filter); |
| return new TokenStreamComponents(source, filter); |
| });</code></pre> |
| <p>For more examples, see the <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.html">Lucene.Net.Analysis</a> namespace documentation. |
| <p> |
| For some concrete implementations bundled with Lucene, look in the analysis modules: |
| <ul><li>Common: |
| Analyzers for indexing content in different languages and domains.</li><li>ICU: |
| Exposes functionality from ICU to Apache Lucene.</li><li>Kuromoji: |
| Morphological analyzer for Japanese text.</li><li>Morfologik: |
| Dictionary-driven lemmatization for the Polish language.</li><li>Phonetic: |
| Analysis for indexing phonetic signatures (for sounds-alike search).</li><li>Smart Chinese: |
| Analyzer for Simplified Chinese, which indexes words.</li><li>Stempel: |
| Algorithmic Stemmer for the Polish Language.</li><li>UIMA: |
| Analysis integration with Apache UIMA.</li></ul></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Analyzer.GlobalReuseStrategy.html">Analyzer.GlobalReuseStrategy</a></h4> |
| <section><p>Implementation of <a class="xref" href="Lucene.Net.Analysis.ReuseStrategy.html">ReuseStrategy</a> that reuses the same components for |
| every field. </p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Analyzer.PerFieldReuseStrategy.html">Analyzer.PerFieldReuseStrategy</a></h4> |
| <section><p>Implementation of <a class="xref" href="Lucene.Net.Analysis.ReuseStrategy.html">ReuseStrategy</a> that reuses components per-field by |
| maintaining a Map of <a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a> per field name.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.AnalyzerWrapper.html">AnalyzerWrapper</a></h4> |
| <section><p>Extension to <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a> suitable for <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a>s which wrap |
| other <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a>s. |
| <p> |
| <a class="xref" href="Lucene.Net.Analysis.AnalyzerWrapper.html#Lucene_Net_Analysis_AnalyzerWrapper_GetWrappedAnalyzer_System_String_">GetWrappedAnalyzer(String)</a> allows the <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a> |
| to wrap multiple <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a>s which are selected on a per field basis. |
| <p> |
| <a class="xref" href="Lucene.Net.Analysis.AnalyzerWrapper.html#Lucene_Net_Analysis_AnalyzerWrapper_WrapComponents_System_String_Lucene_Net_Analysis_TokenStreamComponents_">WrapComponents(String, TokenStreamComponents)</a> allows the |
| <a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a> of the wrapped <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a> to then be wrapped |
| (such as adding a new <a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a> to form new <a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a>).</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.CachingTokenFilter.html">CachingTokenFilter</a></h4> |
| <section><p>This class can be used if the token attributes of a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> |
| are intended to be consumed more than once. It caches |
| all token attribute states locally in a List.</p> |
| <p><p><a class="xref" href="Lucene.Net.Analysis.CachingTokenFilter.html">CachingTokenFilter</a> implements the optional method |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_Reset">Reset()</a>, which repositions the |
| stream to the first <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.CharFilter.html">CharFilter</a></h4> |
| <section><p>Subclasses of <a class="xref" href="Lucene.Net.Analysis.CharFilter.html">CharFilter</a> can be chained to filter a <span class="xref">System.IO.TextReader</span> |
| They can be used as <span class="xref">System.IO.TextReader</span> with additional offset |
| correction. <a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a>s will automatically use <a class="xref" href="Lucene.Net.Analysis.CharFilter.html#Lucene_Net_Analysis_CharFilter_CorrectOffset_System_Int32_">CorrectOffset(Int32)</a> |
| if a <a class="xref" href="Lucene.Net.Analysis.CharFilter.html">CharFilter</a> subclass is used. |
| <p> |
| This class is abstract: at a minimum you must implement <span class="xref">System.IO.TextReader.Read(System.Char[], System.Int32, System.Int32)</span>, |
| transforming the input in some way from <a class="xref" href="Lucene.Net.Analysis.CharFilter.html#Lucene_Net_Analysis_CharFilter_m_input">m_input</a>, and <a class="xref" href="Lucene.Net.Analysis.CharFilter.html#Lucene_Net_Analysis_CharFilter_Correct_System_Int32_">Correct(Int32)</a> |
| to adjust the offsets to match the originals. |
| <p> |
| You can optionally provide more efficient implementations of additional methods |
| like <span class="xref">System.IO.TextReader.Read()</span>, but this is not required. |
| <p> |
| For examples and integration with <a class="xref" href="Lucene.Net.Analysis.Analyzer.html">Analyzer</a>, see the |
| <a class="xref" href="../Lucene.Net.TestFramework/Lucene.Net.Analysis.html">Lucene.Net.Analysis</a> namespace documentation.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NumericTokenStream.html">NumericTokenStream</a></h4> |
| <section><p><strong>Expert:</strong> this class provides a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> |
| for indexing numeric values that can be used by <a class="xref" href="Lucene.Net.Search.NumericRangeQuery.html">NumericRangeQuery</a> |
| or <a class="xref" href="Lucene.Net.Search.NumericRangeFilter.html">NumericRangeFilter</a>.</p> |
| <p>Note that for simple usage, <a class="xref" href="Lucene.Net.Documents.Int32Field.html">Int32Field</a>, <a class="xref" href="Lucene.Net.Documents.Int64Field.html">Int64Field</a>, |
| <a class="xref" href="Lucene.Net.Documents.SingleField.html">SingleField</a> or <a class="xref" href="Lucene.Net.Documents.DoubleField.html">DoubleField</a> is |
| recommended. These fields disable norms and |
| term freqs, as they are not usually needed during |
| searching. If you need to change these settings, you |
| should use this class. |
| |
| <p>Here's an example usage, for an <span class="xref">System.Int32</span> field: |
| |
| <pre><code> IndexableFieldType fieldType = new IndexableFieldType(TextField.TYPE_NOT_STORED) |
| { |
| OmitNorms = true, |
| IndexOptions = IndexOptions.DOCS_ONLY |
| }; |
| Field field = new Field(name, new NumericTokenStream(precisionStep).SetInt32Value(value), fieldType); |
| document.Add(field);</code></pre> |
| |
| <p>For optimal performance, re-use the <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> and <a class="xref" href="Lucene.Net.Documents.Field.html">Field</a> instance |
| for more than one document: |
| |
| <pre><code> NumericTokenStream stream = new NumericTokenStream(precisionStep); |
| IndexableFieldType fieldType = new IndexableFieldType(TextField.TYPE_NOT_STORED) |
| { |
| OmitNorms = true, |
| IndexOptions = IndexOptions.DOCS_ONLY |
| }; |
| Field field = new Field(name, stream, fieldType); |
| Document document = new Document(); |
| document.Add(field); |
| |
| for(all documents) |
| { |
| stream.SetInt32Value(value) |
| writer.AddDocument(document); |
| }</code></pre> |
| |
| <p>this stream is not intended to be used in analyzers; |
| it's more for iterating the different precisions during |
| indexing a specific numeric value.</p> |
| |
| <p><strong>NOTE</strong>: as token streams are only consumed once |
| the document is added to the index, if you index more |
| than one numeric field, use a separate <a class="xref" href="Lucene.Net.Analysis.NumericTokenStream.html">NumericTokenStream</a> |
| instance for each.</p> |
| |
| <p>See <a class="xref" href="Lucene.Net.Search.NumericRangeQuery.html">NumericRangeQuery</a> for more details on the |
| <code>precisionStep</code> parameter as well as how numeric fields work under the hood. |
| </p> |
| |
| <p>@since 2.9</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NumericTokenStream.NumericTermAttribute.html">NumericTokenStream.NumericTermAttribute</a></h4> |
| <section><p>Implementation of <a class="xref" href="Lucene.Net.Analysis.NumericTokenStream.INumericTermAttribute.html">NumericTokenStream.INumericTermAttribute</a>. |
| @lucene.internal |
| @since 4.0</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.ReusableStringReader.html">ReusableStringReader</a></h4> |
| <section><p>Internal class to enable reuse of the string reader by <a class="xref" href="Lucene.Net.Analysis.Analyzer.html#Lucene_Net_Analysis_Analyzer_GetTokenStream_System_String_System_String_">GetTokenStream(String, String)</a></p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.ReuseStrategy.html">ReuseStrategy</a></h4> |
| <section><p>Strategy defining how <a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a> are reused per call to |
| <a class="xref" href="Lucene.Net.Analysis.Analyzer.html#Lucene_Net_Analysis_Analyzer_GetTokenStream_System_String_System_IO_TextReader_">GetTokenStream(String, TextReader)</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a></h4> |
| <section><p>A <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> is an occurrence of a term from the text of a field. It consists of |
| a term's text, the start and end offset of the term in the text of the field, |
| and a type string. |
| <p> |
| The start and end offsets permit applications to re-associate a token with |
| its source text, e.g., to display highlighted query terms in a document |
| browser, or to show matching text fragments in a KWIC (KeyWord In Context) |
| display, etc. |
| <p> |
| The type is a string, assigned by a lexical analyzer |
| (a.k.a. tokenizer), naming the lexical or syntactic class that the token |
| belongs to. For example an end of sentence marker token might be implemented |
| with type "eos". The default token type is "word". |
| <p> |
| A Token can optionally have metadata (a.k.a. payload) in the form of a variable |
| length byte array. Use <a class="xref" href="Lucene.Net.Index.DocsAndPositionsEnum.html#Lucene_Net_Index_DocsAndPositionsEnum_GetPayload">GetPayload()</a> to retrieve the |
| payloads from the index.</p> |
| <p><p> |
| |
| <p><strong>NOTE:</strong> As of 2.9, Token implements all <a class="xref" href="Lucene.Net.Util.IAttribute.html">IAttribute</a> interfaces |
| that are part of core Lucene and can be found in the <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.html">Lucene.Net.Analysis.TokenAttributes</a> namespace. |
| Even though it is not necessary to use <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> anymore, with the new <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> API it can |
| be used as convenience class that implements all <a class="xref" href="Lucene.Net.Util.IAttribute.html">IAttribute</a>s, which is especially useful |
| to easily switch from the old to the new <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> API. |
| |
| <p><p> |
| |
| <p><a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a>s and <a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a>s should try to re-use a <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> |
| instance when possible for best performance, by |
| implementing the <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a> API. |
| Failing that, to create a new <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> you should first use |
| one of the constructors that starts with null text. To load |
| the token from a char[] use <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_CopyBuffer_System_Char___System_Int32_System_Int32_">CopyBuffer(Char[], Int32, Int32)</a>. |
| To load from a <span class="xref">System.String</span> use <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_SetEmpty">SetEmpty()</a> followed by |
| <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_Append_System_String_">Append(String)</a> or <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_Append_System_String_System_Int32_System_Int32_">Append(String, Int32, Int32)</a>. |
| Alternatively you can get the <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a>'s termBuffer by calling either <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_Buffer">Buffer</a>, |
| if you know that your text is shorter than the capacity of the termBuffer |
| or <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_ResizeBuffer_System_Int32_">ResizeBuffer(Int32)</a>, if there is any possibility |
| that you may need to grow the buffer. Fill in the characters of your term into this |
| buffer, with <span class="xref">System.String.ToCharArray(System.Int32,System.Int32)</span> if loading from a string, |
| or with <span class="xref">System.Array.Copy(System.Array,System.Int32,System.Array,System.Int32,System.Int32)</span>, |
| and finally call <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_ICharTermAttribute_SetLength_System_Int32_">SetLength(Int32)</a> to |
| set the length of the term text. See <a target="_top" href="https://issues.apache.org/jira/browse/LUCENE-969">LUCENE-969</a> |
| for details.</p> |
| <p>Typical Token reuse patterns: |
| <ul><li> Copying text from a string (type is reset to <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.TypeAttribute.html#Lucene_Net_Analysis_TokenAttributes_TypeAttribute_DEFAULT_TYPE">DEFAULT_TYPE</a> if not specified): |
| <pre><code> return reusableToken.Reinit(string, startOffset, endOffset[, type]);</code></pre> |
| </li><li> Copying some text from a string (type is reset to <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.TypeAttribute.html#Lucene_Net_Analysis_TokenAttributes_TypeAttribute_DEFAULT_TYPE">DEFAULT_TYPE</a> if not specified): |
| <pre><code> return reusableToken.Reinit(string, 0, string.Length, startOffset, endOffset[, type]);</code></pre> |
| </li><li> Copying text from char[] buffer (type is reset to <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.TypeAttribute.html#Lucene_Net_Analysis_TokenAttributes_TypeAttribute_DEFAULT_TYPE">DEFAULT_TYPE</a> if not specified): |
| <pre><code> return reusableToken.Reinit(buffer, 0, buffer.Length, startOffset, endOffset[, type]);</code></pre> |
| </li><li> Copying some text from a char[] buffer (type is reset to <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.TypeAttribute.html#Lucene_Net_Analysis_TokenAttributes_TypeAttribute_DEFAULT_TYPE">DEFAULT_TYPE</a> if not specified): |
| <pre><code> return reusableToken.Reinit(buffer, start, end - start, startOffset, endOffset[, type]);</code></pre> |
| </li><li> Copying from one one <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> to another (type is reset to <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.TypeAttribute.html#Lucene_Net_Analysis_TokenAttributes_TypeAttribute_DEFAULT_TYPE">DEFAULT_TYPE</a> if not specified): |
| <pre><code> return reusableToken.Reinit(source.Buffer, 0, source.Length, source.StartOffset, source.EndOffset[, source.Type]);</code></pre> |
| </li></ul> |
| A few things to note: |
| <ul><li><a class="xref" href="Lucene.Net.Analysis.Token.html#Lucene_Net_Analysis_Token_Clear">Clear()</a> initializes all of the fields to default values. this was changed in contrast to Lucene 2.4, but should affect no one.</li><li>Because <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>s can be chained, one cannot assume that the <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a>'s current type is correct.</li><li>The startOffset and endOffset represent the start and offset in the source text, so be careful in adjusting them.</li><li>When caching a reusable token, clone it. When injecting a cached token into a stream that can be reset, clone it again.</li></ul> |
| </p> |
| <p> |
| <strong>Please note:</strong> With Lucene 3.1, the <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.CharTermAttribute.html#Lucene_Net_Analysis_TokenAttributes_CharTermAttribute_ToString">ToString()</a> method had to be changed to match the |
| <a class="xref" href="Lucene.Net.Support.ICharSequence.html">ICharSequence</a> interface introduced by the interface <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ICharTermAttribute.html">ICharTermAttribute</a>. |
| this method now only prints the term text, no additional information anymore. |
| </p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Token.TokenAttributeFactory.html">Token.TokenAttributeFactory</a></h4> |
| <section><p><strong>Expert:</strong> Creates a <a class="xref" href="Lucene.Net.Analysis.Token.TokenAttributeFactory.html">Token.TokenAttributeFactory</a> returning <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> as instance for the basic attributes |
| and for all other attributes calls the given delegate factory. |
| @since 3.0</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a></h4> |
| <section><p>A <a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a> is a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> whose input is another <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>. |
| <p> |
| This is an abstract class; subclasses must override <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a></h4> |
| <section><p>A <a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a> is a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> whose input is a <span class="xref">System.IO.TextReader</span>. |
| <p> |
| This is an abstract class; subclasses must override <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a> |
| <p> |
| NOTE: Subclasses overriding <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a> must |
| call <a class="xref" href="Lucene.Net.Util.AttributeSource.html#Lucene_Net_Util_AttributeSource_ClearAttributes">ClearAttributes()</a> before |
| setting attributes.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a></h4> |
| <section><p>A <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> enumerates the sequence of tokens, either from |
| <a class="xref" href="Lucene.Net.Documents.Field.html">Field</a>s of a <a class="xref" href="Lucene.Net.Documents.Document.html">Document</a> or from query text. |
| <p> |
| this is an abstract class; concrete subclasses are: |
| <ul><li><a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a>, a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> whose input is a <span class="xref">System.IO.TextReader</span>; and</li><li><a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a>, a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> whose input is another |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>.</li></ul> |
| A new <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> API has been introduced with Lucene 2.9. this API |
| has moved from being <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a>-based to <a class="xref" href="Lucene.Net.Util.IAttribute.html">IAttribute</a>-based. While |
| <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> still exists in 2.9 as a convenience class, the preferred way |
| to store the information of a <a class="xref" href="Lucene.Net.Analysis.Token.html">Token</a> is to use <span class="xref">System.Attribute</span>s. |
| <p> |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> now extends <a class="xref" href="Lucene.Net.Util.AttributeSource.html">AttributeSource</a>, which provides |
| access to all of the token <a class="xref" href="Lucene.Net.Util.IAttribute.html">IAttribute</a>s for the <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>. |
| Note that only one instance per <span class="xref">System.Attribute</span> is created and reused |
| for every token. This approach reduces object creation and allows local |
| caching of references to the <span class="xref">System.Attribute</span>s. See |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a> for further details. |
| <p> |
| <strong>The workflow of the new <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> API is as follows:</strong> |
| <ol><li>Instantiation of <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>/<a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a>s which add/get |
| attributes to/from the <a class="xref" href="Lucene.Net.Util.AttributeSource.html">AttributeSource</a>.</li><li>The consumer calls <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_Reset">Reset()</a>.</li><li>The consumer retrieves attributes from the stream and stores local |
| references to all attributes it wants to access.</li><li>The consumer calls <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a> until it returns false |
| consuming the attributes after each call.</li><li>The consumer calls <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_End">End()</a> so that any end-of-stream operations |
| can be performed.</li><li>The consumer calls <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_Dispose">Dispose()</a> to release any resource when finished |
| using the <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>.</li></ol> |
| To make sure that filters and consumers know which attributes are available, |
| the attributes must be added during instantiation. Filters and consumers are |
| not required to check for availability of attributes in |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a>. |
| <p> |
| You can find some example code for the new API in the analysis |
| documentation. |
| <p> |
| Sometimes it is desirable to capture a current state of a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>, |
| e.g., for buffering purposes (see <a class="xref" href="Lucene.Net.Analysis.CachingTokenFilter.html">CachingTokenFilter</a>, |
| TeeSinkTokenFilter). For this usecase |
| <a class="xref" href="Lucene.Net.Util.AttributeSource.html#Lucene_Net_Util_AttributeSource_CaptureState">CaptureState()</a> and <a class="xref" href="Lucene.Net.Util.AttributeSource.html#Lucene_Net_Util_AttributeSource_RestoreState_Lucene_Net_Util_AttributeSource_State_">RestoreState(AttributeSource.State)</a> |
| can be used. |
| <p>The <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a>-API in Lucene is based on the decorator pattern. |
| Therefore all non-abstract subclasses must be sealed or have at least a sealed |
| implementation of <a class="xref" href="Lucene.Net.Analysis.TokenStream.html#Lucene_Net_Analysis_TokenStream_IncrementToken">IncrementToken()</a>! This is checked when assertions are enabled.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.TokenStreamComponents.html">TokenStreamComponents</a></h4> |
| <section><p>This class encapsulates the outer components of a token stream. It provides |
| access to the source (<a class="xref" href="Lucene.Net.Analysis.Tokenizer.html">Tokenizer</a>) and the outer end (sink), an |
| instance of <a class="xref" href="Lucene.Net.Analysis.TokenFilter.html">TokenFilter</a> which also serves as the |
| <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> returned by |
| <a class="xref" href="Lucene.Net.Analysis.Analyzer.html#Lucene_Net_Analysis_Analyzer_GetTokenStream_System_String_System_IO_TextReader_">GetTokenStream(String, TextReader)</a>.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.TokenStreamToAutomaton.html">TokenStreamToAutomaton</a></h4> |
| <section><p>Consumes a <a class="xref" href="Lucene.Net.Analysis.TokenStream.html">TokenStream</a> and creates an <a class="xref" href="Lucene.Net.Util.Automaton.Automaton.html">Automaton</a> |
| where the transition labels are UTF8 bytes (or Unicode |
| code points if unicodeArcs is true) from the <a class="xref" href="Lucene.Net.Analysis.TokenAttributes.ITermToBytesRefAttribute.html">ITermToBytesRefAttribute</a>. |
| Between tokens we insert |
| <a class="xref" href="Lucene.Net.Analysis.TokenStreamToAutomaton.html#Lucene_Net_Analysis_TokenStreamToAutomaton_POS_SEP">POS_SEP</a> and for holes we insert <a class="xref" href="Lucene.Net.Analysis.TokenStreamToAutomaton.html#Lucene_Net_Analysis_TokenStreamToAutomaton_HOLE">HOLE</a>.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h3 id="interfaces">Interfaces |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.NumericTokenStream.INumericTermAttribute.html">NumericTokenStream.INumericTermAttribute</a></h4> |
| <section><p><strong>Expert:</strong> Use this attribute to get the details of the currently generated token. |
| @lucene.experimental |
| @since 4.0</p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <div class="contribution"> |
| <ul class="nav"> |
| <li> |
| <a href="https://github.com/apache/lucenenet/blob/docs-4.8.0-beta00007/src/Lucene.Net.TestFramework/Analysis/package.md/#L2" class="contribution-link">Improve this Doc</a> |
| </li> |
| </ul> |
| </div> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="../../styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="../../styles/docfx.js"></script> |
| <script type="text/javascript" src="../../styles/main.js"></script> |
| </body> |
| </html> |