| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Analysis.Pattern |
| | Apache Lucene.NET 4.8.0-beta00010 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Analysis.Pattern |
| | Apache Lucene.NET 4.8.0-beta00010 Documentation "> |
| <meta name="generator" content="docfx 2.56.0.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="analysis-common/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="/"> |
| <img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search"> |
| <ul class="level0 breadcrumb"> |
| <li> |
| <a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a> |
| <span id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </span> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Pattern"> |
| |
| <h1 id="Lucene_Net_Analysis_Pattern" data-uid="Lucene.Net.Analysis.Pattern" class="text-break">Namespace Lucene.Net.Analysis.Pattern |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Set of components for pattern-based (regex) analysis.</p> |
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupFilterFactory.html">PatternCaptureGroupFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.html">PatternCaptureGroupTokenFilter</a>. </p> |
| <pre><code>&lt;fieldType name="text_ptncapturegroup" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt; |
|     &lt;filter class="solr.PatternCaptureGroupFilterFactory" pattern="([^a-z])" preserve_original="true"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.html">PatternCaptureGroupTokenFilter</a></h4> |
| <section><p>CaptureGroup uses .NET regexes to emit multiple tokens - one for each capture |
| group in one or more patterns.</p> |
| <p> |
| For example, a pattern like: |
| </p> |
| |
| <p> |
| <code>"(https?://([a-zA-Z-_0-9.]+))"</code> |
| </p> |
| |
| <p> |
| when matched against the string "http://www.foo.com/index" would return the |
| tokens "http://www.foo.com" and "www.foo.com". |
| </p> |
| |
| <p> |
| If none of the patterns match, or if preserveOriginal is true, the original |
| token will be preserved. |
| </p> |
| <p> |
| Each pattern is matched as often as it can be, so the pattern |
| <code>"(...)"</code>, when matched against <code>"abcdefghi"</code>, would |
| produce <code>["abc","def","ghi"]</code>. |
| </p> |
| <p> |
| A camelCaseFilter could be written as: |
| </p> |
| <p> |
| <pre><code> "([A-Z]{2,})", |
| "(?&lt;![A-Z])([A-Z][a-z]+)", |
| "(?:^|\\b|(?&lt;=[0-9_])|(?&lt;=[A-Z]{2}))([a-z]+)", |
| "([0-9]+)"</code></pre> |
| </p> |
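| <p> |
| Applied to the token <code>camelCaseFilter</code>, these patterns produce <code>camel</code>, <code>Case</code> and <code>Filter</code>. |
| If <span class="xref">Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.preserveOriginal</span> is true, the original token |
| <code>camelCaseFilter</code> is also returned. |
| </p> |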
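<p>The capture-group behaviour described above can be sketched outside of Lucene.NET. The following is a minimal illustration using Python's <code>re</code> module rather than .NET regexes; the function name <code>capture_group_tokens</code>, its parameters, and the ordering of the emitted tokens are assumptions made for this sketch, not part of the Lucene.NET API.</p>

```python
import re


def capture_group_tokens(token, patterns, preserve_original=False):
    """Collect the text of every capture group from every match of every pattern."""
    found = []
    for pattern in patterns:
        # Each pattern is matched as often as it can be.
        for match in re.finditer(pattern, token):
            for index in range(1, match.re.groups + 1):
                text = match.group(index)
                if text:  # skip groups that did not participate or are empty
                    found.append((match.start(index), text))
    # Assumption for this sketch: order tokens by start offset (stable sort).
    found.sort(key=lambda item: item[0])
    tokens = [text for _, text in found]
    # Keep the original token when asked to, or when nothing matched.
    if preserve_original or not tokens:
        tokens.insert(0, token)
    return tokens
```

<p>With this sketch, the URL pattern applied to <code>"http://www.foo.com/index"</code> yields <code>"http://www.foo.com"</code> and <code>"www.foo.com"</code>, and <code>"(...)"</code> applied to <code>"abcdefghi"</code> yields <code>"abc"</code>, <code>"def"</code> and <code>"ghi"</code>.</p>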
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilter.html">PatternReplaceCharFilter</a></h4> |
| <section><p><span class="xref">Lucene.Net.Analysis.CharFilter</span> that uses a regular expression to select the text to replace. |
| Pattern matching is performed on each "block" of the char stream.</p> |
| <p> |
| ex1) source="aa bb aa bb", pattern="(aa)\s+(bb)" replacement="$1#$2" |
| output="aa#bb aa#bb" |
| </p> |
| |
| <p>NOTE: If the replacement produces text whose length differs from that of the source string |
| and the field is used for highlighting a term in that text, the highlighted |
| span may be wrong.</p> |
| <p> |
| ex2) source="aa123bb", pattern="(aa)\d+(bb)" replacement="$1 $2" |
| output="aa bb" |
| If you then search for bb and highlight it, you will get |
| highlight snippet="aa1&lt;em&gt;23bb&lt;/em&gt;" |
| </p> |
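<p>The two examples above can be reproduced with a short sketch. It uses Python's <code>re.sub</code> as a stand-in for the .NET regex engine, so group references are written <code>\1</code> and <code>\2</code> instead of <code>$1</code> and <code>$2</code>; the variable names are hypothetical and nothing here calls the Lucene.NET API.</p>

```python
import re

# ex1: join each "aa bb" pair with '#'.
ex1 = re.sub(r"(aa)\s+(bb)", r"\1#\2", "aa bb aa bb")

# ex2: the replacement is shorter than the text it replaces.
source = "aa123bb"
ex2 = re.sub(r"(aa)\d+(bb)", r"\1 \2", source)

# "bb" sits at a different position in the output than in the source,
# which is why offsets computed on the output do not line up with the source.
offset_in_source = source.index("bb")
offset_in_output = ex2.index("bb")
length_change = len(ex2) - len(source)
```

<p>Here <code>ex1</code> is <code>"aa#bb aa#bb"</code> and <code>ex2</code> is <code>"aa bb"</code>, which is two characters shorter than the source; the term <code>bb</code> starts at offset 5 in the source but at offset 3 in the output.</p>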
| |
| <p>@since Solr 1.5</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilterFactory.html">PatternReplaceCharFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilter.html">PatternReplaceCharFilter</a>. </p> |
| <pre><code>&lt;fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;charFilter class="solr.PatternReplaceCharFilterFactory" |
|                 pattern="([^a-z])" replacement=""/&gt; |
|     &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| |
| <p>@since Solr 3.1</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilter.html">PatternReplaceFilter</a></h4> |
| <section><p>A TokenFilter which applies a <span class="xref">System.Text.RegularExpressions.Regex</span> to each token in the stream, |
| replacing match occurrences with the specified replacement string.</p> |
| <p> |
| <strong>Note:</strong> Depending on the pattern used and the input |
| <span class="xref">Lucene.Net.Analysis.TokenStream</span>, this <span class="xref">Lucene.Net.Analysis.TokenFilter</span> may produce <span class="xref">Lucene.Net.Analysis.Token</span>s whose text is the empty |
| string. |
| </p> |
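<p>As a rough illustration of the note above, the sketch below applies a replacement to each token in a list and shows how an empty token can result. It uses Python's <code>re.sub</code> instead of .NET's <code>Regex</code>; the function <code>pattern_replace</code> and its <code>replace_all</code> parameter are hypothetical names chosen for this example.</p>

```python
import re


def pattern_replace(tokens, pattern, replacement, replace_all=True):
    """Apply the replacement to each token, for all matches or only the first."""
    # In re.sub, count=0 means "replace every match"; count=1 replaces only the first.
    count = 0 if replace_all else 1
    return [re.sub(pattern, replacement, token, count=count) for token in tokens]


all_replaced = pattern_replace(["Foo-Bar", "123", "abc"], r"([^a-z])", "")
first_replaced = pattern_replace(["Foo-Bar"], r"([^a-z])", "", replace_all=False)
```

<p>With the pattern <code>([^a-z])</code> and an empty replacement, <code>"Foo-Bar"</code> becomes <code>"ooar"</code>, <code>"abc"</code> is unchanged, and <code>"123"</code> becomes the empty string, so downstream components should be prepared for empty tokens.</p>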
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilterFactory.html">PatternReplaceFilterFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilter.html">PatternReplaceFilter</a>. </p> |
| <pre><code>&lt;fieldType name="text_ptnreplace" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.KeywordTokenizerFactory"/&gt; |
|     &lt;filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" |
|             replace="all"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizer.html">PatternTokenizer</a></h4> |
| <section><p>This tokenizer uses regex pattern matching to construct distinct tokens |
| for the input stream. It takes two arguments: "pattern" and "group".</p> |
| <ul><li>"pattern" is the regular expression.</li><li>"group" specifies which group to extract into tokens.</li></ul> |
| <p> |
| group=-1 (the default) is equivalent to "split". In this case, the tokens will |
| be equivalent to the output (without empty tokens) of: |
| <span class="xref">System.Text.RegularExpressions.Regex.Split(System.String)</span> |
| </p> |
| <p> |
| Using group &gt;= 0 selects the matching group as the token. For example, if you have:</p> |
| <pre><code> pattern = \'([^\']+)\' |
| group = 0 |
| input = aaa 'bbb' 'ccc'</code></pre> |
| <p>the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input |
| but using group=1, the output would be: bbb and ccc (no ' marks). |
| </p> |
| <p>NOTE: This <span class="xref">Lucene.Net.Analysis.Tokenizer</span> does not output tokens that are of zero length.</p> |
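<p>The difference between "split" mode and group extraction can be sketched as follows. This is an illustration in Python's <code>re</code> module under the semantics described above, not the Lucene.NET implementation; <code>pattern_tokenize</code> is a hypothetical helper name.</p>

```python
import re


def pattern_tokenize(text, pattern, group=-1):
    """Tokenize text by splitting on the pattern (group=-1) or by extracting a group."""
    if group == -1:
        # "Split" mode: tokens are the stretches of text between matches.
        pieces, last = [], 0
        for match in re.finditer(pattern, text):
            pieces.append(text[last:match.start()])
            last = match.end()
        pieces.append(text[last:])
    else:
        # Extraction mode: each match contributes the requested group.
        pieces = [match.group(group) for match in re.finditer(pattern, text)]
    # Zero-length tokens are never emitted.
    return [piece for piece in pieces if piece]


quoted = r"'([^']+)'"
with_quotes = pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=0)
without_quotes = pattern_tokenize("aaa 'bbb' 'ccc'", quoted, group=1)
split_on_space = pattern_tokenize(" aaa  bbb ccc ", r"\s+")
```

<p>Here <code>with_quotes</code> holds <code>'bbb'</code> and <code>'ccc'</code> including the quote marks, <code>without_quotes</code> holds <code>bbb</code> and <code>ccc</code>, and <code>split_on_space</code> holds <code>aaa</code>, <code>bbb</code> and <code>ccc</code> with the empty leading and trailing pieces dropped.</p>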
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizerFactory.html">PatternTokenizerFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizer.html">PatternTokenizer</a>. |
| This tokenizer uses regex pattern matching to construct distinct tokens |
| for the input stream. It takes two arguments: "pattern" and "group".</p> |
| <ul><li>"pattern" is the regular expression.</li><li>"group" specifies which group to extract into tokens.</li></ul> |
| <p> |
| group=-1 (the default) is equivalent to "split". In this case, the tokens will |
| be equivalent to the output (without empty tokens) of: |
| <span class="xref">System.Text.RegularExpressions.Regex.Split(System.String)</span> |
| </p> |
| <p> |
| Using group &gt;= 0 selects the matching group as the token. For example, if you have:</p> |
| <pre><code> pattern = \'([^\']+)\' |
| group = 0 |
| input = aaa 'bbb' 'ccc'</code></pre> |
| <p>the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input |
| but using group=1, the output would be: bbb and ccc (no ' marks). |
| </p> |
| <p>NOTE: This Tokenizer does not output tokens that are of zero length.</p> |
| <pre><code>&lt;fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100"&gt; |
|   &lt;analyzer&gt; |
|     &lt;tokenizer class="solr.PatternTokenizerFactory" pattern="\'([^\']+)\'" group="1"/&gt; |
|   &lt;/analyzer&gt; |
| &lt;/fieldType&gt;</code></pre> |
| |
| <p>@since solr1.2</p> |
| </section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <div class="contribution"> |
| <ul class="nav"> |
| <li> |
| <a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00010/src/Lucene.Net.Analysis.Common/Analysis/Pattern/package.md/#L2" class="contribution-link">Improve this Doc</a> |
| </li> |
| </ul> |
| </div> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2020 Licensed to the Apache Software Foundation (ASF) |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |