<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Analysis.Pattern
| Apache Lucene.NET 4.8.0-beta00010 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Analysis.Pattern
| Apache Lucene.NET 4.8.0-beta00010 Documentation ">
<meta name="generator" content="docfx 2.56.0.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="analysis-common/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Pattern">
<h1 id="Lucene_Net_Analysis_Pattern" data-uid="Lucene.Net.Analysis.Pattern" class="text-break">Namespace Lucene.Net.Analysis.Pattern
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Set of components for pattern-based (regex) analysis.</p>
</div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupFilterFactory.html">PatternCaptureGroupFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.html">PatternCaptureGroupTokenFilter</a>. </p>
<pre><code>&lt;fieldType name=&quot;text_ptncapturegroup&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.KeywordTokenizerFactory&quot;/>
&lt;filter class=&quot;solr.PatternCaptureGroupFilterFactory&quot; pattern=&quot;([^a-z])&quot; preserve_original=&quot;true&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.html">PatternCaptureGroupTokenFilter</a></h4>
<section><p>This filter uses .NET regular expressions to emit multiple tokens, one for each capture
group in one or more patterns.</p>
<p>
For example, a pattern like:
</p>
<p>
<code>&quot;(https?://([a-zA-Z-_0-9.]+))&quot;</code>
</p>
<p>
when matched against the string &quot;http://www.foo.com/index&quot; would return the
tokens &quot;http://www.foo.com&quot; and &quot;www.foo.com&quot;.
</p>
<p>
If none of the patterns match, or if preserveOriginal is true, the original
token will be preserved.
</p>
<p>
Each pattern is matched as often as it can be, so the pattern
<code>&quot;(...)&quot;</code>, when matched against <code>&quot;abcdefghi&quot;</code>, would
produce <code>[&quot;abc&quot;,&quot;def&quot;,&quot;ghi&quot;]</code>.
</p>
<p>
A camelCaseFilter could be written as:
</p>
<pre><code>&quot;([A-Z]{2,})&quot;,
&quot;(?&lt;![A-Z])([A-Z][a-z]+)&quot;,
&quot;(?:^|\\b|(?&lt;=[0-9_])|(?&lt;=[A-Z]{2}))([a-z]+)&quot;,
&quot;([0-9]+)&quot;</code></pre>
<p>
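Applied to the token <code>camelCaseFilter</code>, these patterns would produce <code>camel</code>, <code>Case</code> and <code>Filter</code>;
if <span class="xref">Lucene.Net.Analysis.Pattern.PatternCaptureGroupTokenFilter.preserveOriginal</span> is true, the filter would also return the original token
<code>camelCaseFilter</code>.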
</p>
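<p>
The matching behaviour described above can be sketched with plain <code>System.Text.RegularExpressions</code> calls.
This is only an illustration of the regex semantics, not the filter's implementation: <code>CaptureGroupSketch</code>
and its <code>Captures</code> helper are hypothetical names, and the sketch ignores the offsets, position increments
and <code>preserveOriginal</code> handling that the real filter performs.
</p>
<pre><code class="lang-csharp">using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureGroupSketch
{
    // Hypothetical helper (not part of Lucene.NET): collects the text of every
    // non-empty capture group (1..n) for every match of every pattern.
    public static List&lt;string&gt; Captures(string token, params string[] patterns)
    {
        var result = new List&lt;string&gt;();
        foreach (string pattern in patterns)
        {
            // Each pattern is matched as often as it can be.
            foreach (Match m in Regex.Matches(token, pattern))
            {
                // Group 0 is the whole match; the capture groups start at 1.
                for (int g = 1; g &lt; m.Groups.Count; g++)
                {
                    if (m.Groups[g].Success &amp;&amp; m.Groups[g].Length &gt; 0)
                    {
                        result.Add(m.Groups[g].Value);
                    }
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Prints: http://www.foo.com, www.foo.com
        Console.WriteLine(string.Join(&quot;, &quot;,
            Captures(&quot;http://www.foo.com/index&quot;, @&quot;(https?://([a-zA-Z-_0-9.]+))&quot;)));

        // Prints: abc, def, ghi
        Console.WriteLine(string.Join(&quot;, &quot;, Captures(&quot;abcdefghi&quot;, &quot;(...)&quot;)));
    }
}</code></pre>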
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilter.html">PatternReplaceCharFilter</a></h4>
<section><p><span class="xref">Lucene.Net.Analysis.CharFilter</span> that uses a regular expression to select the text to replace.
The pattern is matched against each &quot;block&quot; of the char stream.</p>
<p>
ex1) source=&quot;aa bb aa bb&quot;, pattern=&quot;(aa)\s+(bb)&quot; replacement=&quot;$1#$2&quot;
output=&quot;aa#bb aa#bb&quot;
</p>
<p>NOTE: If the replacement produces text whose length differs from the source string
and the field is used for highlighting a term in that text, the highlighted
region may not line up with the term.</p>
<p>
ex2) source=&quot;aa123bb&quot;, pattern=&quot;(aa)\d+(bb)&quot; replacement=&quot;$1 $2&quot;
output=&quot;aa bb&quot;.
If you then search for bb and highlight it, the
highlight snippet will be &quot;aa1&lt;em&gt;23bb&lt;/em&gt;&quot;.
</p>
<p>@since Solr 1.5</p>
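<p>
The two examples above can be reproduced with <code>Regex.Replace</code> alone. The sketch below is an assumption-laden
illustration rather than the char filter itself: it works on whole strings instead of a char stream, and the class name
<code>ReplaceSketch</code> is hypothetical. It shows why ex2 is troublesome for highlighting: the output is shorter than the source.
</p>
<pre><code class="lang-csharp">using System;
using System.Text.RegularExpressions;

public static class ReplaceSketch
{
    public static void Main()
    {
        // ex1: source and output have the same length here.
        // Prints: aa#bb aa#bb
        Console.WriteLine(Regex.Replace(&quot;aa bb aa bb&quot;, @&quot;(aa)\s+(bb)&quot;, &quot;$1#$2&quot;));

        // ex2: the digits are collapsed into one space, so the lengths differ.
        string source = &quot;aa123bb&quot;;
        string output = Regex.Replace(source, @&quot;(aa)\d+(bb)&quot;, &quot;$1 $2&quot;);

        // Prints: aa bb
        Console.WriteLine(output);

        // Prints: 7 -&gt; 5
        Console.WriteLine($&quot;{source.Length} -&gt; {output.Length}&quot;);
    }
}</code></pre>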
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilterFactory.html">PatternReplaceCharFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceCharFilter.html">PatternReplaceCharFilter</a>. </p>
<pre><code>&lt;fieldType name=&quot;text_ptnreplace&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;charFilter class=&quot;solr.PatternReplaceCharFilterFactory&quot;
pattern=&quot;([^a-z])&quot; replacement=&quot;&quot;/>
&lt;tokenizer class=&quot;solr.KeywordTokenizerFactory&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
<p>@since Solr 3.1</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilter.html">PatternReplaceFilter</a></h4>
<section><p>A TokenFilter which applies a <span class="xref">System.Text.RegularExpressions.Regex</span> to each token in the stream,
replacing match occurrences with the specified replacement string.</p>
<p>
<strong>Note:</strong> Depending on the pattern used and the input
<span class="xref">Lucene.Net.Analysis.TokenStream</span>, this <span class="xref">Lucene.Net.Analysis.TokenFilter</span> may produce <span class="xref">Lucene.Net.Analysis.Token</span>s whose text is the empty
string.
</p>
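<p>
As a rough illustration of the note above, the following sketch applies a regex replacement to two token texts with
plain <code>System.Text.RegularExpressions</code>. It does not use the filter; the class name <code>EmptyTokenSketch</code> and the sample
tokens are assumptions chosen for the example.
</p>
<pre><code class="lang-csharp">using System;
using System.Text.RegularExpressions;

public static class EmptyTokenSketch
{
    public static void Main()
    {
        // Remove every character that is not a lowercase ASCII letter.
        var pattern = new Regex(&quot;[^a-z]&quot;);

        foreach (string token in new[] { &quot;abc-123&quot;, &quot;2020&quot; })
        {
            string replaced = pattern.Replace(token, &quot;&quot;);
            // &quot;abc-123&quot; becomes &quot;abc&quot;; &quot;2020&quot; becomes the empty string.
            Console.WriteLine($&quot;\&quot;{token}\&quot; -&gt; \&quot;{replaced}\&quot; (empty: {replaced.Length == 0})&quot;);
        }
    }
}</code></pre>
<p>
A downstream consumer that cannot handle empty token text would need to filter such tokens out.
</p>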
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilterFactory.html">PatternReplaceFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternReplaceFilter.html">PatternReplaceFilter</a>. </p>
<pre><code>&lt;fieldType name=&quot;text_ptnreplace&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.KeywordTokenizerFactory&quot;/>
&lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot; pattern=&quot;([^a-z])&quot; replacement=&quot;&quot;
replace=&quot;all&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizer.html">PatternTokenizer</a></h4>
<section><p>This tokenizer uses regex pattern matching to construct distinct tokens
for the input stream. It takes two arguments: &quot;pattern&quot; and &quot;group&quot;.</p>
<ul><li>&quot;pattern&quot; is the regular expression.</li><li>&quot;group&quot; says which group to extract into tokens.</li></ul>
<p>
group=-1 (the default) is equivalent to &quot;split&quot;. In this case, the tokens will
be equivalent to the output from (without empty tokens):
<span class="xref">System.Text.RegularExpressions.Regex.Split(System.String,System.String)</span>
</p>
<p>
Using group &gt;= 0 selects the matching group as the token. For example, if you have:</p>
<pre><code>pattern = \&apos;([^\&apos;]+)\&apos;
group = 0
input = aaa &apos;bbb&apos; &apos;ccc&apos;</code></pre>
<p>the output will be two tokens: &apos;bbb&apos; and &apos;ccc&apos; (including the &apos; marks). With the same input
but using group=1, the output would be: bbb and ccc (no &apos; marks).
</p>
<p>NOTE: This <span class="xref">Lucene.Net.Analysis.Tokenizer</span> does not output tokens that are of zero length.</p>
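<p>
The difference between the &quot;split&quot; mode and the group modes can be sketched with plain <code>System.Text.RegularExpressions</code>
calls. This is an illustration of the semantics under stated assumptions, not the tokenizer's code: the class name
<code>TokenizerSketch</code> is hypothetical, and the whitespace pattern used for the split case is chosen for the example.
Note that <code>Regex.Split</code> also returns captured text when the pattern contains capture groups, so the split case
below uses a pattern without groups.
</p>
<pre><code class="lang-csharp">using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    public static void Main()
    {
        string input = &quot;aaa &apos;bbb&apos; &apos;ccc&apos;&quot;;

        // group = -1: behaves like a split on the pattern, with empty pieces dropped.
        string[] split = Regex.Split(input, @&quot;\s+&quot;)
            .Where(s =&gt; s.Length &gt; 0)
            .ToArray();
        // Prints: aaa|&apos;bbb&apos;|&apos;ccc&apos;
        Console.WriteLine(string.Join(&quot;|&quot;, split));

        var quoted = new Regex(@&quot;&apos;([^&apos;]+)&apos;&quot;);

        // group = 0: each whole match becomes a token (quotes included).
        // Prints: &apos;bbb&apos;|&apos;ccc&apos;
        Console.WriteLine(string.Join(&quot;|&quot;,
            quoted.Matches(input).Cast&lt;Match&gt;().Select(m =&gt; m.Groups[0].Value)));

        // group = 1: only the first capture group becomes a token (no quotes).
        // Prints: bbb|ccc
        Console.WriteLine(string.Join(&quot;|&quot;,
            quoted.Matches(input).Cast&lt;Match&gt;().Select(m =&gt; m.Groups[1].Value)));
    }
}</code></pre>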
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizerFactory.html">PatternTokenizerFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Pattern.PatternTokenizer.html">PatternTokenizer</a>.
This tokenizer uses regex pattern matching to construct distinct tokens
for the input stream. It takes two arguments: &quot;pattern&quot; and &quot;group&quot;.</p>
<ul><li>&quot;pattern&quot; is the regular expression.</li><li>&quot;group&quot; says which group to extract into tokens.</li></ul>
<p>
group=-1 (the default) is equivalent to &quot;split&quot;. In this case, the tokens will
be equivalent to the output from (without empty tokens):
<span class="xref">System.Text.RegularExpressions.Regex.Split(System.String,System.String)</span>
</p>
<p>
Using group &gt;= 0 selects the matching group as the token. For example, if you have:</p>
<pre><code>pattern = \&apos;([^\&apos;]+)\&apos;
group = 0
input = aaa &apos;bbb&apos; &apos;ccc&apos;</code></pre>
<p>the output will be two tokens: &apos;bbb&apos; and &apos;ccc&apos; (including the &apos; marks). With the same input
but using group=1, the output would be: bbb and ccc (no &apos; marks).
</p>
<p>NOTE: This Tokenizer does not output tokens that are of zero length.</p>
<pre><code>&lt;fieldType name=&quot;text_ptn&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
&lt;analyzer>
&lt;tokenizer class=&quot;solr.PatternTokenizerFactory&quot; pattern=&quot;\&apos;([^\&apos;]+)\&apos;&quot; group=&quot;1&quot;/>
&lt;/analyzer>
&lt;/fieldType></code></pre>
<p>@since solr1.2</p>
</section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<div class="contribution">
<ul class="nav">
<li>
<a href="https://github.com/apache/lucenenet/blob/docs/4.8.0-beta00010/src/Lucene.Net.Analysis.Common/Analysis/Pattern/package.md/#L2" class="contribution-link">Improve this Doc</a>
</li>
</ul>
</div>
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 Licensed to the Apache Software Foundation (ASF)
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>