<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Analysis.Cjk
| Apache Lucene.NET 4.8.0-beta00010 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Analysis.Cjk
| Apache Lucene.NET 4.8.0-beta00010 Documentation ">
<meta name="generator" content="docfx 2.56.0.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="analysis-common/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Cjk">
<h1 id="Lucene_Net_Analysis_Cjk" data-uid="Lucene.Net.Analysis.Cjk" class="text-break">Namespace Lucene.Net.Analysis.Cjk
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Analyzer for Chinese, Japanese, and Korean, which indexes bigrams.
This analyzer generates bigram terms, which are overlapping groups of two adjacent Han, Hiragana, Katakana, or Hangul characters.</p>
<p>Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.</p>
<ul>
<li>ChineseAnalyzer (in the analyzers/cn package): Indexes unigrams (individual Chinese characters) as tokens.</li>
<li>CJKAnalyzer (in this package): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.</li>
<li>SmartChineseAnalyzer (in the analyzers/smartcn package): Indexes words (attempts to segment Chinese text into words) as tokens.</li>
</ul>
<p>Example phrase: &quot;我是中国人&quot;</p>
<ol>
<li>ChineseAnalyzer: 我-是-中-国-人</li>
<li>CJKAnalyzer: 我是-是中-中国-国人</li>
<li>SmartChineseAnalyzer: 我-是-中国-人</li>
</ol>
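<p>The difference between the unigram and bigram strategies can be sketched in a few lines of Python. This is only an illustration of the tokenization idea, not the Lucene.NET implementation: the helper names <code>unigrams</code> and <code>overlapping_bigrams</code> are hypothetical, and word segmentation as done by SmartChineseAnalyzer is not shown because it needs a language model.</p>

```python
def unigrams(text):
    """Return each character as its own token."""
    return list(text)


def overlapping_bigrams(text):
    """Return every pair of adjacent characters, advancing one character at a time."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


phrase = "我是中国人"
print("-".join(unigrams(phrase)))             # 我-是-中-国-人
print("-".join(overlapping_bigrams(phrase)))  # 我是-是中-中国-国人
```

<p>A phrase of five characters yields five unigrams but only four bigrams, because each bigram starts one character after the previous one.</p>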
</div>
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKAnalyzer.html">CJKAnalyzer</a></h4>
<section><p>An <span class="xref">Lucene.Net.Analysis.Analyzer</span> that tokenizes text with <a class="xref" href="Lucene.Net.Analysis.Standard.StandardTokenizer.html">StandardTokenizer</a>,
normalizes content with <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKWidthFilter.html">CJKWidthFilter</a>, folds case with
<a class="xref" href="Lucene.Net.Analysis.Core.LowerCaseFilter.html">LowerCaseFilter</a>, forms bigrams of CJK with <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilter.html">CJKBigramFilter</a>,
and filters stopwords with <a class="xref" href="Lucene.Net.Analysis.Core.StopFilter.html">StopFilter</a>.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilter.html">CJKBigramFilter</a></h4>
<section><p>Forms bigrams of CJK terms that are generated from <a class="xref" href="Lucene.Net.Analysis.Standard.StandardTokenizer.html">StandardTokenizer</a>
or ICUTokenizer.</p>
<p>
CJK types are set by these tokenizers, but you can also use
<a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilter.html#Lucene_Net_Analysis_Cjk_CJKBigramFilter__ctor_Lucene_Net_Analysis_TokenStream_Lucene_Net_Analysis_Cjk_CJKScript_">CJKBigramFilter(TokenStream, CJKScript)</a> to explicitly control which
of the CJK scripts are turned into bigrams.
</p>
<p>
By default, when a CJK character has no adjacent characters to form
a bigram, it is output in unigram form. If you want to always output
both unigrams and bigrams, set the <code>outputUnigrams</code>
flag in <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilter.html#Lucene_Net_Analysis_Cjk_CJKBigramFilter__ctor_Lucene_Net_Analysis_TokenStream_Lucene_Net_Analysis_Cjk_CJKScript_System_Boolean_">CJKBigramFilter(TokenStream, CJKScript, Boolean)</a>.
This can be used for a combined unigram+bigram approach.
</p>
<p>
In all cases, all non-CJK input is passed through unmodified.
</p>
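<p>The effect of the unigram flag can be sketched in Python as follows. This is a simplified model under stated assumptions, not the filter's actual code: <code>cjk_run_to_terms</code> is a hypothetical helper that handles one unbroken run of CJK characters, and it ignores offsets, position increments, and token types. The interleaved output order is an assumption made for illustration.</p>

```python
def cjk_run_to_terms(run, output_unigrams=False):
    """Turn one unbroken run of CJK characters into index terms.

    Hypothetical helper, not part of Lucene.NET.
    """
    if len(run) == 1:
        # A lone character has no neighbour to pair with, so it is
        # emitted as a unigram regardless of the flag.
        return [run]
    terms = []
    for i in range(len(run)):
        if output_unigrams:
            terms.append(run[i])
        if i + 1 != len(run):
            # Pair the current character with the one that follows it.
            terms.append(run[i:i + 2])
    return terms


print(cjk_run_to_terms("中"))                           # ['中']
print(cjk_run_to_terms("中国人"))                        # ['中国', '国人']
print(cjk_run_to_terms("中国人", output_unigrams=True))  # ['中', '中国', '国', '国人', '人']
```

<p>With the flag set, the index holds both term sizes, so single-character queries and two-character queries can each match directly, at the cost of a larger index.</p>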
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilterFactory.html">CJKBigramFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKBigramFilter.html">CJKBigramFilter</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_cjk&quot; class=&quot;solr.TextField&quot;>
  &lt;analyzer>
    &lt;tokenizer class=&quot;solr.StandardTokenizerFactory&quot;/>
    &lt;filter class=&quot;solr.CJKWidthFilterFactory&quot;/>
    &lt;filter class=&quot;solr.LowerCaseFilterFactory&quot;/>
    &lt;filter class=&quot;solr.CJKBigramFilterFactory&quot;
            han=&quot;true&quot; hiragana=&quot;true&quot;
            katakana=&quot;true&quot; hangul=&quot;true&quot; outputUnigrams=&quot;false&quot; />
  &lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKTokenizer.html">CJKTokenizer</a></h4>
<section><p>CJKTokenizer is designed for Chinese, Japanese, and Korean languages.</p>
<p>The tokens returned are overlapping pairs of adjacent characters.
</p>
<p>
Example: &quot;java C1C2C3C4&quot; (where C1 to C4 stand for individual CJK characters) will be segmented to: &quot;java&quot; &quot;C1C2&quot; &quot;C2C3&quot; &quot;C3C4&quot;.
</p>
<p>Additionally, the following is applied to Latin text (such as English):</p>
<ul><li>Text is converted to lowercase.</li><li>Numeric digits, &apos;+&apos;, &apos;#&apos;, and &apos;_&apos; are tokenized as letters.</li><li>Full-width forms are converted to half-width forms.</li></ul>
<p>For more information on Asian language (Chinese, Japanese, and Korean) text segmentation,
search <a href="http://www.google.com/search?q=word+chinese+segment">Google</a>.</p>
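<p>The three rules for Latin text can be approximated with a short Python sketch. It is an assumption-laden illustration rather than the tokenizer's real logic: <code>latin_tokens</code> is a hypothetical helper, the character class covers ASCII only, and NFKC normalization stands in for the full-width to half-width conversion.</p>

```python
import re
import unicodedata

# Letters, digits, '+', '#', and '_' all count as token characters
# (ASCII only in this sketch).
LATIN_TOKEN = re.compile(r"[A-Za-z0-9+#_]+")


def latin_tokens(text):
    """Fold full-width forms, lowercase, and split Latin text into tokens."""
    folded = unicodedata.normalize("NFKC", text)
    return [m.group(0).lower() for m in LATIN_TOKEN.finditer(folded)]


print(latin_tokens("C++ and C# use snake_case"))  # ['c++', 'and', 'c#', 'use', 'snake_case']
```

<p>Treating '+', '#', and '_' as letters keeps terms such as programming language names and identifiers intact instead of splitting them at the symbol.</p>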
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKTokenizerFactory.html">CJKTokenizerFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKTokenizer.html">CJKTokenizer</a>. </p>
<pre><code>&lt;fieldType name=&quot;text_cjk&quot; class=&quot;solr.TextField&quot; positionIncrementGap=&quot;100&quot;>
  &lt;analyzer>
    &lt;tokenizer class=&quot;solr.CJKTokenizerFactory&quot;/>
  &lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKWidthFilter.html">CJKWidthFilter</a></h4>
<section><p>A <span class="xref">Lucene.Net.Analysis.TokenFilter</span> that normalizes CJK width differences:</p>
<ul><li>Folds fullwidth ASCII variants into the equivalent Basic Latin characters.</li><li>Folds halfwidth Katakana variants into the equivalent kana.</li></ul>
<p>
NOTE: this filter can be viewed as a (practical) subset of NFKC/NFKD
Unicode normalization. See the normalization support in the ICU package
for full normalization.
</p>
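<p>Both kinds of folding can be tried out in Python. The sketch below is not the filter's implementation: <code>fold_fullwidth_ascii</code> is a hypothetical helper that relies on the fixed offset between the fullwidth ASCII block and Basic Latin, and the halfwidth Katakana case is shown through the standard library's NFKC normalization, which does more than width folding alone.</p>

```python
import unicodedata

# Fullwidth ASCII variants occupy U+FF01..U+FF5E, exactly 0xFEE0 above
# their Basic Latin counterparts U+0021..U+007E.
FULLWIDTH_ASCII = range(0xFF01, 0xFF5F)
OFFSET = 0xFEE0


def fold_fullwidth_ascii(text):
    """Replace fullwidth ASCII variants with their Basic Latin equivalents."""
    return "".join(
        chr(ord(ch) - OFFSET) if ord(ch) in FULLWIDTH_ASCII else ch
        for ch in text
    )


print(fold_fullwidth_ascii("Ｌｕｃｅｎｅ４．８"))   # Lucene4.8
print(unicodedata.normalize("NFKC", "ｶﾀｶﾅ"))  # カタカナ
```

<p>Applying this kind of folding before lowercasing and bigram formation lets fullwidth and halfwidth spellings of the same text produce the same index terms.</p>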
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKWidthFilterFactory.html">CJKWidthFilterFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Cjk.CJKWidthFilter.html">CJKWidthFilter</a>.</p>
<pre><code>&lt;fieldType name=&quot;text_cjk&quot; class=&quot;solr.TextField&quot;>
  &lt;analyzer>
    &lt;tokenizer class=&quot;solr.StandardTokenizerFactory&quot;/>
    &lt;filter class=&quot;solr.CJKWidthFilterFactory&quot;/>
    &lt;filter class=&quot;solr.LowerCaseFilterFactory&quot;/>
    &lt;filter class=&quot;solr.CJKBigramFilterFactory&quot;/>
  &lt;/analyzer>
&lt;/fieldType></code></pre>
</section>
<h3 id="enums">Enums
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Cjk.CJKScript.html">CJKScript</a></h4>
<section></section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 Licensed to the Apache Software Foundation (ASF)
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>