<!DOCTYPE html>
<!--[if IE]><![endif]-->
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Namespace Lucene.Net.Analysis.Cn.Smart
| Apache Lucene.NET 4.8.0-beta00010 Documentation </title>
<meta name="viewport" content="width=device-width">
<meta name="title" content="Namespace Lucene.Net.Analysis.Cn.Smart
| Apache Lucene.NET 4.8.0-beta00010 Documentation ">
<meta name="generator" content="docfx 2.56.0.0">
<link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css">
<link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css">
<meta property="docfx:navrel" content="toc.html">
<meta property="docfx:tocrel" content="analysis-smartcn/toc.html">
<meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/">
</head>
<body data-spy="scroll" data-target="#affix" data-offset="120">
<div id="wrapper">
<header>
<nav id="autocollapse" class="navbar ng-scope" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="/">
<img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt="">
</a>
</div>
<div class="collapse navbar-collapse" id="navbar">
<form class="navbar-form navbar-right" role="search" id="search">
<div class="form-group">
<input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off">
</div>
</form>
</div>
</div>
</nav>
<div class="subnav navbar navbar-default">
<div class="container hide-when-search">
<ul class="level0 breadcrumb">
<li>
<a href="https://lucenenet.apache.org/docs/4.8.0-beta00009/">API</a>
<span id="breadcrumb">
<ul class="breadcrumb">
<li></li>
</ul>
</span>
</li>
</ul>
</div>
</div>
</header>
<div class="container body-content">
<div id="search-results">
<div class="search-list"></div>
<div class="sr-items">
<p><i class="glyphicon glyphicon-refresh index-loading"></i></p>
</div>
<ul id="pagination"></ul>
</div>
</div>
<div role="main" class="container body-content hide-when-search">
<div class="sidenav hide-when-search">
<a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a>
<div class="sidetoggle collapse" id="sidetoggle">
<div id="sidetoc"></div>
</div>
</div>
<div class="article row grid-right">
<div class="col-md-10">
<article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Cn.Smart">
<h1 id="Lucene_Net_Analysis_Cn_Smart" data-uid="Lucene.Net.Analysis.Cn.Smart" class="text-break">Namespace Lucene.Net.Analysis.Cn.Smart
</h1>
<div class="markdown level0 summary"><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<p>Analyzer for Simplified Chinese, which indexes words.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><div>
<p>Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.</p>
<ul>
<li>StandardAnalyzer: indexes unigrams (individual Chinese characters) as tokens.</li>
<li>CJKAnalyzer (in the analyzers/cjk package): indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.</li>
<li>SmartChineseAnalyzer (in this package): indexes words (attempts to segment Chinese text into words) as tokens.</li>
</ul>
<p>Example phrase: &quot;我是中国人&quot;</p>
<ol>
<li>StandardAnalyzer: 我-是-中-国-人</li>
<li>CJKAnalyzer: 我是-是中-中国-国人</li>
<li>SmartChineseAnalyzer: 我-是-中国-人</li>
</ol>
</div></div>
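<p>The difference between the unigram and bigram outputs above can be reproduced with a few lines of standalone code. The following is a minimal sketch in Python, not Lucene.NET code: the function names <code>unigrams</code> and <code>bigrams</code> are hypothetical and only mimic the token boundaries shown for the example phrase. Word-level segmentation is not reproduced here because it depends on dictionary data.</p>

```python
def unigrams(text):
    """Emit every character as its own token (the unigram view)."""
    return list(text)


def bigrams(text):
    """Emit overlapping pairs of adjacent characters (the bigram view)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


phrase = "我是中国人"
print("-".join(unigrams(phrase)))  # 我-是-中-国-人
print("-".join(bigrams(phrase)))   # 我是-是中-中国-国人
```

<p>Note that this sketch returns no bigrams for input shorter than two characters; how a real analyzer treats such input is outside the scope of the illustration.</p>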
<div class="markdown level0 conceptual"></div>
<div class="markdown level0 remarks"></div>
<h3 id="classes">Classes
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.AnalyzerProfile.html">AnalyzerProfile</a></h4>
<section><p>Manages analysis data configuration for <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a>.</p>
<p>
<a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> ships with a built-in dictionary and stopword list.</p>
<p>
NOTE: To use a dictionary other than the built-in one, put the &quot;bigramdict.dct&quot; and
&quot;coredict.dct&quot; files in a subdirectory of your application named &quot;smartcn-data&quot;. This subdirectory
can be placed in any directory up to and including the root directory (if OS permissions allow).
To place the files in an alternate location, set an environment variable named &quot;smartcn.data.dir&quot;
to the directory that contains the &quot;bigramdict.dct&quot; and &quot;coredict.dct&quot; files.</p>
<p>
The default &quot;bigramdict.dct&quot; and &quot;coredict.dct&quot; files can be found at:
<a href="https://issues.apache.org/jira/browse/LUCENE-1629">https://issues.apache.org/jira/browse/LUCENE-1629</a>.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer.html">HMMChineseTokenizer</a></h4>
<section><p>Tokenizer for Chinese or mixed Chinese-English text.</p>
<p>
The tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text.
The text is first broken into sentences, then each sentence is segmented into words.</p>
</section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizerFactory.html">HMMChineseTokenizerFactory</a></h4>
<section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer.html">HMMChineseTokenizer</a>.</p>
<p>
Note: this class currently emits tokens for punctuation. To remove them, either add
a <span class="xref">Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter</span> after it (with concatenate off), or use the
SmartChinese stoplist with a StopFilterFactory via:</p>
<pre><code>words=&quot;org/apache/lucene/analysis/cn/smart/stopwords.txt&quot;</code></pre>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SentenceTokenizer.html">SentenceTokenizer</a></h4>
<section><p>Tokenizes input text into sentences.</p>
<p>
The output tokens can then be broken into words with <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a>.
</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a></h4>
<section><p>
<a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> is an analyzer for Chinese or mixed Chinese-English text.
The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text.
The text is first broken into sentences, then each sentence is segmented into words.
</p>
<p>
Segmentation is based on a <a href="http://en.wikipedia.org/wiki/Hidden_Markov_Model">Hidden Markov Model</a>.
A large training corpus was used to calculate Chinese word frequency probabilities.
</p>
<p>
This analyzer requires a dictionary to provide statistical data.
<a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> ships with a built-in dictionary.
</p>
<p>
The included dictionary data is from <a href="http://www.ictclas.org">ICTCLAS1.0</a>.
Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License!
</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
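<p>The two-stage idea described above (sentences first, then the most probable word split per sentence) can be illustrated with a toy example. The sketch below is standalone Python and is not the library's algorithm or data: it swaps in a much simpler word-probability model solved by dynamic programming, and <code>WORD_PROB</code>, <code>UNKNOWN_PROB</code>, <code>MAX_WORD_LEN</code>, <code>split_sentences</code>, and <code>segment</code> are all hypothetical names with made-up values chosen only for illustration.</p>

```python
import math
import re

# Hypothetical toy word probabilities; a real dictionary is far larger.
WORD_PROB = {
    "我": 0.05, "是": 0.05, "中": 0.01, "国": 0.01, "人": 0.02,
    "中国": 0.05, "国人": 0.001,
}
UNKNOWN_PROB = 1e-8  # fallback for single characters missing from the toy table
MAX_WORD_LEN = 2     # longest word in the toy table


def split_sentences(text):
    """Stage 1: break text into sentences at sentence-ending punctuation."""
    return [s for s in re.split(r"[。！？!?.]", text) if s]


def segment(sentence):
    """Stage 2: choose the split whose word probabilities have the largest product."""
    n = len(sentence)
    # best[i] holds (log-probability, start index of the last word) for sentence[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            prob = WORD_PROB.get(word)
            if prob is None:
                if len(word) != 1:
                    continue  # unknown multi-character strings are not candidates
                prob = UNKNOWN_PROB
            candidate = (best[start][0] + math.log(prob), start)
            best[end] = max(best[end], candidate)
    # Follow the stored start indexes backwards to recover the words.
    words, end = [], n
    while end != 0:
        start = best[end][1]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


for sentence in split_sentences("我是中国人。"):
    print("-".join(segment(sentence)))  # 我-是-中国-人
```

<p>With these toy values, the split 中国 + 人 scores higher than 中 + 国人 or three single characters, which is why the result matches the word-level output shown for the example phrase on this page.</p>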
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseSentenceTokenizerFactory.html">SmartChineseSentenceTokenizerFactory</a></h4>
<section><p>Factory for the <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SentenceTokenizer.html">SentenceTokenizer</a> used by <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a>.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseWordTokenFilterFactory.html">SmartChineseWordTokenFilterFactory</a></h4>
<section><p>Factory for the <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a> used by <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a>.</p>
<p>
Note: this class currently emits tokens for punctuation. To remove them, either add
a <span class="xref">Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter</span> after it (with concatenate off), or use the
SmartChinese stoplist with a <span class="xref">Lucene.Net.Analysis.Core.StopFilterFactory</span> via:</p>
<pre><code>words=&quot;org/apache/lucene/analysis/cn/smart/stopwords.txt&quot;</code></pre>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.Utility.html">Utility</a></h4>
<section><p><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> utility constants and methods.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a></h4>
<section><p>A <span class="xref">Lucene.Net.Analysis.TokenFilter</span> that breaks sentences into words.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h3 id="enums">Enums
</h3>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.CharType.html">CharType</a></h4>
<section><p>Internal <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> character type constants.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
<h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordType.html">WordType</a></h4>
<section><p>Internal <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> token type constants.</p>
<div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section>
</article>
</div>
<div class="hidden-sm col-md-2" role="complementary">
<div class="sideaffix">
<nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix">
<!-- <p><a class="back-to-top" href="#top">Back to top</a><p> -->
</nav>
</div>
</div>
</div>
</div>
<footer>
<div class="grad-bottom"></div>
<div class="footer">
<div class="container">
<span class="pull-right">
<a href="#top">Back to top</a>
</span>
Copyright © 2020 Licensed to the Apache Software Foundation (ASF)
</div>
</div>
</footer>
</div>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script>
<script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script>
</body>
</html>