| <!DOCTYPE html> |
| <!--[if IE]><![endif]--> |
| <html> |
| |
| <head> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <title>Namespace Lucene.Net.Analysis.Cn.Smart |
| | Apache Lucene.NET 4.8.0-beta00014 Documentation </title> |
| <meta name="viewport" content="width=device-width"> |
| <meta name="title" content="Namespace Lucene.Net.Analysis.Cn.Smart |
| | Apache Lucene.NET 4.8.0-beta00014 Documentation "> |
| <meta name="generator" content="docfx 2.56.2.0"> |
| |
| <link rel="shortcut icon" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/favicon.ico"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.css"> |
| <link rel="stylesheet" href="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.css"> |
| <meta property="docfx:navrel" content="toc.html"> |
| <meta property="docfx:tocrel" content="analysis-smartcn/toc.html"> |
| |
| <meta property="docfx:rel" content="https://lucenenet.apache.org/docs/4.8.0-beta00009/"> |
| |
| </head> |
| <body data-spy="scroll" data-target="#affix" data-offset="120"> |
| <div id="wrapper"> |
| <header> |
| |
| <nav id="autocollapse" class="navbar ng-scope" role="navigation"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| |
| <a class="navbar-brand" href="/"> |
| <img id="logo" class="svg" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/logo/lucene-net-color.png" alt=""> |
| </a> |
| </div> |
| <div class="collapse navbar-collapse" id="navbar"> |
| <form class="navbar-form navbar-right" role="search" id="search"> |
| <div class="form-group"> |
| <input type="text" class="form-control" id="search-query" placeholder="Search" autocomplete="off"> |
| </div> |
| </form> |
| </div> |
| </div> |
| </nav> |
| |
| <div class="subnav navbar navbar-default"> |
| <div class="container hide-when-search"> |
| <ul class="level0 breadcrumb"> |
| <li> |
| <a href="https://lucenenet.apache.org/docs/4.8.0-beta00014/">API</a> |
| <span id="breadcrumb"> |
| <ul class="breadcrumb"> |
| <li></li> |
| </ul> |
| </span> |
| </li> |
| </ul> |
| </div> |
| </div> |
| </header> |
| <div class="container body-content"> |
| |
| <div id="search-results"> |
| <div class="search-list"></div> |
| <div class="sr-items"> |
| <p><i class="glyphicon glyphicon-refresh index-loading"></i></p> |
| </div> |
| <ul id="pagination"></ul> |
| </div> |
| </div> |
| <div role="main" class="container body-content hide-when-search"> |
| |
| <div class="sidenav hide-when-search"> |
| <a class="btn toc-toggle collapse" data-toggle="collapse" href="#sidetoggle" aria-expanded="false" aria-controls="sidetoggle">Show / Hide Table of Contents</a> |
| <div class="sidetoggle collapse" id="sidetoggle"> |
| <div id="sidetoc"></div> |
| </div> |
| </div> |
| <div class="article row grid-right"> |
| <div class="col-md-10"> |
| <article class="content wrap" id="_content" data-uid="Lucene.Net.Analysis.Cn.Smart"> |
| |
| <h1 id="Lucene_Net_Analysis_Cn_Smart" data-uid="Lucene.Net.Analysis.Cn.Smart" class="text-break">Namespace Lucene.Net.Analysis.Cn.Smart |
| </h1> |
| <div class="markdown level0 summary"><!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <p>Analyzer for Simplified Chinese, which indexes words.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div><p>Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.</p> |
| <ul> |
| <li><p>StandardAnalyzer: Indexes unigrams (individual Chinese characters) as tokens.</p> |
| </li> |
| <li><p>CJKAnalyzer (in the <span class="xref">Lucene.Net.Analysis.Cjk</span> namespace of <span class="xref">Lucene.Net.Analysis.Common</span>): Indexes bigrams (overlapping groups of two adjacent Chinese characters) as tokens.</p> |
| </li> |
| <li><p>SmartChineseAnalyzer (in this namespace): Indexes words (attempts to segment Chinese text into words) as tokens.</p> |
| </li> |
| </ul> |
| <p>Example phrase: "我是中国人"</p> |
| <ol> |
| <li><p>StandardAnalyzer: 我-是-中-国-人</p> |
| </li> |
| <li><p>CJKAnalyzer: 我是-是中-中国-国人</p> |
| </li> |
| <li><p>SmartChineseAnalyzer: 我-是-中国-人</p> |
| </li> |
| </ol> |
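| <p>The three strategies above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Lucene.NET API: <code>unigrams</code>, <code>bigrams</code>, <code>max_match</code> and the one-word <code>toy_vocabulary</code> are hypothetical names, and the greedy longest-match lookup merely stands in for the statistical word segmentation that SmartChineseAnalyzer performs.</p> |

```python
def unigrams(text):
    """One token per character (the unigram strategy)."""
    return list(text)


def bigrams(text):
    """Overlapping pairs of adjacent characters (the bigram strategy)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, vocabulary, max_len=4):
    """Greedy longest-match segmentation against a toy vocabulary.

    A simplified stand-in for word segmentation; it is NOT the
    probabilistic method used by SmartChineseAnalyzer.
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


phrase = "我是中国人"
toy_vocabulary = {"中国"}  # hypothetical dictionary with a single entry

print("-".join(unigrams(phrase)))                   # 我-是-中-国-人
print("-".join(bigrams(phrase)))                    # 我是-是中-中国-国人
print("-".join(max_match(phrase, toy_vocabulary)))  # 我-是-中国-人
```

| <p>With this toy vocabulary the three outputs match the example phrase above; a real dictionary and statistical model would be needed for arbitrary text.</p> |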
| </div> |
| <div class="markdown level0 conceptual"></div> |
| <div class="markdown level0 remarks"></div> |
| <h3 id="classes">Classes |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.AnalyzerProfile.html">AnalyzerProfile</a></h4> |
| <section><p>Manages analysis data configuration for <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a>.</p> |
| <p> |
| <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> ships with a built-in dictionary and stopword list. |
| </p> |
| <p> |
| NOTE: To use a dictionary other than the built-in one, put the "bigramdict.dct" and |
| "coredict.dct" files in a subdirectory of your application named "smartcn-data". This subdirectory |
| can be placed in the application directory or any of its parent directories, up to and including the root directory (if OS permissions allow). |
| To place the files in another location, set an environment variable named "smartcn.data.dir" |
| to the directory that contains the "bigramdict.dct" and "coredict.dct" files. |
| </p> |
| <p> |
| The default "bigramdict.dct" and "coredict.dct" files can be found at: |
| <a href="https://issues.apache.org/jira/browse/LUCENE-1629">https://issues.apache.org/jira/browse/LUCENE-1629</a>. |
| </p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
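| <p>The dictionary lookup described for AnalyzerProfile can be pictured with the Python sketch below. It is a hedged illustration, not the library's implementation: the function names are hypothetical, and checking the "smartcn.data.dir" variable before walking up the directory tree is an assumption, since the text above does not state the order.</p> |

```python
import os
from pathlib import Path

# File and directory names taken from the AnalyzerProfile description.
REQUIRED_FILES = ("bigramdict.dct", "coredict.dct")
DATA_DIR_NAME = "smartcn-data"
ENV_VAR = "smartcn.data.dir"


def has_dictionaries(directory):
    """True when both dictionary files exist in the given directory."""
    return all((Path(directory) / name).is_file() for name in REQUIRED_FILES)


def find_data_dir(start_dir, environ=None):
    """Return the directory holding the dictionary files, or None.

    Assumed order: the environment variable first, then "smartcn-data"
    in start_dir and each of its parents up to the root directory.
    """
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR, "")
    if override and has_dictionaries(override):
        return Path(override)

    current = Path(start_dir).resolve()
    for directory in (current, *current.parents):
        candidate = directory / DATA_DIR_NAME
        if has_dictionaries(candidate):
            return candidate
    return None
```

| <p>Calling <code>find_data_dir(".")</code> from an application directory would return the nearest "smartcn-data" folder that contains both files, or <code>None</code> when no such folder exists.</p> |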
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer.html">HMMChineseTokenizer</a></h4> |
| <section><p>Tokenizer for Chinese or mixed Chinese-English text.</p> |
| <p> |
| The tokenizer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. |
| The text is first broken into sentences, then each sentence is segmented into words.</p> |
| </section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizerFactory.html">HMMChineseTokenizerFactory</a></h4> |
| <section><p>Factory for <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.HMMChineseTokenizer.html">HMMChineseTokenizer</a>.</p> |
| <p> |
| Note: this class currently emits tokens for punctuation. To remove them, either add |
| a <span class="xref">Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter</span> after it (with concatenate off), or use the |
| SmartChinese stoplist with a <span class="xref">Lucene.Net.Analysis.Core.StopFilterFactory</span> via:</p> |
| <pre><code>words="org/apache/lucene/analysis/cn/smart/stopwords.txt"</code></pre> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
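| <p>The effect of placing a stop filter after the tokenizer can be illustrated with a short Python sketch. It is not the Lucene.NET API: <code>remove_stopwords</code> is a hypothetical helper, and the small <code>stoplist</code> below is made up for the example rather than copied from the SmartChinese stopwords file.</p> |

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stop set, keeping order."""
    return [token for token in tokens if token not in stopwords]


# Hypothetical tokenizer output that still contains punctuation tokens.
tokens = ["我", "是", "中国", "人", "。"]

# Hypothetical stoplist; the real stopwords file is not reproduced here.
stoplist = frozenset({"，", "。", "！", "？"})

print(remove_stopwords(tokens, stoplist))  # ['我', '是', '中国', '人']
```

| <p>A set is used for the stoplist so that each membership check takes constant time regardless of the list's size.</p> |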
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SentenceTokenizer.html">SentenceTokenizer</a></h4> |
| <section><p>Tokenizes input text into sentences.</p> |
| <p> |
| The output tokens can then be broken into words with <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a>. |
| </p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a></h4> |
| <section><p> |
| <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> is an analyzer for Chinese or mixed Chinese-English text. |
| The analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. |
| The text is first broken into sentences, then each sentence is segmented into words. |
| </p> |
| <p> |
| Segmentation is based on the <a href="http://en.wikipedia.org/wiki/Hidden_Markov_Model">Hidden Markov Model</a>. |
| A large training corpus was used to calculate Chinese word frequency probabilities. |
| </p> |
| <p> |
| This analyzer requires a dictionary to provide statistical data. |
| <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> includes a dictionary out of the box. |
| </p> |
| <p> |
| The included dictionary data is from <a href="http://www.ictclas.org">ICTCLAS1.0</a>. |
| Thanks to ICTCLAS for their hard work, and for contributing the data under the Apache 2 License! |
| </p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseSentenceTokenizerFactory.html">SmartChineseSentenceTokenizerFactory</a></h4> |
| <section><p>Factory for the <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SentenceTokenizer.html">SentenceTokenizer</a>.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseWordTokenFilterFactory.html">SmartChineseWordTokenFilterFactory</a></h4> |
| <section><p>Factory for the <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a>.</p> |
| <p> |
| Note: this class currently emits tokens for punctuation. To remove them, either add |
| a <span class="xref">Lucene.Net.Analysis.Miscellaneous.WordDelimiterFilter</span> after it (with concatenate off), or use the |
| SmartChinese stoplist with a <span class="xref">Lucene.Net.Analysis.Core.StopFilterFactory</span> via:</p> |
| <pre><code>words="org/apache/lucene/analysis/cn/smart/stopwords.txt"</code></pre> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.Utility.html">Utility</a></h4> |
| <section><p><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> utility constants and methods.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordTokenFilter.html">WordTokenFilter</a></h4> |
| <section><p>A <span class="xref">Lucene.Net.Analysis.TokenFilter</span> that breaks sentences into words.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h3 id="enums">Enums |
| </h3> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.CharType.html">CharType</a></h4> |
| <section><p>Internal <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> character type constants.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| <h4><a class="xref" href="Lucene.Net.Analysis.Cn.Smart.WordType.html">WordType</a></h4> |
| <section><p>Internal <a class="xref" href="Lucene.Net.Analysis.Cn.Smart.SmartChineseAnalyzer.html">SmartChineseAnalyzer</a> token type constants.</p> |
| <div class="lucene-block lucene-experimental">This is a Lucene.NET EXPERIMENTAL API, use at your own risk</div></section> |
| </article> |
| </div> |
| |
| <div class="hidden-sm col-md-2" role="complementary"> |
| <div class="sideaffix"> |
| <nav class="bs-docs-sidebar hidden-print hidden-xs hidden-sm affix" id="affix"> |
| <!-- <p><a class="back-to-top" href="#top">Back to top</a><p> --> |
| </nav> |
| </div> |
| </div> |
| </div> |
| </div> |
| |
| <footer> |
| <div class="grad-bottom"></div> |
| <div class="footer"> |
| <div class="container"> |
| <span class="pull-right"> |
| <a href="#top">Back to top</a> |
| </span> |
| Copyright © 2021 The Apache Software Foundation, Licensed under the <a href='http://www.apache.org/licenses/LICENSE-2.0' target='_blank'>Apache License, Version 2.0</a><br> <small>Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation. <br>All other marks mentioned may be trademarks or registered trademarks of their respective owners.</small> |
| |
| </div> |
| </div> |
| </footer> |
| </div> |
| |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.vendor.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/docfx.js"></script> |
| <script type="text/javascript" src="https://lucenenet.apache.org/docs/4.8.0-beta00009/styles/main.js"></script> |
| </body> |
| </html> |