---
active_crumb: Docs
layout: documentation
id: built-in-token-parser
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="col-md-8 second-column">
<section id="overview">
<h2 class="section-title">Overview<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
The {% scaladoc NCTokenParser NCTokenParser %} trait is part of the <a href="api-components.html#model-pipeline">Model Pipeline</a>.
Its implementation parses the plain text of the user input and splits it
into a list of <code>tokens</code>.
NLPCraft provides two token parser implementations for the English language:
<a href="#parser-opennlp">Apache OpenNLP Based Parser</a> and
<a href="#parser-stanford">Stanford NLP Based Parser</a>.
The project also contains example token parser implementations for the
<a href="examples/light_switch_fr.html">French</a> and
<a href="examples/light_switch_ru.html">Russian</a> languages.
</p>
</section>
<section id="parser-opennlp">
<h2 class="section-title">Apache OpenNLP Based Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
The {% scaladoc nlp/parsers/NCOpenNLPTokenParser NCOpenNLPTokenParser %} implementation
is a wrapper around the tokenizer of the
<a href="https://opennlp.apache.org/">Apache OpenNLP</a> project.
</p>
</section>
<section id="parser-stanford">
<h2 class="section-title">Stanford NLP Based Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
The <code>NCStanfordNLPTokenParser</code> implementation
is a wrapper around the tokenizer of the
<a href="https://nlp.stanford.edu/">Stanford NLP</a> project.
</p>
</section>
<section id="remarks">
<h2 class="section-title">Remarks<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
Two English language implementations are provided because their algorithms differ
and they can produce different lists of tokens for the same user input text.
Some built-in components require a token parser instance as a parameter, so the parser should match the components you use:
</p>
<ul>
<li>
If you use <a href="https://opennlp.apache.org/">Apache OpenNLP</a> based components,
use the <a href="#parser-opennlp">Apache OpenNLP based parser</a> in your model pipeline.
</li>
<li>
If you use <a href="https://nlp.stanford.edu/">Stanford NLP</a> based components,
use the <a href="#parser-stanford">Stanford NLP based parser</a> in your model pipeline.
</li>
</ul>
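<p>
To illustrate why mixing parsers and components can matter, here is a minimal, self-contained sketch of two
tokenization strategies applied to the same text. It does not use the NLPCraft, Apache OpenNLP, or Stanford NLP
APIs and does not reproduce their algorithms: the class <code>TokenizerComparison</code> and both methods are
hypothetical names chosen for this example only.
</p>

```java
// Illustrative only: neither method is NLPCraft, OpenNLP, or Stanford NLP code.
class TokenizerComparison {
    // Hypothetical tokenizer A: splits on whitespace only,
    // so punctuation stays attached to the neighboring word.
    static String[] whitespaceTokens(String text) {
        return text.trim().split("\\s+");
    }

    // Hypothetical tokenizer B: pads every punctuation character with spaces
    // before splitting, so each punctuation mark becomes its own token.
    static String[] punctuationAwareTokens(String text) {
        String padded = text.replaceAll("([^\\w\\s])", " $1 ");
        return padded.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "Turn off the lights, please!";
        // prints: Turn|off|the|lights,|please!
        System.out.println(String.join("|", whitespaceTokens(text)));
        // prints: Turn|off|the|lights|,|please|!
        System.out.println(String.join("|", punctuationAwareTokens(text)));
    }
}
```

<p>
The same sentence yields five tokens in one case and seven in the other. A component that expects one set of
token boundaries may behave differently when it receives the other, which is the likely reason for keeping the
parser and the components from the same library.
</p>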
</section>
</div>
<div class="col-md-2 third-column">
<ul class="side-nav">
<li class="side-nav-title">On This Page</li>
<li><a href="#overview">Overview</a></li>
<li><a href="#parser-opennlp">Apache OpenNLP Based Parser</a></li>
<li><a href="#parser-stanford">Stanford NLP Based Parser</a></li>
<li><a href="#remarks">Remarks</a></li>
{% include quick-links.html %}
</ul>
</div>