---
active_crumb: Docs
layout: documentation
id: built-in-token-enricher
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="col-md-8 second-column">
<section id="overview">
<h2 class="section-title">Overview<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc NCTokenEnricher NCTokenEnricher %}
is a component that adds properties to prepared tokens,
such as part of speech, quote flags, stop-word flags, or any other custom values.
NLPCraft provides a default set of token enricher implementations for the English language.
</p>
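<p>
The idea can be illustrated with a minimal, self-contained sketch. The <code>SketchToken</code> and
<code>SketchEnricher</code> types and the <code>capitalized</code> property below are hypothetical stand-ins
invented for this example; they are not the NLPCraft API and only show the general pattern of an enricher
attaching a property to each prepared token.
</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a prepared token: its text plus a mutable property map.
// This is NOT the NLPCraft token API, only an illustration.
final class SketchToken {
    final String text;
    final Map<String, Object> props = new HashMap<>();

    SketchToken(String text) {
        this.text = text;
    }
}

// Hypothetical stand-in for the enricher contract: mutate the tokens in place.
interface SketchEnricher {
    void enrich(List<SketchToken> tokens);
}

final class EnricherSketch {
    // Illustrative enricher that adds a made-up boolean 'capitalized' property.
    static final SketchEnricher CAPITALIZED = tokens -> {
        for (SketchToken t : tokens) {
            boolean cap = !t.text.isEmpty() && Character.isUpperCase(t.text.charAt(0));
            t.props.put("capitalized", cap);
        }
    };

    // Splits on whitespace; a real pipeline would use a proper tokenizer.
    static List<SketchToken> tokenize(String text) {
        List<SketchToken> out = new ArrayList<>();
        for (String s : text.trim().split("\\s+")) {
            if (!s.isEmpty()) {
                out.add(new SketchToken(s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<SketchToken> toks = tokenize("Turn on the Kitchen lights");
        CAPITALIZED.enrich(toks);
        for (SketchToken t : toks) {
            System.out.println(t.text + " -> " + t.props);
        }
    }
}
```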
</section>
<section id="enricher-opennlp-lemmapos">
<h2 class="section-title">Lemma And POS Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCOpenNLPLemmaPosTokenEnricher NCOpenNLPLemmaPosTokenEnricher %}
adds <code>lemma</code> and <code>pos</code> values to each processed token.
See these links for more details:
<a href="https://en.wikipedia.org/wiki/Lemma_(morphology)">Lemma</a> and
<a href="https://en.wikipedia.org/wiki/Part-of-speech_tagging">Part of speech</a>.
The current implementation is based on <a href="https://opennlp.apache.org/">Apache OpenNLP</a> project components.
It uses Apache OpenNLP models; POS tagger models are available
<a href="http://opennlp.sourceforge.net/models-1.5/">here</a>.
An English lemmatization dictionary is available <a href="https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict">here</a>.
You can use any models that are compatible with the Apache OpenNLP <a href="https://opennlp.apache.org/docs/2.0.0/apidocs/opennlp-tools/opennlp/tools/postag/POSTaggerME.html">POSTaggerME</a> and
<a href="https://opennlp.apache.org/docs/2.0.0/apidocs/opennlp-tools/opennlp/tools/lemmatizer/DictionaryLemmatizer.html">DictionaryLemmatizer</a> components.
</p>
</section>
<section id="enricher-opennlp-bracket">
<h2 class="section-title">Brackets Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCEnBracketsTokenEnricher NCEnBracketsTokenEnricher %}
adds a <code>brackets</code> boolean flag to each processed token.
</p>
</section>
<section id="enricher-opennlp-quotes">
<h2 class="section-title">Quotes Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCEnQuotesTokenEnricher NCEnQuotesTokenEnricher %}
adds a <code>quoted</code> boolean flag to each processed token.
</p>
</section>
<section id="enricher-opennlp-dict">
<h2 class="section-title">Dictionary Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCEnDictionaryTokenEnricher NCEnDictionaryTokenEnricher %}
adds a <code>dict</code> boolean flag to each processed token.
Note that it requires the <code>lemma</code> token property to be already defined.
You can use {% scaladoc nlp/enrichers/NCOpenNLPLemmaPosTokenEnricher NCOpenNLPLemmaPosTokenEnricher %} or any other component that sets
<code>lemma</code> on the token. That component must appear in the model pipeline's token enricher list before
{% scaladoc nlp/enrichers/NCEnDictionaryTokenEnricher NCEnDictionaryTokenEnricher %}.
</p>
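<p>
The ordering requirement can be sketched as follows. Everything in this example is an assumption made for
illustration: <code>OrderToken</code>, <code>OrderEnricher</code>, the toy lemmatizer, the three-word dictionary,
and the choice to fail fast when <code>lemma</code> is missing are all hypothetical and are not the NLPCraft
implementation. The sketch only shows why the lemma-producing enricher has to be listed first.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical token: text plus a mutable property map (not the NLPCraft API).
final class OrderToken {
    final String text;
    final Map<String, Object> props = new HashMap<>();

    OrderToken(String text) {
        this.text = text;
    }
}

// Hypothetical enricher contract used only by this sketch.
interface OrderEnricher {
    void enrich(List<OrderToken> tokens);
}

final class PipelineOrderSketch {
    // Tiny made-up dictionary of known lemmas.
    static final Set<String> DICTIONARY =
        new HashSet<>(Arrays.asList("light", "kitchen", "turn"));

    // Toy lemmatizer: lowercases and strips a trailing 's'. Stands in for a real lemma enricher.
    static final OrderEnricher LEMMA = tokens -> {
        for (OrderToken t : tokens) {
            String w = t.text.toLowerCase(Locale.ROOT);
            if (w.length() > 3 && w.endsWith("s")) {
                w = w.substring(0, w.length() - 1);
            }
            t.props.put("lemma", w);
        }
    };

    // Sets 'dict' from the lemma; this sketch chooses to fail fast if the lemma is absent.
    static final OrderEnricher DICT = tokens -> {
        for (OrderToken t : tokens) {
            Object lemma = t.props.get("lemma");
            if (lemma == null) {
                throw new IllegalStateException("'lemma' property is missing for: " + t.text);
            }
            t.props.put("dict", DICTIONARY.contains(lemma));
        }
    };

    // Applies the enrichers strictly in list order, as a pipeline would.
    static List<OrderToken> run(List<OrderEnricher> enrichers, String... words) {
        List<OrderToken> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(new OrderToken(w));
        }
        for (OrderEnricher e : enrichers) {
            e.enrich(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Correct order: the lemma enricher runs before the dictionary enricher.
        for (OrderToken t : run(Arrays.asList(LEMMA, DICT), "Kitchen", "lights", "xyzzy")) {
            System.out.println(t.text + " -> " + t.props);
        }
    }
}
```

<p>
Swapping the two enrichers in the list makes the sketch throw, because the dictionary check runs before any
<code>lemma</code> value exists.
</p>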
</section>
<section id="enricher-opennlp-stopword">
<h2 class="section-title">Stop-words Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCEnStopWordsTokenEnricher NCEnStopWordsTokenEnricher %}
adds a <code>stopword</code> boolean flag to each processed token.
It is based on predefined rules for the English language, and it can be extended with a custom list of additional stop words and a list of excluded words.
Note that it requires the <code>lemma</code> token property to be already defined.
You can use {% scaladoc nlp/enrichers/NCOpenNLPLemmaPosTokenEnricher NCOpenNLPLemmaPosTokenEnricher %} or any other component that sets
<code>lemma</code> on the token. That component must appear in the model pipeline's token enricher list before
{% scaladoc nlp/enrichers/NCEnStopWordsTokenEnricher NCEnStopWordsTokenEnricher %}.
</p>
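<p>
How additional and excluded word lists might interact with a built-in rule set can be sketched as below.
This is a simplified illustration under stated assumptions: the <code>StopWordsSketch</code> class, its
constructor parameters, the small base word set, and the rule that exclusions take precedence are all
hypothetical. The real component relies on predefined English rules rather than a plain word set.
</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class StopWordsSketch {
    // Made-up base set standing in for the predefined English rules.
    private static final Set<String> BASE =
        new HashSet<>(Arrays.asList("a", "an", "the", "on", "of", "to"));

    private final Set<String> added;
    private final Set<String> excluded;

    // 'added' are extra stop words; 'excluded' are words that must never be flagged.
    StopWordsSketch(Set<String> added, Set<String> excluded) {
        this.added = new HashSet<>(added);
        this.excluded = new HashSet<>(excluded);
    }

    // Assumption of this sketch: an exclusion overrides both the base set and the added set.
    boolean isStopWord(String lemma) {
        if (excluded.contains(lemma)) {
            return false;
        }
        return BASE.contains(lemma) || added.contains(lemma);
    }

    // Tokens are modeled as plain property maps; 'lemma' must already be set.
    void enrich(List<Map<String, Object>> tokens) {
        for (Map<String, Object> t : tokens) {
            Object lemma = t.get("lemma");
            if (!(lemma instanceof String)) {
                throw new IllegalStateException("'lemma' must be set before stop-word enrichment");
            }
            t.put("stopword", isStopWord((String) lemma));
        }
    }

    public static void main(String[] args) {
        StopWordsSketch sw = new StopWordsSketch(
            new HashSet<>(Arrays.asList("please")),
            new HashSet<>(Arrays.asList("on")));

        List<Map<String, Object>> tokens = new ArrayList<>();
        for (String lemma : Arrays.asList("please", "turn", "on", "the", "light")) {
            Map<String, Object> t = new HashMap<>();
            t.put("lemma", lemma);
            tokens.add(t);
        }

        sw.enrich(tokens);
        for (Map<String, Object> t : tokens) {
            System.out.println(t);
        }
    }
}
```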
</section>
<section id="enricher-opennlp-swearword">
<h2 class="section-title">Swear-words Enricher<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
{% scaladoc nlp/enrichers/NCEnSwearWordsTokenEnricher NCEnSwearWordsTokenEnricher %}
adds a <code>swear</code> boolean flag to each processed token.
</p>
</section>
</div>
<div class="col-md-2 third-column">
<ul class="side-nav">
<li class="side-nav-title">On This Page</li>
<li><a href="#overview">Overview</a></li>
<li><a href="#enricher-opennlp-lemmapos">Lemma And POS Enricher</a></li>
<li><a href="#enricher-opennlp-bracket">Brackets Enricher</a></li>
<li><a href="#enricher-opennlp-quotes">Quotes Enricher</a></li>
<li><a href="#enricher-opennlp-dict">Dictionary Enricher</a></li>
<li><a href="#enricher-opennlp-stopword">Stop-words Enricher</a></li>
<li><a href="#enricher-opennlp-swearword">Swear-words Enricher</a></li>
{% include quick-links.html %}
</ul>
</div>