---
active_crumb: Docs
layout: documentation
id: custom-components
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="col-md-8 second-column">
<section id="overview">
<h2 class="section-title">Custom components <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
NLPCraft provides a number of useful built-in components which allow you to solve a wide range of tasks
without coding.
However, you may need to extend the provided functionality and develop your own components.
Let's look at how and when to do this for each kind of component, step by step.
</p>
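<p>
All of the components described below are plugged into the model's pipeline. As a rough orientation, the sketch
below shows how custom components developed on this page could be registered. Note that the
<code>NCPipelineBuilder</code> and <code>NCModelAdapter</code> classes and the exact builder method names used here
are assumptions - verify them against the API javadoc.
</p>
<pre class="brush: scala">
import org.apache.nlpcraft.*

// Rough sketch only: the builder class and its 'withXxx()'/'build()' methods are
// assumed from the standard API - check the javadoc for the exact signatures.
// NCFrTokenParser and CommentsEntityParser are developed further down this page.
val pipeline = new NCPipelineBuilder().
    withTokenParser(new NCFrTokenParser).
    withEntityParser(new CommentsEntityParser).
    build()

// Hypothetical model wired with the custom pipeline; the model ID, name and
// version are placeholders.
class MyModel extends NCModelAdapter(
    NCModelConfig("my.model.id", "My Model", "1.0"),
    pipeline
)
</pre>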
</section>
<section id="token-parser">
<h2 class="section-title">Token parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenParser.html">NCTokenParser</a> trait.
</p>
<p>
It is not often that you need to prepare your own language tokenizer.
Mostly it is necessary when you want to work with a new language.
You only have to prepare the implementation once and can then reuse it for all projects in this language.
Usually you just need to find an open source solution and wrap it with the
<a href="apis/latest/org/apache/nlpcraft/NCTokenParser.html">NCTokenParser</a> trait.
</p>
<pre class="brush: scala, highlight: [2, 6]">
import org.apache.nlpcraft.*
import org.languagetool.tokenizers.fr.FrenchWordTokenizer

import scala.jdk.CollectionConverters.*

class NCFrTokenParser extends NCTokenParser:
    private val tokenizer = new FrenchWordTokenizer

    override def tokenize(text: String): List[NCToken] =
        val toks = collection.mutable.ArrayBuffer.empty[NCToken]
        var sumLen = 0

        for ((word, idx) <- tokenizer.tokenize(text).asScala.zipWithIndex)
            val start = sumLen
            val end = sumLen + word.length

            if word.strip.nonEmpty then
                toks += new NCPropertyMapAdapter with NCToken:
                    override def getText: String = word
                    override def getIndex: Int = idx
                    override def getStartCharIndex: Int = start
                    override def getEndCharIndex: Int = end

            sumLen = end

        toks.toList
</pre>
<ul>
<li>
<code>NCFrTokenParser</code> is a simple wrapper which implements <code>NCTokenParser</code> based on
the open source <a href="https://languagetool.org">Language Tool</a> library.
</li>
</ul>
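<p>
The parser can be tried on its own before it is plugged into a pipeline. The following sketch (the sample
French sentence is purely illustrative) tokenizes a phrase and prints each token with its character range:
</p>
<pre class="brush: scala">
// Quick standalone check of the NCFrTokenParser defined above.
@main def tryFrTokenParser(): Unit =
    val toks = new NCFrTokenParser().tokenize("Quelle heure est-il maintenant ?")

    // Print each token's index, text and character range.
    for t <- toks do
        println(s"${t.getIndex}: '${t.getText}' [${t.getStartCharIndex}, ${t.getEndCharIndex}]")
</pre>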
</section>
<section id="token-enricher">
<h2 class="section-title">Token enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenEnricher.html">NCTokenEnricher</a> trait.
</p>
<p>
A token enricher is a component which allows you to add additional properties to prepared tokens.
These token properties are used later during entity detection.
</p>
<pre class="brush: scala, highlight: [25, 26]">
import org.apache.nlpcraft.*
import org.languagetool.AnalyzedToken
import org.languagetool.tagging.ru.RussianTagger

import scala.jdk.CollectionConverters.*

class NCRuLemmaPosTokenEnricher extends NCTokenEnricher:
    private def nvl(v: String, dflt: => String): String = if v != null then v else dflt

    override def enrich(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
        val tags = RussianTagger.INSTANCE.tag(toks.map(_.getText).asJava).asScala

        require(toks.size == tags.size)

        toks.zip(tags).foreach { case (tok, tag) =>
            val readings = tag.getReadings.asScala

            val (lemma, pos) = readings.size match
                // No data. Lemma is word as is, POS is undefined.
                case 0 => (tok.getText, "")
                // Takes first. Other variants ignored.
                case _ =>
                    val aTok: AnalyzedToken = readings.head
                    (nvl(aTok.getLemma, tok.getText), nvl(aTok.getPOSTag, ""))
            tok.put("pos", pos)
            tok.put("lemma", lemma)

            () // Otherwise NPE.
        }
</pre>
<ul>
<li>
<code>Lines 25 and 26</code> enrich <a href="apis/latest/org/apache/nlpcraft/NCToken.html">NCToken</a>
with two new properties which can be used later for <a href="intent-matching.html">Intent matching</a>.
</li>
</ul>
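<p>
Once the enricher has run, any later pipeline stage can read these properties back through the token's
property map. A small illustrative sketch, assuming the tokens have already passed through
<code>NCRuLemmaPosTokenEnricher</code>:
</p>
<pre class="brush: scala">
import org.apache.nlpcraft.*

// Illustrative only: print the 'lemma' and 'pos' properties added by the enricher above.
def dumpLemmaPos(toks: List[NCToken]): Unit =
    for t <- toks do
        if t.contains("lemma") && t.contains("pos") then
            val lemma: String = t("lemma")
            val pos: String = t("pos")
            println(s"${t.getText}: lemma=$lemma, pos=$pos")
</pre>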
</section>
<section id="token-validator">
<h2 class="section-title">Token validator <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenValidator.html">NCTokenValidator</a> trait.
</p>
<p>
A token validator is a user-defined component where prepared tokens are inspected and an exception can be thrown from user code to stop user input processing.
</p>
<pre class="brush: scala, highlight: [3]">
new NCTokenValidator:
    override def validate(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
        if toks.exists(_.contains("restrictionFlag"))
        then throw new NCException("Sentence cannot be processed.")
</pre>
<ul>
<li>
An anonymous instance of <a href="apis/latest/org/apache/nlpcraft/NCTokenValidator.html">NCTokenValidator</a>
is created.
</li>
<li>
<code>Line 3</code> defines the rule for when an exception should be thrown and sentence processing stopped.
</li>
</ul>
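<p>
Validators can also build on the properties added by token enrichers earlier in the pipeline. For example,
the following sketch rejects a request when any token's <code>lemma</code> (as set by the enricher example in
the previous section) is on a ban list. The ban list itself is hypothetical:
</p>
<pre class="brush: scala">
// Hypothetical ban list - replace with your own data.
val banned = Set("example-banned-lemma")

new NCTokenValidator:
    override def validate(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
        // The 'lemma' property is expected to be set by a token enricher earlier in the pipeline.
        if toks.exists(t => t.contains("lemma") && banned.contains(t("lemma"))) then
            throw new NCException("Request contains prohibited words.")
</pre>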
</section>
<section id="entity-parser">
<h2 class="section-title">Entity parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityParser.html">NCEntityParser</a> trait.
</p>
<p>
The entity parser is the most important component: it finds user-specific data in the input.
The detected entities are the input for <a href="intent-matching.html">Intent matching</a> conditions.
If the built-in <a href="apis/latest/org/apache/nlpcraft/nlp/parsers/NCSemanticEntityParser.html">NCSemanticEntityParser</a>
is not enough, you can implement your own NER search here.
This is the point for potential integration with neural networks or any other solutions which
help you find and mark your domain-specific named entities.
</p>
<pre class="brush: scala, highlight: [5]">
import org.apache.nlpcraft.*

class CommentsEntityParser extends NCEntityParser:
    def parse(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): List[NCEntity] =
        if req.getText.trim.startsWith("--") then
            List(
                new NCPropertyMapAdapter with NCEntity:
                    override def getTokens: List[NCToken] = toks
                    override def getRequestId: String = req.getRequestId
                    override def getId: String = "comment"
            )
        else
            List.empty
</pre>
<ul>
<li>
In the given example the whole input sentence is marked as a single <code>comment</code> element if the
condition defined on <code>line 5</code> is <code>true</code>.
</li>
</ul>
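<p>
An entity parser does not have to mark the whole sentence. As a contrast to the example above, the following
sketch creates a separate entity for every purely numeric token; the <code>number</code> ID is illustrative:
</p>
<pre class="brush: scala">
// Illustrative only: mark each purely numeric token as its own 'number' entity.
new NCEntityParser:
    def parse(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): List[NCEntity] =
        toks.filter(_.getText.forall(_.isDigit)).map(t =>
            new NCPropertyMapAdapter with NCEntity:
                override def getTokens: List[NCToken] = List(t)
                override def getRequestId: String = req.getRequestId
                override def getId: String = "number"
        )
</pre>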
</section>
<section id="entity-enricher">
<h2 class="section-title">Entity enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityEnricher.html">NCEntityEnricher</a> trait.
</p>
<p>
An entity enricher is a component which allows you to add additional properties to prepared entities.
It can be useful for extending the functionality of existing entity enrichers.
</p>
<pre class="brush: scala, highlight: [4, 10, 11]">
import org.apache.nlpcraft.*

object CityPopulationEntityEnricher:
    val citiesPopulation: Map[String, Int] = someExternalService.getCitiesPopulation()

import CityPopulationEntityEnricher.*
class CityPopulationEntityEnricher extends NCEntityEnricher:
    def enrich(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
        ents.
            filter(_.getId == "city").
            foreach(e => e.put("city:population", citiesPopulation(e("city:name"))))
</pre>
<ul>
<li>
<code>Line 4</code> gets city population data from some external service.
</li>
<li>
<code>Line 10</code> filters entities by <code>ID</code>.
</li>
<li>
<code>Line 11</code> enriches the entities with a new <code>city:population</code> property.
</li>
</ul>
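<p>
If the external data may not cover every detected city, a slightly more defensive variant avoids failing on
unknown keys. This sketch reuses the same illustrative <code>citiesPopulation</code> map from the example above:
</p>
<pre class="brush: scala">
import org.apache.nlpcraft.*
import CityPopulationEntityEnricher.*

// Defensive variant of the enricher above: cities missing from the illustrative
// 'citiesPopulation' map are silently skipped instead of causing a lookup failure.
class SafeCityPopulationEntityEnricher extends NCEntityEnricher:
    def enrich(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
        for
            e <- ents if e.getId == "city"
            population <- citiesPopulation.get(e("city:name"))
        do
            e.put("city:population", population)
</pre>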
</section>
<section id="entity-mapper">
<h2 class="section-title">Entity mapper <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityMapper.html">NCEntityMapper</a> trait.
</p>
<p>
An entity mapper is a component which allows you to map one set of entities into another after the entities
have been parsed and enriched. It can be useful for building complex parsers based on existing ones.
</p>
<pre class="brush: scala, highlight: [4, 10, 12, 13, 14]">
import org.apache.nlpcraft.*

object CityPopulationEntityMapper:
    val citiesPopulation: Map[String, Int] = externalService.getCitiesPopulation()

import CityPopulationEntityMapper.*

class CityPopulationEntityMapper extends NCEntityMapper:
    def map(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): List[NCEntity] =
        val cities = ents.filter(_.getId == "city")

        ents.filterNot(_.getId == "city") ++
        cities ++
        cities.filter(city => citiesPopulation(city("city:name")) > 1000000).
            map(city =>
                new NCPropertyMapAdapter with NCEntity:
                    override def getTokens: List[NCToken] = city.getTokens
                    override def getRequestId: String = req.getRequestId
                    override def getId: String = "big-city"
            )
</pre>
<ul>
<li>
<code>Line 4</code> gets city population data from some external service.
</li>
<li>
<code>Line 10</code> filters entities by <code>ID</code>.
</li>
<li>
<code>Lines 12, 13 and 14</code> define the component's resulting entity set.
It contains the previously detected <code>city</code> entities, the new <code>big-city</code> entities and
all other non-city entities.
</li>
</ul>
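<p>
A mapper can also replace entities instead of adding new ones. For example, the following sketch re-labels
every <code>city</code> entity as a <code>location</code> entity and leaves all other entities untouched; both
IDs are illustrative:
</p>
<pre class="brush: scala">
// Illustrative only: 1:1 re-labeling of 'city' entities as 'location' entities.
new NCEntityMapper:
    def map(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): List[NCEntity] =
        def relabel(e: NCEntity): NCEntity =
            new NCPropertyMapAdapter with NCEntity:
                override def getTokens: List[NCToken] = e.getTokens
                override def getRequestId: String = req.getRequestId
                override def getId: String = "location"

        ents.map(e => if e.getId == "city" then relabel(e) else e)
</pre>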
</section>
<section id="entity-validator">
<h2 class="section-title">Entity validator <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityValidator.html">NCEntityValidator</a> trait.
</p>
<p>
An entity validator is a user-defined component where prepared entities are inspected and exceptions
can be thrown from user code to stop user input processing.
</p>
<pre class="brush: scala, highlight: [3]">
new NCEntityValidator:
    override def validate(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
        if ents.exists(_.getId == "restrictedID")
        then throw new NCException("Sentence cannot be processed.")
</pre>
<ul>
<li>
An anonymous instance of <a href="apis/latest/org/apache/nlpcraft/NCEntityValidator.html">NCEntityValidator</a>
is created.
</li>
<li>
<code>Line 3</code> defines the rule for when an exception should be thrown and sentence processing stopped.
</li>
</ul>
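<p>
Validation rules can be as model-specific as needed. For example, the sketch below rejects requests that
mention more than one city; the entity ID and the limit are illustrative:
</p>
<pre class="brush: scala">
// Illustrative only: reject ambiguous requests with more than one 'city' entity.
new NCEntityValidator:
    override def validate(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
        if ents.count(_.getId == "city") > 1 then
            throw new NCException("Please mention only one city per request.")
</pre>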
</section>
<section id="variant-filter">
<h2 class="section-title">Variant filter <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCVariantFilter.html">NCVariantFilter</a> trait.
</p>
<p>
A variant filter is a component which allows you to filter the detected variants, rejecting undesirable ones.
</p>
<pre class="brush: scala, highlight: [3]">
new NCVariantFilter:
    def filter(req: NCRequest, cfg: NCModelConfig, vars: List[NCVariant]): List[NCVariant] =
        vars.filter(_.getEntities.exists(_.getId == "requiredID"))
</pre>
<ul>
<li>
An anonymous instance of <a href="apis/latest/org/apache/nlpcraft/NCVariantFilter.html">NCVariantFilter</a>
is created.
</li>
<li>
<code>Line 3</code> defines the variant filter:
it passes only the variants which contain <code>requiredID</code> entities.
</li>
</ul>
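<p>
Another possible filtering heuristic is to keep only the "richest" variants, i.e. the ones with the largest
number of detected entities. A minimal sketch:
</p>
<pre class="brush: scala">
// Illustrative heuristic: keep only the variants with the maximum number of
// detected entities and drop all less complete alternatives.
new NCVariantFilter:
    def filter(req: NCRequest, cfg: NCModelConfig, vars: List[NCVariant]): List[NCVariant] =
        if vars.isEmpty then vars
        else
            val maxSize = vars.map(_.getEntities.size).max
            vars.filter(_.getEntities.size == maxSize)
</pre>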
</section>
</div>
<div class="col-md-2 third-column">
<ul class="side-nav">
<li class="side-nav-title">On This Page</li>
<li><a href="#overview">Overview</a></li>
<li><a href="#token-parser">Token parser</a></li>
<li><a href="#token-enricher">Token enricher</a></li>
<li><a href="#token-validator">Token validator</a></li>
<li><a href="#entity-parser">Entity parser</a></li>
<li><a href="#entity-enricher">Entity enricher</a></li>
<li><a href="#entity-mapper">Entity mapper</a></li>
<li><a href="#entity-validator">Entity validator</a></li>
<li><a href="#variant-filter">Variant filter</a></li>
{% include quick-links.html %}
</ul>
</div>