---
active_crumb: Docs
layout: documentation
id: custom-components
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<div class="col-md-8 second-column">
<section id="overview">
<h2 class="section-title">Custom components <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
NLPCraft provides a number of useful built-in components that allow you to solve a wide range of tasks
without writing any code.
However, you may need to extend the provided functionality and develop your own components.
Let's look at how to do this for each kind of component, and when it can be useful.
</p>
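<p>
All of these components are plugged into the model's pipeline. As a rough sketch of where they fit
(the <code>My*</code> component names are placeholders for your own implementations, and the exact
builder methods shown here are an assumption and may differ between versions):
</p>
<pre class="brush: scala">
import org.apache.nlpcraft.*
// Assembles a pipeline from the kinds of custom components described below.
val pipeline = new NCPipelineBuilder().
    withTokenParser(new MyTokenParser()).     // Splits text into tokens.
    withTokenEnricher(new MyTokenEnricher()). // Adds properties to tokens.
    withEntityParser(new MyEntityParser()).   // Detects entities from tokens.
    build()
</pre>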
</section>
<section id="token-parser">
<h2 class="section-title">Token parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCTokenParser NCTokenParser %} trait.
</p>
<p>
It is not often that you need to prepare your own language tokenizer.
Mostly it is necessary when you want to work with a new language.
You have to prepare a new implementation once, and then you can use it for all projects in that language.
Usually you can just find an open source solution and wrap it.
</p>
<pre class="brush: scala, highlight: [2, 6]">
import org.apache.nlpcraft.*
import org.languagetool.tokenizers.fr.FrenchWordTokenizer
import scala.jdk.CollectionConverters.*
class NCFrTokenParser extends NCTokenParser:
private val tokenizer = new FrenchWordTokenizer
override def tokenize(text: String): List[NCToken] =
val toks = collection.mutable.ArrayBuffer.empty[NCToken]
var sumLen = 0
for ((word, idx) <- tokenizer.tokenize(text).asScala.zipWithIndex)
val start = sumLen
val end = sumLen + word.length
if word.strip.nonEmpty then
toks += new NCPropertyMapAdapter with NCToken:
override def getText: String = word
override def getIndex: Int = idx
override def getStartCharIndex: Int = start
override def getEndCharIndex: Int = end
sumLen = end
toks.toList
</pre>
<ul>
<li>
<code>NCFrTokenParser</code> is a simple wrapper that implements the
{% scaladoc NCTokenParser NCTokenParser %} methods on top of the
open source <a href="https://languagetool.org">LanguageTool</a> library.
</li>
</ul>
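<p>
As a quick usage sketch (assuming the LanguageTool dependency is on the classpath),
the parser above can be exercised directly:
</p>
<pre class="brush: scala">
val parser = new NCFrTokenParser()
// Prints each token with its index and character offsets (end index is exclusive).
for tok <- parser.tokenize("Bonjour le monde") do
    println(s"${tok.getIndex}: '${tok.getText}' [${tok.getStartCharIndex}, ${tok.getEndCharIndex})")
</pre>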
</section>
<section id="token-enricher">
<h2 class="section-title">Token enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCTokenEnricher NCTokenEnricher %} trait.
</p>
<p>
A token enricher is a component that allows you to add additional properties to prepared tokens.
In subsequent pipeline processing steps you can define entity detection conditions based on these token properties.
</p>
<pre class="brush: scala, highlight: [25, 26]">
import org.apache.nlpcraft.*
import org.languagetool.AnalyzedToken
import org.languagetool.tagging.ru.RussianTagger
import scala.jdk.CollectionConverters.*
class NCRuLemmaPosTokenEnricher extends NCTokenEnricher:
private def nvl(v: String, dflt: => String): String = if v != null then v else dflt
override def enrich(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
val tags = RussianTagger.INSTANCE.tag(toks.map(_.getText).asJava).asScala
require(toks.size == tags.size)
toks.zip(tags).foreach { case (tok, tag) =>
val readings = tag.getReadings.asScala
val (lemma, pos) = readings.size match
// No data. Lemma is word as is, POS is undefined.
case 0 => (tok.getText, "")
// Takes first. Other variants ignored.
case _ =>
val aTok: AnalyzedToken = readings.head
(nvl(aTok.getLemma, tok.getText), nvl(aTok.getPOSTag, ""))
tok.put("pos", pos)
tok.put("lemma", lemma)
() // Explicitly return Unit; otherwise a null returned by 'put' causes an NPE.
}
</pre>
<ul>
<li>
<code>Lines 25 and 26</code> enrich the {% scaladoc NCToken NCToken %}
with two new properties which can be used for <a href="intent-matching.html">intent matching</a> later.
</li>
</ul>
</section>
<section id="token-validator">
<h2 class="section-title">Token validator<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCTokenValidator NCTokenValidator %} trait.
</p>
<p>
This component is designed for token inspection; an exception can be thrown from user code to stop processing of the user input.
</p>
<pre class="brush: scala, highlight: [3]">
new NCTokenValidator:
override def validate(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
if toks.exists(_.contains("restrictionFlag"))
then throw new NCException("Sentence cannot be processed.")
</pre>
<ul>
<li>
An anonymous instance of {% scaladoc NCTokenValidator NCTokenValidator %}
is created.
</li>
<li>
<code>Line 3</code> defines the rule for when an exception should be thrown and sentence processing should be stopped.
</li>
</ul>
</section>
<section id="entity-parser">
<h2 class="section-title">Entity parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCEntityParser NCEntityParser %} trait.
</p>
<p>
This is the most important kind of component: it finds user-specific data.
The detected entities are the input for <a href="intent-matching.html">intent matching</a> conditions.
You can implement your own custom logic for named entity detection here.
It is also the place for potential integrations with neural networks or any other solutions that
help you find and mark your domain-specific named entities.
</p>
<pre class="brush: scala, highlight: [5]">
import org.apache.nlpcraft.*
class CommentsEntityParser extends NCEntityParser:
override def parse(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): List[NCEntity] =
if req.getText.trim.startsWith("--") then
List(
new NCPropertyMapAdapter with NCEntity:
override def getTokens: List[NCToken] = toks
override def getRequestId: String = req.getRequestId
override def getId: String = "comment"
)
else
List.empty
</pre>
<ul>
<li>
In the given example the whole input sentence is marked as a single <code>comment</code> element if
the condition defined on <code>line 5</code> is <code>true</code>.
</li>
</ul>
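<p>
As a usage sketch, where <code>req</code>, <code>cfg</code> and <code>toks</code> stand for the request,
model configuration and tokens already prepared by the pipeline:
</p>
<pre class="brush: scala">
val parser = new CommentsEntityParser()
// For an input like "-- just a note" the whole sentence becomes one 'comment' entity.
val ents = parser.parse(req, cfg, toks)
ents.foreach(e => println(e.getId))
</pre>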
</section>
<section id="entity-enricher">
<h2 class="section-title">Entity enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCEntityEnricher NCEntityEnricher %} trait.
</p>
<p>
An entity enricher is a component that allows you to add additional properties to prepared entities.
It can be useful for extending the functionality of existing entity enrichers.
</p>
<pre class="brush: scala, highlight: [4, 11, 12]">
import org.apache.nlpcraft.*
object CityPopulationEntityEnricher:
val citiesPopulation: Map[String, Int] = someExternalService.getCitiesPopulation()
import CityPopulationEntityEnricher.*
class CityPopulationEntityEnricher extends NCEntityEnricher:
override def enrich(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
ents.
filter(_.getId == "city").
foreach(e => e.put("city:population", citiesPopulation(e("city:name"))))
</pre>
<ul>
<li>
<code>Line 4</code> fetches city population data from an external service.
</li>
<li>
<code>Line 11</code> filters entities by <code>ID</code>.
</li>
<li>
<code>Line 12</code> enriches entities with the new <code>city:population</code> property.
</li>
</ul>
</section>
<section id="entity-mapper">
<h2 class="section-title">Entity mapper<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCEntityMapper NCEntityMapper %} trait.
</p>
<p>
An entity mapper is a component that allows you to map one set of entities to another after the entities
have been parsed and enriched. It can be useful for building complex parsers based on existing ones.
</p>
<pre class="brush: scala, highlight: [4, 10, 12, 13, 14]">
import org.apache.nlpcraft.*
object CityPopulationEntityMapper:
val citiesPopulation: Map[String, Int] = externalService.getCitiesPopulation()
import CityPopulationEntityMapper.*
class CityPopulationEntityMapper extends NCEntityMapper:
override def map(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): List[NCEntity] =
val cities = ents.filter(_.getId == "city")
ents.filterNot(_.getId == "city") ++
cities ++
cities.filter(city => citiesPopulation(city("city:name")) > 1000000).
map(city =>
new NCPropertyMapAdapter with NCEntity:
override def getTokens: List[NCToken] = city.getTokens
override def getRequestId: String = req.getRequestId
override def getId: String = "big-city"
)
</pre>
<ul>
<li>
<code>Line 4</code> fetches city population data from an external service.
</li>
<li>
<code>Line 10</code> filters entities by <code>ID</code>.
</li>
<li>
<code>Lines 12, 13 and 14</code> define the component's resulting entity set.
It contains the previously detected <code>city</code> entities, the new <code>big-city</code> entities, and
all other non-city entities.
</li>
</ul>
</section>
<section id="entity-validator">
<h2 class="section-title">Entity validator<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCEntityValidator NCEntityValidator %} trait.
</p>
<p>
This component is designed for entity inspection; an exception can be thrown from user code to stop processing of the user input.
</p>
<pre class="brush: scala, highlight: [3]">
new NCEntityValidator:
override def validate(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit =
if ents.exists(_.getId == "restrictedID")
then throw new NCException("Sentence cannot be processed.")
</pre>
<ul>
<li>
An anonymous instance of {% scaladoc NCEntityValidator NCEntityValidator %}
is created.
</li>
<li>
<code>Line 3</code> defines the rule for when an exception should be thrown and sentence processing should be stopped.
</li>
</ul>
</section>
<section id="variant-filter">
<h2 class="section-title">Variant filter<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
You have to implement the {% scaladoc NCVariantFilter NCVariantFilter %} trait.
</p>
<p>
This component allows you to filter detected variants, rejecting undesirable ones.
</p>
<pre class="brush: scala, highlight: [3]">
new NCVariantFilter:
override def filter(req: NCRequest, cfg: NCModelConfig, vars: List[NCVariant]): List[NCVariant] =
vars.filter(_.getEntities.exists(_.getId == "requiredID"))
</pre>
<ul>
<li>
An anonymous instance of {% scaladoc NCVariantFilter NCVariantFilter %}
is created.
</li>
<li>
<code>Line 3</code> defines the variant filter:
it passes only those variants which contain <code>requiredID</code> elements.
</li>
</ul>
</section>
</div>
<div class="col-md-2 third-column">
<ul class="side-nav">
<li class="side-nav-title">On This Page</li>
<li><a href="#overview">Overview</a></li>
<li><a href="#token-parser">Token parser</a></li>
<li><a href="#token-enricher">Token enricher</a></li>
<li><a href="#token-validator">Token validator</a></li>
<li><a href="#entity-parser">Entity parser</a></li>
<li><a href="#entity-enricher">Entity enricher</a></li>
<li><a href="#entity-mapper">Entity mapper</a></li>
<li><a href="#entity-validator">Entity validator</a></li>
<li><a href="#variant-filter">Variant filter</a></li>
{% include quick-links.html %}
</ul>
</div>