| --- |
| active_crumb: Docs |
| layout: documentation |
| id: pipeline-components |
| --- |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <div class="col-md-8 second-column"> |
| <section id="overview"> |
| <h2 class="section-title">Overview <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| A <a href="api-components.html#model-pipeline">Model Pipeline</a> is a chain of components, each defined by a trait, |
| that together process an input sentence. |
| NLPCraft provides a number of useful <a href="/built-in-overview.html">built-in components</a> that solve a wide range of tasks |
| without coding. |
| However, you may need to extend the provided functionality and develop your own components. |
| The sections below show, for each kind of component, how to implement it and when doing so is useful. |
| </p> |
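| <p> |
| The idea of a chain can be pictured with a minimal, self-contained sketch. |
| The types <code>Tok</code>, <code>Ent</code> and <code>SketchPipeline</code> are hypothetical stand-ins and not NLPCraft classes, |
| each stage is modeled as a plain function, and the stage order is assumed to follow the order of the sections on this page. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical stand-ins for tokens and entities; not NLPCraft classes. |
| case class Tok(text: String, props: Map[String, String] = Map.empty) |
| case class Ent(id: String, toks: List[Tok]) |
| |
| // Each stage is a plain function so that the chaining is explicit. |
| case class SketchPipeline( |
|     parseTokens: String => List[Tok], |
|     enrichTokens: List[List[Tok] => List[Tok]], |
|     validateTokens: List[List[Tok] => Unit], |
|     parseEntities: List[List[Tok] => List[Ent]] |
| ): |
|     def run(text: String): List[Ent] = |
|         // 1. Split the text, then let every enricher add properties in turn. |
|         val toks = enrichTokens.foldLeft(parseTokens(text))((ts, f) => f(ts)) |
|         // 2. Any validator may throw to stop processing. |
|         validateTokens.foreach(v => v(toks)) |
|         // 3. Every entity parser contributes its own entities. |
|         parseEntities.flatMap(p => p(toks)) |
| |
| val pipeline = SketchPipeline( |
|     parseTokens = (s: String) => s.split("\\s+").toList.filter(_.nonEmpty).map(w => Tok(w)), |
|     enrichTokens = List((ts: List[Tok]) => ts.map(t => t.copy(props = t.props + ("lower" -> t.text.toLowerCase)))), |
|     validateTokens = List((ts: List[Tok]) => require(ts.nonEmpty, "Empty input.")), |
|     parseEntities = List((ts: List[Tok]) => ts.filter(_.props.get("lower").contains("paris")).map(t => Ent("city", List(t)))) |
| ) |
| </pre> |
| <p> |
| With these assumptions, <code>pipeline.run("I love Paris")</code> yields a single <code>city</code> entity built from the enriched <code>Paris</code> token. |
| </p> |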
| </section> |
| <section id="token-parsers"> |
| <h2 class="section-title">Token parsers<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCTokenParser NCTokenParser %} trait. |
| </p> |
| <p> |
| You rarely need to write your own language tokenizer. |
| It is mostly necessary when you want to support a new language. |
| You prepare the implementation once and can then reuse it in all projects for that language. |
| Usually it is enough to find an open source solution and wrap it. |
| </p> |
| <pre class="brush: scala, highlight: [2, 6]"> |
| import org.apache.nlpcraft.* |
| import org.languagetool.tokenizers.fr.FrenchWordTokenizer |
| import scala.jdk.CollectionConverters.* |
| |
| class NCFrTokenParser extends NCTokenParser: |
|     private val tokenizer = new FrenchWordTokenizer |
| |
|     override def tokenize(text: String): List[NCToken] = |
|         val toks = collection.mutable.ArrayBuffer.empty[NCToken] |
|         var sumLen = 0 |
| |
|         for ((word, idx) <- tokenizer.tokenize(text).asScala.zipWithIndex) |
|             val start = sumLen |
|             val end = sumLen + word.length |
| |
|             if word.strip.nonEmpty then |
|                 toks += new NCPropertyMapAdapter with NCToken: |
|                     override def getText: String = word |
|                     override def getIndex: Int = idx |
|                     override def getStartCharIndex: Int = start |
|                     override def getEndCharIndex: Int = end |
| |
|             sumLen = end |
| |
|         toks.toList |
| </pre> |
| <ul> |
| <li> |
| <code>NCFrTokenParser</code> is a simple wrapper that implements the |
| {% scaladoc NCTokenParser NCTokenParser %} trait on top of the |
| open source <a href="https://languagetool.org">LanguageTool</a> library. |
| </li> |
| </ul> |
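| <p> |
| The offset bookkeeping used by such a wrapper can be sketched without any external library. |
| In the sketch below <code>Tok</code> and <code>fragments</code> are hypothetical stand-ins: <code>fragments</code> plays the role of a |
| tokenizer that returns whitespace fragments together with words, and, as a choice of this sketch, token indexes count only the emitted tokens. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical stand-in for a token; not an NLPCraft class. |
| case class Tok(text: String, index: Int, start: Int, end: Int) |
| |
| // Splits text into alternating word and whitespace fragments, so that the |
| // fragment lengths add up to the length of the original text. |
| def fragments(text: String): List[String] = |
|     if text.isEmpty then Nil |
|     else |
|         val isSpace = text.head.isWhitespace |
|         val (frag, rest) = text.span(_.isWhitespace == isSpace) |
|         frag :: fragments(rest) |
| |
| def tokenize(text: String): List[Tok] = |
|     val toks = collection.mutable.ArrayBuffer.empty[Tok] |
|     var offset = 0 |
| |
|     for frag <- fragments(text) do |
|         val start = offset |
|         offset += frag.length |
|         // Whitespace advances the offset but produces no token. |
|         if !frag.head.isWhitespace then |
|             toks += Tok(frag, toks.size, start, offset) |
| |
|     toks.toList |
| </pre> |
| <p> |
| Because every fragment advances the offset, <code>text.substring(tok.start, tok.end)</code> equals <code>tok.text</code> for each token produced. |
| </p> |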
| </section> |
| |
| <section id="token-enrichers"> |
| <h2 class="section-title">Token enrichers<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCTokenEnricher NCTokenEnricher %} trait. |
| </p> |
| <p> |
| A token enricher is a component that adds properties to already parsed tokens. |
| Later pipeline steps can use these token properties in entity detection conditions. |
| </p> |
| <pre class="brush: scala, highlight: [25, 26]"> |
| import org.apache.nlpcraft.* |
| import org.languagetool.AnalyzedToken |
| import org.languagetool.tagging.ru.RussianTagger |
| import scala.jdk.CollectionConverters.* |
| |
| class NCRuLemmaPosTokenEnricher extends NCTokenEnricher: |
|     private def nvl(v: String, dflt: => String): String = if v != null then v else dflt |
| |
|     override def enrich(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit = |
|         val tags = RussianTagger.INSTANCE.tag(toks.map(_.getText).asJava).asScala |
| |
|         require(toks.size == tags.size) |
| |
|         toks.zip(tags).foreach { case (tok, tag) => |
|             val readings = tag.getReadings.asScala |
| |
|             val (lemma, pos) = readings.size match |
|                 // No data. Lemma is word as is, POS is undefined. |
|                 case 0 => (tok.getText, "") |
|                 // Takes first. Other variants ignored. |
|                 case _ => |
|                     val aTok: AnalyzedToken = readings.head |
|                     (nvl(aTok.getLemma, tok.getText), nvl(aTok.getPOSTag, "")) |
| |
|             tok.put("pos", pos) |
|             tok.put("lemma", lemma) |
| |
|             () // Otherwise NPE. |
|         } |
| </pre> |
| <ul> |
| <li> |
| <code>Lines 25 and 26</code> enrich each {% scaladoc NCToken NCToken %} |
| with two new properties, <code>pos</code> and <code>lemma</code>, which can be used for <a href="intent-matching.html">Intent matching</a> later. |
| </li> |
| </ul> |
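| <p> |
| How a later step might rely on such properties can be sketched in isolation. |
| Everything in this sketch is an assumption made for illustration: <code>Tok</code> is a stand-in for a token with a mutable property map, |
| and a tiny hard-coded dictionary replaces a real tagger. |
| </p> |
| <pre class="brush: scala"> |
| import scala.collection.mutable |
| |
| // Hypothetical stand-in for a token with a mutable property map. |
| class Tok(val text: String): |
|     val props = mutable.Map.empty[String, String] |
| |
| // Toy dictionary standing in for a real tagger: word -> (lemma, POS). |
| val dict = Map("cats" -> ("cat", "NOUN"), "ran" -> ("run", "VERB")) |
| |
| def enrich(toks: List[Tok]): Unit = |
|     for tok <- toks do |
|         // Unknown words keep their own text as the lemma and an empty POS. |
|         val (lemma, pos) = dict.getOrElse(tok.text.toLowerCase, (tok.text, "")) |
|         tok.props("lemma") = lemma |
|         tok.props("pos") = pos |
| |
| // A later step selects tokens by the properties set above. |
| def nounLemmas(toks: List[Tok]): List[String] = |
|     toks.filter(_.props.get("pos").contains("NOUN")).map(_.props("lemma")) |
| </pre> |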
| </section> |
| |
| <section id="token-validators"> |
| <h2 class="section-title">Token validators<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCTokenValidator NCTokenValidator %} trait. |
| </p> |
| |
| <p> |
| This component inspects tokens. User code can throw an exception from it to stop processing of the user input. |
| </p> |
| |
| <pre class="brush: scala, highlight: [3]"> |
| new NCTokenValidator: |
|     override def validate(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit = |
|         if toks.exists(_.contains("restrictionFlag")) |
|         then throw new NCException("Sentence cannot be processed.") |
| </pre> |
| |
| <ul> |
| <li> |
| An anonymous instance of {% scaladoc NCTokenValidator NCTokenValidator %} |
| is created. |
| </li> |
| <li> |
| <code>Line 3</code> defines the rule under which an exception is thrown and sentence processing stops. |
| </li> |
| </ul> |
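| <p> |
| The effect of a thrown exception on the rest of the processing can be sketched as follows. |
| The names <code>Tok</code>, <code>RejectedException</code> and <code>process</code> are hypothetical stand-ins, and the "later stage" is reduced to counting tokens. |
| </p> |
| <pre class="brush: scala"> |
| import scala.util.{Failure, Success, Try} |
| |
| // Hypothetical stand-ins; not NLPCraft classes. |
| case class Tok(text: String, props: Map[String, String] = Map.empty) |
| class RejectedException(msg: String) extends RuntimeException(msg) |
| |
| def validate(toks: List[Tok]): Unit = |
|     if toks.exists(_.props.contains("restrictionFlag")) then |
|         throw new RejectedException("Sentence cannot be processed.") |
| |
| // Validation runs first; the later stage is skipped when it throws. |
| def process(toks: List[Tok]): Either[String, Int] = |
|     Try(validate(toks)) match |
|         case Success(_) => Right(toks.size) |
|         case Failure(e) => Left(e.getMessage) |
| </pre> |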
| </section> |
| |
| <section id="entity-parsers"> |
| <h2 class="section-title">Entity parsers<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCEntityParser NCEntityParser %} trait. |
| </p> |
| |
| <p> |
| This is the most important component: it finds user-specific data. |
| The entities it detects are the input for <a href="intent-matching.html">Intent matching</a> conditions. |
| You can implement your own custom logic for named entity detection here. |
| It is also a natural integration point for neural networks or any other solutions that |
| help you find and mark your domain-specific named entities. |
| </p> |
| |
| <pre class="brush: scala, highlight: [5]"> |
| import org.apache.nlpcraft.* |
| |
| class CommentsEntityParser extends NCEntityParser: |
|     def parse(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): List[NCEntity] = |
|         if req.getText.trim.startsWith("--") then |
|             List( |
|                 new NCPropertyMapAdapter with NCEntity: |
|                     override def getTokens: List[NCToken] = toks |
|                     override def getRequestId: String = req.getRequestId |
|                     override def getId: String = "comment" |
|             ) |
|         else |
|             List.empty |
| </pre> |
| <ul> |
| <li> |
| In the given example the whole input sentence is marked as a single <code>comment</code> element if |
| the condition defined on <code>line 5</code> is <code>true</code>. |
| </li> |
| </ul> |
| </section> |
| |
| <section id="entity-enrichers"> |
| <h2 class="section-title">Entity enrichers<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCEntityEnricher NCEntityEnricher %} trait. |
| </p> |
| <p> |
| An entity enricher is a component that adds properties to already parsed entities. |
| It can be useful for extending the functionality of existing entity enrichers. |
| </p> |
| |
| <pre class="brush: scala, highlight: [4, 11, 12]"> |
| import org.apache.nlpcraft.* |
| |
| object CityPopulationEntityEnricher: |
|     val citiesPopulation: Map[String, Int] = someExternalService.getCitiesPopulation() |
| |
| import CityPopulationEntityEnricher.* |
| |
| class CityPopulationEntityEnricher extends NCEntityEnricher: |
|     def enrich(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit = |
|         ents. |
|             filter(_.getId == "city"). |
|             foreach(e => e.put("city:population", citiesPopulation(e("city:name")))) |
| </pre> |
| |
| <ul> |
| <li> |
| <code>Line 4</code> loads city population data from some external service. |
| </li> |
| <li> |
| <code>Line 11</code> filters entities by <code>ID</code>. |
| </li> |
| <li> |
| <code>Line 12</code> enriches the selected entities with a new <code>city:population</code> property. |
| </li> |
| </ul> |
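| <p> |
| A lookup of the form <code>citiesPopulation(name)</code> on a Scala <code>Map</code> throws <code>NoSuchElementException</code> for an unknown key. |
| The following self-contained sketch shows one possible way to skip such cities instead. |
| The <code>Ent</code> class is a hypothetical stand-in for an entity, and the population numbers are illustrative values only. |
| </p> |
| <pre class="brush: scala"> |
| import scala.collection.mutable |
| |
| // Hypothetical stand-in for an entity with a mutable property map. |
| class Ent(val id: String, val props: mutable.Map[String, Any]) |
| |
| // Illustrative values only; stands in for data from an external service. |
| val citiesPopulation = Map("Paris" -> 2000000, "Lyon" -> 500000) |
| |
| // Entities without a name, or with a name missing from the data, are left untouched. |
| def enrich(ents: List[Ent]): Unit = |
|     for |
|         e <- ents if e.id == "city" |
|         name <- e.props.get("city:name").collect { case s: String => s } |
|         population <- citiesPopulation.get(name) |
|     do e.props("city:population") = population |
| </pre> |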
| </section> |
| |
| <section id="entity-mappers"> |
| <h2 class="section-title">Entity mappers<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCEntityMapper NCEntityMapper %} trait. |
| </p> |
| |
| <p> |
| An entity mapper is a component that maps one set of entities to another after the entities |
| have been parsed and enriched. It can be useful for building complex parsers on top of existing ones. |
| </p> |
| |
| <pre class="brush: scala, highlight: [4, 10, 12, 13, 14]"> |
| import org.apache.nlpcraft.* |
| |
| object CityPopulationEntityMapper: |
|     val citiesPopulation: Map[String, Int] = externalService.getCitiesPopulation() |
| |
| import CityPopulationEntityMapper.* |
| |
| class CityPopulationEntityMapper extends NCEntityMapper: |
|     def map(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): List[NCEntity] = |
|         val cities = ents.filter(_.getId == "city") |
| |
|         ents.filterNot(_.getId == "city") ++ |
|             cities ++ |
|             cities.filter(city => citiesPopulation(city("city:name")) > 1000000). |
|             map(city => |
|                 new NCPropertyMapAdapter with NCEntity: |
|                     override def getTokens: List[NCToken] = city.getTokens |
|                     override def getRequestId: String = req.getRequestId |
|                     override def getId: String = "big-city" |
|             ) |
| </pre> |
| <ul> |
| <li> |
| <code>Line 4</code> loads city population data from some external service. |
| </li> |
| <li> |
| <code>Line 10</code> filters entities by <code>ID</code>. |
| </li> |
| <li> |
| <code>Lines 12, 13 and 14</code> define the resulting entity set. |
| It contains all non-city elements, the previously detected <code>city</code> elements and |
| the new <code>big-city</code> elements. |
| </li> |
| </ul> |
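| <p> |
| The shape of the resulting set can be checked with a small stand-alone sketch. |
| Here <code>Ent</code> is a hypothetical stand-in for an entity, the population numbers are illustrative values only, |
| and treating unknown cities as having population <code>0</code> is an assumption of the sketch. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical stand-in for an entity; not an NLPCraft class. |
| case class Ent(id: String, name: String) |
| |
| // Illustrative values only. |
| val citiesPopulation = Map("Paris" -> 2000000, "Lyon" -> 500000) |
| |
| def mapEntities(ents: List[Ent]): List[Ent] = |
|     val (cities, others) = ents.partition(_.id == "city") |
|     // Unknown cities count as population 0, so they never become "big-city". |
|     val bigCities = cities |
|         .filter(c => citiesPopulation.getOrElse(c.name, 0) > 1000000) |
|         .map(_.copy(id = "big-city")) |
| |
|     others ++ cities ++ bigCities |
| </pre> |
| <p> |
| For the input <code>date</code>, <code>city</code> (Paris), <code>city</code> (Lyon), the result keeps all three entities and appends one <code>big-city</code> entity for Paris. |
| </p> |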
| </section> |
| |
| <section id="entity-validators"> |
| <h2 class="section-title">Entity validators<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCEntityValidator NCEntityValidator %} trait. |
| </p> |
| <p> |
| This component inspects entities. User code can throw an exception from it to stop processing of the user input. |
| </p> |
| |
| <pre class="brush: scala, highlight: [3]"> |
| new NCEntityValidator: |
|     override def validate(req: NCRequest, cfg: NCModelConfig, ents: List[NCEntity]): Unit = |
|         if ents.exists(_.getId == "restrictedID") |
|         then throw new NCException("Sentence cannot be processed.") |
| </pre> |
| |
| <ul> |
| <li> |
| An anonymous instance of {% scaladoc NCEntityValidator NCEntityValidator %} |
| is created. |
| </li> |
| <li> |
| <code>Line 3</code> defines the rule under which an exception is thrown and sentence processing stops. |
| </li> |
| </ul> |
| </section> |
| |
| <section id="variant-filters"> |
| <h2 class="section-title">Variant filters<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| You have to implement {% scaladoc NCVariantFilter NCVariantFilter %} trait. |
| </p> |
| |
| <p> |
| This component filters the detected variants and rejects undesirable ones. |
| </p> |
| |
| <pre class="brush: scala, highlight: [3]"> |
| new NCVariantFilter: |
|     def filter(req: NCRequest, cfg: NCModelConfig, vars: List[NCVariant]): List[NCVariant] = |
|         vars.filter(_.getEntities.exists(_.getId == "requiredID")) |
| </pre> |
| |
| <ul> |
| <li> |
| An anonymous instance of {% scaladoc NCVariantFilter NCVariantFilter %} |
| is created. |
| </li> |
| <li> |
| <code>Line 3</code> defines the variant filter; |
| it passes only variants that contain <code>requiredID</code> elements. |
| </li> |
| </ul> |
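| <p> |
| The same rule can be tried on plain data. |
| In this sketch <code>Ent</code> and <code>Variant</code> are hypothetical stand-ins, with a variant modeled simply as a list of entities. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical stand-ins; not NLPCraft classes. |
| case class Ent(id: String) |
| case class Variant(entities: List[Ent]) |
| |
| // Keeps only the variants that contain at least one "requiredID" entity. |
| def filterVariants(vars: List[Variant]): List[Variant] = |
|     vars.filter(_.entities.exists(_.id == "requiredID")) |
| |
| val withRequired = Variant(List(Ent("requiredID"), Ent("city"))) |
| val withoutRequired = Variant(List(Ent("city"))) |
| val kept = filterVariants(List(withRequired, withoutRequired)) |
| </pre> |
| <p> |
| Only <code>withRequired</code> remains in <code>kept</code>. Note that with this rule the result is empty when no variant contains the required element. |
| </p> |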
| </section> |
| |
| </div> |
| <div class="col-md-2 third-column"> |
| <ul class="side-nav"> |
| <li class="side-nav-title">On This Page</li> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#token-parsers">Token parsers</a></li> |
| <li><a href="#token-enrichers">Token enrichers</a></li> |
| <li><a href="#token-validators">Token validators</a></li> |
| <li><a href="#entity-parsers">Entity parsers</a></li> |
| <li><a href="#entity-enrichers">Entity enrichers</a></li> |
| <li><a href="#entity-mappers">Entity mappers</a></li> |
| <li><a href="#entity-validators">Entity validators</a></li> |
| <li><a href="#variant-filters">Variant filters</a></li> |
| {% include quick-links.html %} |
| </ul> |
| </div> |
| |
| |
| |
| |