 ---
 active_crumb: Docs
 layout: documentation
 id: overview
 ---

 <!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
 -->

 <div class="col-md-8 second-column">
     <section id="overview">
         <h2 class="section-title">Custom components <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>

         <p>
             NLPCraft provides a number of useful built-in components for the English language.
             You can use them to prepare a <code>Pipeline</code> for your <code>Model</code>.
             You can also use the provided wrappers around the NER components of the
             <a href="https://opennlp.apache.org/">Apache OpenNLP</a> and
             <a href="https://nlp.stanford.edu/">Stanford NLP</a> projects.
             Their models work with English and some other languages.
         </p>
         <p>
             However, you may need to extend the provided functionality and develop your own components.
             Let's review these components step by step.
         </p>
     </section>
     <section id="token-parser">
         <h2 class="section-title">Token parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenParser.html">NCTokenParser</a> trait.
         </p>
         <p>
             You rarely need to write your own tokenizer.
             It is mostly necessary when you want to work with a new language.
             You prepare the implementation once and can then reuse it in all projects for that language.
             Usually you can just find an open source solution and wrap it in an
             <a href="apis/latest/org/apache/nlpcraft/NCTokenParser.html">NCTokenParser</a> implementation.
         </p>
         <pre class="brush: scala, highlight: [2, 6]">
             import org.apache.nlpcraft.*
             import org.languagetool.tokenizers.fr.FrenchWordTokenizer
             import scala.jdk.CollectionConverters.*

             class NCFrTokenParser extends NCTokenParser:
                 private val tokenizer = new FrenchWordTokenizer

                 override def tokenize(text: String): List[NCToken] =
                     val toks = collection.mutable.ArrayBuffer.empty[NCToken]
                     var sumLen = 0

                     for ((word, idx) <- tokenizer.tokenize(text).asScala.zipWithIndex)
                         val start = sumLen
                         val end = sumLen + word.length

                         if word.strip.nonEmpty then
                             toks += new NCPropertyMapAdapter with NCToken:
                                 override def getText: String = word
                                 override def getIndex: Int = idx
                                 override def getStartCharIndex: Int = start
                                 override def getEndCharIndex: Int = end

                         sumLen = end

                     toks.toList
         </pre>
         <ul>
             <li>
                 <code>NCFrTokenParser</code> is a simple wrapper that implements <code>NCTokenParser</code> on top of
                 the open source <a href="https://languagetool.org">LanguageTool</a> library.
             </li>
         </ul>
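         <p>
             The offset bookkeeping in the example above can be tried in isolation.
             The following is a minimal sketch, not part of the NLPCraft API:
             <code>SimpleToken</code> and <code>withOffsets</code> are hypothetical stand-ins for
             <code>NCToken</code> and the parser body, and the sketch assumes that the tokenizer returns
             whitespace as separate pieces that concatenate back to the original text.
         </p>
         <pre class="brush: scala">
             // Stand-in for NCToken, defined locally so the sketch runs without NLPCraft.
             final case class SimpleToken(text: String, index: Int, start: Int, end: Int)

             // Hypothetical helper: takes raw tokenizer pieces (whitespace included)
             // and returns the non-blank ones with their character offsets.
             def withOffsets(pieces: Seq[String]): List[SimpleToken] =
                 val toks = collection.mutable.ArrayBuffer.empty[SimpleToken]
                 var sumLen = 0

                 for (piece, idx) <- pieces.zipWithIndex do
                     val start = sumLen
                     val end = sumLen + piece.length

                     // Blank pieces are skipped but still advance the running offset.
                     if piece.strip.nonEmpty then toks += SimpleToken(piece, idx, start, end)

                     sumLen = end

                 toks.toList
         </pre>
         <p>
             For the pieces <code>"Bonjour"</code>, <code>" "</code>, <code>"le"</code>, <code>" "</code>,
             <code>"monde"</code>, this sketch yields three tokens with character ranges 0 to 7, 8 to 10 and 11 to 16.
             As in the parser above, the token index is the position in the raw tokenizer output,
             so skipped whitespace pieces leave gaps in the index sequence.
         </p>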
     </section>

     <section id="token-enricher">
         <h2 class="section-title">Token enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenEnricher.html">NCTokenEnricher</a> trait.
         </p>
         <p>
             <a href="apis/latest/org/apache/nlpcraft/NCToken.html">NCToken</a> instances are used in
             <a href="intent-matching.html">Intent matching</a>. NLPCraft provides a number of built-in token enricher
             implementations for the English language.
             You may want to create your own or extend the existing ones. Look at the following example:
         </p>
         <pre class="brush: scala, highlight: [25, 26]">
             import org.apache.nlpcraft.*
             import org.languagetool.AnalyzedToken
             import org.languagetool.tagging.ru.RussianTagger
             import scala.jdk.CollectionConverters.*

             class NCRuLemmaPosTokenEnricher extends NCTokenEnricher:
                 private def nvl(v: String, dflt : => String): String = if v != null then v else dflt

                 override def enrich(req: NCRequest, cfg: NCModelConfig, toks: List[NCToken]): Unit =
                     val tags = RussianTagger.INSTANCE.tag(toks.map(_.getText).asJava).asScala

                     require(toks.size == tags.size)

                     toks.zip(tags).foreach { case (tok, tag) =>
                         val readings = tag.getReadings.asScala

                         val (lemma, pos) = readings.size match
                             // No data. Lemma is word as is, POS is undefined.
                             case 0 => (tok.getText, "")
                             // Takes first. Other variants ignored.
                             case _ =>
                                 val aTok: AnalyzedToken = readings.head
                                 (nvl(aTok.getLemma, tok.getText), nvl(aTok.getPOSTag, ""))

                         tok.put("pos", pos)
                         tok.put("lemma", lemma)

                         () // Otherwise NPE.
                     }
         </pre>
         <ul>
             <li>
                 <code>Lines 25 and 26</code> enrich <a href="apis/latest/org/apache/nlpcraft/NCToken.html">NCToken</a>
                 with two new properties, <code>pos</code> and <code>lemma</code>, which can later be used for
                 <a href="intent-matching.html">Intent matching</a>.
             </li>
         </ul>
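         <p>
             The fallback logic of the enricher above (take the first reading, use the word itself when no lemma is
             available) can be isolated into a small function. This is only a sketch under stated assumptions:
             <code>Reading</code> is a hypothetical stand-in for LanguageTool's <code>AnalyzedToken</code>,
             <code>lemmaAndPos</code> is a hypothetical helper, and neither is part of NLPCraft.
         </p>
         <pre class="brush: scala">
             // Stand-in for LanguageTool's AnalyzedToken; either field may be null.
             final case class Reading(lemma: String, posTag: String)

             // Returns the value, or the lazily evaluated default when the value is null.
             def nvl(v: String, dflt: => String): String = if v != null then v else dflt

             // Hypothetical helper: picks (lemma, pos) for a word from its readings.
             def lemmaAndPos(text: String, readings: Seq[Reading]): (String, String) =
                 readings.headOption match
                     // No data. Lemma is the word as is, POS is undefined.
                     case None => (text, "")
                     // Takes the first reading. Other variants are ignored.
                     case Some(r) => (nvl(r.lemma, text), nvl(r.posTag, ""))
         </pre>
         <p>
             Keeping this decision in a pure function makes it easy to test without loading a tagger;
             the enricher then only has to store the two resulting values on the token.
         </p>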
     </section>

     <section id="token-validator">
         <h2 class="section-title">Token validator <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCTokenValidator.html">NCTokenValidator</a> trait.
         </p>
     </section>

     <section id="entity-parser">
         <h2 class="section-title">Entity parser <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityParser.html">NCEntityParser</a> trait.
         </p>
     </section>

     <section id="entity-enricher">
         <h2 class="section-title">Entity enricher <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityEnricher.html">NCEntityEnricher</a> trait.
         </p>
     </section>

     <section id="entity-mapper">
         <h2 class="section-title">Entity mapper <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityMapper.html">NCEntityMapper</a> trait.
         </p>
     </section>

     <section id="entity-validator">
         <h2 class="section-title">Entity validator <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCEntityValidator.html">NCEntityValidator</a> trait.
         </p>
     </section>

     <section id="variant-filter">
         <h2 class="section-title">Variant filter <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
         <p>
             You have to implement the <a href="apis/latest/org/apache/nlpcraft/NCVariantFilter.html">NCVariantFilter</a> trait.
         </p>
     </section>

 </div>
 <div class="col-md-2 third-column">
     <ul class="side-nav">
         <li class="side-nav-title">On This Page</li>
         <li><a href="#overview">Overview</a></li>
         <li><a href="#token-parser">Token parser</a></li>
         <li><a href="#token-enricher">Token enricher</a></li>
         <li><a href="#token-validator">Token validator</a></li>
         <li><a href="#entity-parser">Entity parser</a></li>
         <li><a href="#entity-enricher">Entity enricher</a></li>
         <li><a href="#entity-mapper">Entity mapper</a></li>
         <li><a href="#entity-validator">Entity validator</a></li>
         <li><a href="#variant-filter">Variant filter</a></li>
         {% include quick-links.html %}
     </ul>
 </div>