| --- |
| active_crumb: Docs |
| layout: documentation |
| id: built-in-entity-parser |
| --- |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <div class="col-md-8 second-column"> |
| <section id="overview"> |
| <h2 class="section-title">Overview<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| The {% scaladoc NCEntityParser NCEntityParser %} trait is part of the <a href="api-components.html#model-pipeline">Model Pipeline</a>. |
| Its implementations find user-defined named entities |
| using prepared tokens as input. |
| </p> |
| |
| <p> |
| The following built-in parsers are provided: |
| </p> |
| |
| <ul> |
| <li> |
| A <a href="#parser-opennlp">wrapper</a> for the <a href="https://opennlp.apache.org/">Apache OpenNLP</a> named entity finder, whose |
| prepared models support English and some other languages. |
| </li> |
| <li> |
| A <a href="#parser-stanford">wrapper</a> for the <a href="https://nlp.stanford.edu/">Stanford NLP</a> named entity finder, whose |
| prepared models support English and some other languages. |
| </li> |
| <li> |
| An NLP data <a href="#parser-nlp">wrapper</a> implementation. It does not depend on language. |
| </li> |
| <li> |
| A semantic <a href="#parser-semantic">implementation</a> for the English language. |
| </li> |
| </ul> |
| </section> |
| |
| <section id="parser-opennlp"> |
| <h2 class="section-title">OpenNLP Based Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| {% scaladoc nlp/parsers/NCOpenNLPEntityParser NCOpenNLPEntityParser %} is a wrapper around <a href="https://opennlp.apache.org/">Apache OpenNLP</a> NER components. |
| The supported NER finder models are listed <a href="https://opennlp.sourceforge.net/models-1.5/">here</a>. |
| For example, the following models are available for English: <code>Location</code>, <code>Money</code>, |
| <code>Person</code>, <code>Organization</code>, <code>Date</code>, <code>Time</code> and <code>Percentage</code>. |
| Models for other languages are also available. |
| </p> |
| </section> |
| |
| <section id="parser-stanford"> |
| <h2 class="section-title">Stanford NLP Based Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| <code>NCStanfordNLPEntityParser</code> is a wrapper around <a href="https://nlp.stanford.edu/">Stanford NLP</a> NER components. |
| The supported NER finder models are listed <a href="https://nlp.stanford.edu/software/CRF-NER.shtml">here</a>. |
| For example, the following models are available for English: <code>Location</code>, <code>Money</code>, |
| <code>Person</code>, <code>Organization</code>, <code>Date</code>, <code>Time</code> and <code>Percent</code>. |
| Models for other languages are also available. |
| |
| </p> |
| </section> |
| |
| <section id="parser-nlp"> |
| <h2 class="section-title">NLP Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| {% scaladoc nlp/parsers/NCNLPEntityParser NCNLPEntityParser %} converts NLP tokens into entities with four mandatory properties: |
| <code>nlp:token:text</code>, <code>nlp:token:index</code>, <code>nlp:token:startCharIndex</code> and |
| <code>nlp:token:endCharIndex</code>. |
| If other {% scaladoc NCTokenEnricher NCTokenEnricher %} components |
| are registered in the {% scaladoc NCPipeline NCPipeline %} |
| and add further properties to the tokens, |
| these properties are also copied, with their names prefixed with <code>nlp:token:</code>. |
| This component is language independent. |
| Note that the set of converted tokens can be restricted by a predicate. |
| </p> |
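| <p> |
| The property mapping described above can be pictured with a minimal, self-contained sketch. It does not use the NLPCraft API: |
| the <code>Token</code> case class, the <code>NlpEntitySketch</code> object and the <code>pos</code> property are hypothetical |
| stand-ins chosen only for illustration, and the actual implementation may differ. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical stand-in for a pipeline token; this is not the NLPCraft NCToken API. |
| final case class Token( |
|     text: String, |
|     index: Int, |
|     startCharIndex: Int, |
|     endCharIndex: Int, |
|     extra: Map[String, Any] = Map.empty // Properties added by enrichers (assumed shape). |
| ) |
| |
| object NlpEntitySketch { |
|     private val Prefix = "nlp:token:" |
| |
|     // Builds the entity properties for one token: the four mandatory |
|     // properties plus every enricher property, all under the same prefix. |
|     def toEntityProps(t: Token): Map[String, Any] = |
|         Map[String, Any]( |
|             s"${Prefix}text" -> t.text, |
|             s"${Prefix}index" -> t.index, |
|             s"${Prefix}startCharIndex" -> t.startCharIndex, |
|             s"${Prefix}endCharIndex" -> t.endCharIndex |
|         ) ++ t.extra.map { case (k, v) => s"$Prefix$k" -> v } |
| |
|     // Converts only the tokens accepted by the predicate. |
|     def convert(toks: Seq[Token], predicate: Token => Boolean = _ => true): Seq[Map[String, Any]] = |
|         toks.filter(predicate).map(toEntityProps) |
| |
|     def main(args: Array[String]): Unit = { |
|         val toks = Seq( |
|             Token("turn", 0, 0, 4, Map("pos" -> "VB")), |
|             Token("on", 1, 5, 7) |
|         ) |
|         // Keeps only tokens longer than two characters. |
|         convert(toks, _.text.length > 2).foreach(println) |
|     } |
| } |
| </pre> |
| <p> |
| In this sketch the predicate drops the second token, so a single property map is produced, |
| and the enricher property appears under the key <code>nlp:token:pos</code>. |
| </p> |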
| </section> |
| |
| <section id="parser-semantic"> |
| <h2 class="section-title">Semantic Parser<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| |
| <p> |
| The semantic entity parser |
| {% scaladoc nlp/parsers/NCSemanticEntityParser NCSemanticEntityParser %} |
| is a synonym-based implementation of {% scaladoc NCEntityParser NCEntityParser %}. |
| This parser provides a simple but powerful way to find domain-specific data in the input text. |
| It defines a list of {% scaladoc nlp/parsers/NCSemanticElement NCSemanticElement %} |
| which represent <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">named entities</a>. |
| This list is referred to as the <code>Semantic Model</code>. |
| </p> |
| |
| <p> |
| Let's talk a little bit more about <a href="https://en.wikipedia.org/wiki/Named-entity_recognition">Named entities</a>. |
| </p> |
| |
| <section id="parser-semantic-ne"> |
| <h3 class="sub-section-title">Named Entities</h3> |
| |
| <p> |
| A named entity, also known as a semantic element or a token, is one of the main components defined by the NLPCraft data model. |
| A named entity is one or more individual words that have a consistent semantic meaning and typically denote a |
| real-world object, such as persons, locations, numbers, dates and times, organizations, products, etc. Such an |
| object can be abstract or have a physical existence. |
| </p> |
| <p> |
| For example, in the following sentence: |
| </p> |
| <figure> |
| <img alt="named entities" class="img-fluid" src="/images/named-entities.png"> |
| <figcaption><b>Fig 2.</b> Named Entities</figcaption> |
| </figure> |
| <p> |
| the following named entities can be detected: |
| </p> |
| <table class="gradient-table"> |
| <thead> |
| <tr> |
| <th>Words</th> |
| <th>Type</th> |
| <th>Normalized Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td><b>Top 20</b></td> |
| <td><code>user:element:1</code></td> |
| <td>top 20</td> |
| </tr> |
| <tr> |
| <td><b>best pages</b></td> |
| <td><code>user:element:2</code></td> |
| <td>best pages</td> |
| </tr> |
| <tr> |
| <td><b>California USA</b></td> |
| <td><code>stanford:city</code></td> |
| <td>USA, California</td> |
| </tr> |
| <tr> |
| <td><b>last 3 months</b></td> |
| <td><code>stanford:date</code></td> |
| <td>1/1/2021 - 4/1/2021</td> |
| </tr> |
| </tbody> |
| </table> |
| <p> |
| In most cases named entities will have an associated <em>normalized value</em>. This is especially important for named entities that have many |
| notational forms such as time and date, currency, geographical locations, etc. For example, <code>New York</code>, |
| <code>New York City</code> and <code>NYC</code> all refer to the same "New York City, NY USA" location which is a standard normalized form. |
| </p> |
| <p> |
| The process of detecting named entities is called Named Entity Recognition (NER). A named entity can be detected in several ways: through a list of synonyms, by name, by rules, or by |
| statistical techniques such as neural networks trained on a large corpus of predefined data. NLPCraft natively supports synonym-based |
| named entity definitions as well as the ability to compose new named entities through the powerful <a href="/intent-matching.html">Intent Definition Language</a> (IDL) |
| by combining other named entities, including named entities from |
| tools such as OpenNLP or Stanford CoreNLP; see the <a href="built-in-entity-parser.html">Built-in Entity Parser</a> chapter. |
| </p> |
| <p> |
| Named entities allow you to abstract from basic linguistic forms like nouns and verbs to deal with the higher level semantic |
| abstractions like geographical location or time when you are trying to understand the meaning of the sentence. |
| One of the main goals of named entities is to act as input ingredients for <a href="/intent-matching.html">intent matching</a>. |
| </p> |
| <div class="bq info"> |
| <p> |
| <b>😀 User Input → Named Entities → Parsing Variants → Intent Matcher → Winning Intent 🚀</b> |
| </p> |
| <p> |
| User input is parsed into the list of named entities. That list is then further transformed into one or more |
| parsing variants where each variant represents a particular order and combination of detected named entities. |
| Finally, the list of variants acts as an input to intent matching where each variant is matched against every intent |
| in the process of detecting the best matching intent for the original user input. |
| </p> |
| </div> |
| </section> |
| |
| <section id="parser-semantic-elements"> |
| <h3 class="sub-section-title">Elements</h3> |
| |
| <p> |
| {% scaladoc nlp/parsers/NCSemanticElement NCSemanticElement %} represents |
| a NER element to be detected in the user input. |
| </p> |
| |
| <div class="bq info"> |
| <p> |
| <b>Semantic Element <span class="amp">&</span> Named Entity <span class="amp">&</span> Token</b> |
| </p> |
| <p> |
| Terms 'semantic element', 'named entity' and 'token' are used throughout this documentation relatively interchangeably: |
| </p> |
| <dl> |
| <dt>Semantic Element</dt> |
| <dd> |
| Denotes a named entity <em>declared</em> in NLPCraft model. |
| </dd> |
| <dt>Token</dt> |
| <dd> |
| Denotes a semantic element that was <em>detected</em> by NLPCraft in the user input. |
| </dd> |
| <dt>Named Entity</dt> |
| <dd> |
| Denotes a classic term, i.e. one or more individual words that have a |
| consistent semantic meaning and typically define a real-world object. |
| </dd> |
| </dl> |
| </div> |
| |
| <p> |
| Each {% scaladoc nlp/parsers/NCSemanticElement NCSemanticElement %} |
| is described by its <code>type</code>, <code>groups</code>, <code>synonyms</code>, <code>values</code> and <code>properties</code>. |
| </p> |
| <span id="synonyms" class="section-sub-title">Synonyms <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| NLPCraft uses fully deterministic named entity recognition and is not based on statistical approaches that |
| would require pre-existing marked up data sets and extensive training. For each semantic element you can either provide a |
| set of synonyms to match on or specify a piece of code that would be responsible for detecting that named |
| entity (discussed below). A synonym can have one or more individual words. Note that an element's type is its |
| implicit synonym, so that even if no additional synonyms are defined at least one synonym always exists. |
| Note also that synonym matching is performed in two phases. In the first phase, the <em>normalized</em> and <em>stemmed</em> forms of |
| a synonym and the user input are compared. If that attempt is not successful, the <em>stemmed</em> forms |
| of the synonyms are matched against the <em>stemmed</em> forms of the user input words that were <em>lemmatized</em> beforehand. |
| This approach provides more accurate matching and doesn't force users to define synonyms in their base word form. |
| </p> |
| |
| <p> |
| Here's an example of a simple semantic element definition in JSON: |
| </p> |
| <pre class="brush: js, highlight: [6,7,8,9,10,11,12]"> |
| ... |
| "elements": [ |
| { |
| "id": "transport.vehicle", |
| "description": "Transportation vehicle", |
| "synonyms": [ |
| "car", |
| "truck", |
| "light duty truck", |
| "heavy duty truck", |
| "sedan", |
| "coupe" |
| ] |
| } |
| ] |
| ... |
| </pre> |
| <p> |
| While adding multi-word synonyms looks somewhat |
| trivial, in real models the naive approach can lead to thousands and even tens of thousands of |
| possible synonyms due to word, grammar, and linguistic permutations, which quickly becomes untenable if |
| performed manually. |
| </p> |
| <p> |
| NLPCraft provides an effective tool for compact synonym representation. Instead of listing all possible |
| multi-word synonyms one by one you can use a combination of the following techniques: |
| </p> |
| <ul> |
| <li><a href="#macros">Macros</a></li> |
| <li><a href="#regex">Regular expressions</a></li> |
| <li><a href="#option-groups">Option Groups</a></li> |
| </ul> |
| <p> |
| Each whitespace-separated string in the synonym can be either a regular word (like in the above transportation example, |
| where it will be matched using its normalized and stemmed form) or one of the above expressions. |
| </p> |
| <p> |
| Note that this synonym notation is also used by the following |
| {% scaladoc nlp/parsers/NCSemanticElement NCSemanticElement %} methods: |
| </p> |
| <ul> |
| <li><code>getSynonyms()</code> - gets synonyms to match on.</li> |
| <li><code>getValues()</code> - gets values to match on (see <a href="#values">below</a>).</li> |
| </ul> |
| <span id="values" class="section-sub-title">Element Values <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| A semantic element can have an optional set of special synonyms called <em>values</em> or "proper nouns" for this element. |
| Unlike basic synonyms, each value is a pair of a name and a set of standard synonyms by which that value, |
| and ultimately its element, can be recognized in the user input. Note that the value name itself acts as an |
| implicit synonym even when no additional synonyms are added for that value. |
| </p> |
| <p> |
| When a semantic element is recognized it is made available to the model's matching logic as an instance of |
| the {% scaladoc NCToken NCToken %} interface. |
| This interface has a method |
| {% scaladoc NCToken getValue() %} which |
| returns the name of the value, if any, by which |
| that semantic element was recognized. That value name can be further used in intent matching. |
| </p> |
| <p> |
| To understand the importance of the values consider the following changes to our transportation |
| example model: |
| </p> |
| <pre class="brush: js, highlight: [19,20,21,22,23,24,25,26,27,28,29,30]"> |
| ... |
| "macros": [ |
| { |
| "name": "<TRUCK_TYPE>", |
| "macro": "{light duty|heavy duty|half ton|1/2 ton|3/4 ton|one ton|super duty}" |
| } |
| ], |
| "elements": [ |
| { |
| "id": "transport.vehicle", |
| "description": "Transportation vehicle", |
| "synonyms": [ |
| "car", |
| "{<TRUCK_TYPE>|_} {pickup|_} truck", |
| "sedan", |
| "coupe" |
| ], |
| "values": [ |
| { |
| "value": "mercedes", |
| "synonyms": ["mercedes-ben{z|s}", "mb", "ben{z|s}"] |
| }, |
| { |
| "value": "bmw", |
| "synonyms": ["{bimmer|bimer|beemer}", "bayerische motoren werke"] |
| }, |
| { |
| "value": "chevrolet", |
| "synonyms": ["chevy"] |
| } |
| ] |
| } |
| ] |
| ... |
| </pre> |
| <p> |
| With that setup the <code>transport.vehicle</code> element will be recognized by any of the following input strings: |
| </p> |
| <ul> |
| <li><code>car</code></li> |
| <li><code>benz</code> (with value <code>mercedes</code>)</li> |
| <li><code>3/4 ton pickup truck</code></li> |
| <li><code>light duty truck</code></li> |
| <li><code>chevy</code> (with value <code>chevrolet</code>)</li> |
| <li><code>bimmer</code> (with value <code>bmw</code>)</li> |
| <li><code>transport.vehicle</code></li> |
| </ul> |
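| <p> |
| The implicit synonym rules for elements and values can be sketched in a few lines of plain Scala. This is a simplified |
| illustration, not the NLPCraft implementation: <code>Element</code>, <code>Match</code> and <code>recognize</code> are |
| hypothetical names, the synonyms are assumed to be already expanded, and the comparison is done on exact strings rather |
| than on stemmed forms. |
| </p> |
| <pre class="brush: scala"> |
| // Hypothetical, simplified model of a semantic element; this is not the NCSemanticElement API. |
| final case class Element( |
|     elemType: String, |
|     synonyms: Set[String], |
|     values: Map[String, Set[String]] // Value name -> value synonyms. |
| ) |
| |
| // Result of a successful lookup: the element type and, optionally, the value name. |
| final case class Match(elemType: String, value: Option[String]) |
| |
| object ValueLookupSketch { |
|     def recognize(e: Element, input: String): Option[Match] = { |
|         // The value name itself is an implicit synonym of the value. |
|         val byValue = e.values.collectFirst { |
|             case (name, syns) if name == input || syns.contains(input) => name |
|         } |
| |
|         byValue match { |
|             case Some(name) => Some(Match(e.elemType, Some(name))) |
|             // The element type is an implicit synonym of the element. |
|             case None if input == e.elemType || e.synonyms.contains(input) => Some(Match(e.elemType, None)) |
|             case None => None |
|         } |
|     } |
| |
|     def main(args: Array[String]): Unit = { |
|         val vehicle = Element( |
|             "transport.vehicle", |
|             Set("car", "sedan", "coupe", "light duty truck"), |
|             Map("mercedes" -> Set("mercedes-benz", "mb", "benz"), "chevrolet" -> Set("chevy")) |
|         ) |
| |
|         Seq("benz", "chevrolet", "car", "transport.vehicle", "bicycle") |
|             .foreach(in => println(s"$in -> ${recognize(vehicle, in)}")) |
|     } |
| } |
| </pre> |
| <p> |
| Here <code>benz</code> and <code>chevrolet</code> resolve to the element together with a value name, <code>car</code> and |
| <code>transport.vehicle</code> resolve to the element without a value, and <code>bicycle</code> is not recognized. |
| </p> |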
| <span id="groups" class="section-sub-title">Element Groups <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| Each semantic element always belongs to one or more groups. A semantic element provides its groups via the |
| {% scaladoc nlp/parsers/NCSemanticElement getGroups() %} method. |
| By default, if no group is specified, the element type acts as the element's default group. |
| Group membership is a quick and easy way to organise similar semantic elements together and use this |
| categorization in <a href="/intent-matching.html">IDL</a> intents. |
| </p> |
| <p> |
| Note that the proper grouping of the elements is also necessary for the correct operation of |
| Short-Term-Memory (STM) in the conversational context. Consider a |
| {% scaladoc NCToken NCToken %} that |
| represents a previously found semantic element that is stored in the conversation. Such a token |
| will be overridden in the conversation by the more <b>recent token</b> |
| from the <b>same group</b> - a critical rule of maintaining the proper conversational context. |
| See |
| {% scaladoc NCConversation NCConversation %} |
| for more details. |
| </p> |
| |
| <span id="macros" class="section-sub-title">Macros<a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| Listing all possible multi-word synonyms for a given element can be a time-consuming task. Macros |
| together with option groups allow for significant simplification of this task. |
| Macros allow you to give a name to an often used set of words or option groups and reuse it without |
| repeating those words or option groups again and again. A model provides a list of macros via the |
| {% scaladoc nlp/parsers/NCSemanticEntityParser macros %} method. |
| Each macro has a name in the form of <code><X></code> where <code>X</code> |
| is any string, and a string value. Note that macros can be nested (but not recursive), i.e. a macro value can include |
| references to other macros. When the macro name <code><X></code> is encountered in a synonym it gets recursively |
| replaced with its value. |
| </p> |
| <p> |
| Here's a snippet of macro definitions in JSON: |
| </p> |
| <pre class="brush: js"> |
| "macros": [ |
| { |
| "name": "<A>", |
| "macro": "aaa" |
| }, |
| { |
| "name": "<B>", |
| "macro": "<A> bbb" |
| }, |
| { |
| "name": "<C>", |
| "macro": "<A> bbb {z|w}" |
| } |
| ] |
| </pre> |
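| <p> |
| The replacement of nested macros can be illustrated with a short sketch. It is an assumption-laden illustration rather than |
| the NLPCraft macro compiler: <code>MacroSketch</code>, <code>substitute</code> and the depth limit of 32 are hypothetical |
| and exist only to show repeated substitution and a guard against recursive definitions. |
| </p> |
| <pre class="brush: scala"> |
| object MacroSketch { |
|     // Replaces macro names with their values until no macro name is left. |
|     // The depth limit (an arbitrary choice) guards against recursive definitions. |
|     @scala.annotation.tailrec |
|     def substitute(s: String, macros: Map[String, String], depth: Int = 0): String = { |
|         require(depth < 32, s"Macro nesting is too deep or recursive: $s") |
| |
|         val next = macros.foldLeft(s) { case (acc, (name, value)) => acc.replace(name, value) } |
| |
|         if (next == s) s else substitute(next, macros, depth + 1) |
|     } |
| |
|     def main(args: Array[String]): Unit = { |
|         val macros = Map( |
|             "<A>" -> "aaa", |
|             "<B>" -> "<A> bbb", |
|             "<C>" -> "<A> bbb {z|w}" |
|         ) |
| |
|         println(substitute("<B> {b|_} c", macros)) // Prints: aaa bbb {b|_} c |
|         println(substitute("a <C> c", macros)) |
|     } |
| } |
| </pre> |
| <p> |
| Option groups are left untouched by this step. In this sketch they would be expanded afterwards, once all macro names have been replaced. |
| </p> |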
| <span id="option-groups" class="section-sub-title">Option Groups <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| Option groups are similar to wildcard patterns that operate on a single-word basis. One line with an |
| option group expands into one or more individual synonyms. Option groups are the key mechanism for shortened |
| synonym notation. The following examples demonstrate how to use option groups. |
| </p> |
| <p> |
| Consider the macros defined below (note that macros <code><B></code> and <code><C></code> |
| are nested): |
| </p> |
| <table class="gradient-table"> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Value</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td><code><A></code></td> |
| <td><code>aaa</code></td> |
| </tr> |
| <tr> |
| <td><code><B></code></td> |
| <td><code><A> bbb</code></td> |
| </tr> |
| <tr> |
| <td><code><C></code></td> |
| <td><code><A> bbb {z|w}</code></td> |
| </tr> |
| </tbody> |
| </table> |
| <p> |
| Then the following option group expansions will occur in these examples: |
| </p> |
| <table class="gradient-table"> |
| <thead> |
| <tr> |
| <th>Synonym</th> |
| <th>Synonym Expansions</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td><code><A> {b|_} c</code></td> |
| <td> |
| <code>"aaa b c"</code><br> |
| <code>"aaa c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code><A> {b|a}[1,2] c</code></td> |
| <td> |
| <code>"aaa b c"</code><br> |
| <code>"aaa b b c"</code><br> |
| <code>"aaa a c"</code><br> |
| <code>"aaa a a c"</code><br> |
| <code>"aaa c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| <code><B> {b|_} c</code><br> |
| or<br> |
| <code><B> {b}[0,1] c</code> |
| </td> |
| <td> |
| <code>"aaa bbb b c"</code><br> |
| <code>"aaa bbb c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code>{b|\{\_\}}</code></td> |
| <td> |
| <code>"b"</code><br> |
| <code>"{_}"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code>a {b|_}. c</code></td> |
| <td> |
| <code>"a b. c"</code><br> |
| <code>"a . c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code>a .{b, |_}. c</code></td> |
| <td> |
| <code>"a .b, . c"</code><br> |
| <code>"a .. c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code> |
| {% raw %}a {{b|c}|_}.{% endraw %}</code></td> |
| <td> |
| <code>"a ."</code><br> |
| <code>"a b."</code><br> |
| <code>"a c."</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code>a {% raw %}{{{<C>}}|{_}}{% endraw %} c</code></td> |
| <td> |
| <code>"a aaa bbb z c"</code><br> |
| <code>"a aaa bbb w c"</code><br> |
| <code>"a c"</code> |
| </td> |
| </tr> |
| <tr> |
| <td><code>{% raw %}{{{a}}} {b||_|{{_}}||_}{% endraw %}</code></td> |
| <td> |
| <code>"a b"</code><br> |
| <code>"a"</code> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| <p> |
| Specifically: |
| </p> |
| <ul> |
| <li><code>{A|B}</code> denotes either <code>A</code> or <code>B</code>.</li> |
| <li> |
| <code>{A|B|_}</code> denotes either <code>A</code> or <code>B</code> or nothing. |
| <ul> |
| <li>The symbol <code>_</code> can appear anywhere in the list of options, i.e. <code>{A|B|_}</code> is equal to <code>{A|_|B}</code>.</li> |
| </ul> |
| </li> |
| <li> |
| <code>{C}[x,y]</code> denotes an option group with quantifier, i.e. group <code>C</code> appearing from <code>x</code> to <code>y</code> times inclusive. |
| <ul> |
| <li>For example, <code>{C}[1,3]</code> is the same as <code>{C|C C|C C C}</code> notation.</li> |
| <li>Note that <code>{C|_}</code> is equal to <code>{C}[0,1]</code></li> |
| </ul> |
| </li> |
| <li>Excessive curly brackets are ignored, when safe to do so.</li> |
| <li>Macros cannot be recursive but can be nested.</li> |
| <li>Option groups can be nested.</li> |
| <li> |
| <code>'\'</code> (backslash) can be used to escape <code>'{'</code>, <code>'}'</code>, <code>'|'</code> and |
| <code>'_'</code> special symbols used by the option groups. |
| </li> |
| <li>Excessive whitespaces are trimmed when expanding option groups.</li> |
| </ul> |
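| <p> |
| The core of these rules (alternatives, the empty option <code>_</code>, nesting, redundant brackets and whitespace trimming) |
| can be sketched as a small recursive expander. This is a deliberately simplified illustration and not the NLPCraft |
| implementation: <code>OptionGroupSketch</code> and <code>expand</code> are hypothetical names, and quantifiers, escaping, |
| macros and regular expressions are intentionally left out. |
| </p> |
| <pre class="brush: scala"> |
| object OptionGroupSketch { |
|     // Splits the group body on '|' characters that are not nested inside brackets. |
|     private def splitAlternatives(body: String): List[String] = { |
|         val parts = scala.collection.mutable.ListBuffer.empty[String] |
|         val cur = new StringBuilder |
|         var depth = 0 |
| |
|         body.foreach { |
|             case '{' => depth += 1; cur += '{' |
|             case '}' => depth -= 1; cur += '}' |
|             case '|' if depth == 0 => parts += cur.toString; cur.clear() |
|             case ch => cur += ch |
|         } |
| |
|         parts += cur.toString |
|         parts.toList |
|     } |
| |
|     // Returns every expansion of the string; whitespace is not normalized yet. |
|     private def expandRaw(s: String): List[String] = { |
|         val open = s.indexOf('{') |
| |
|         if (open < 0) List(s) |
|         else { |
|             // Finds the bracket that closes the first group. |
|             var depth = 0 |
|             var close = -1 |
|             var i = open |
| |
|             while (close < 0 && i < s.length) { |
|                 if (s(i) == '{') depth += 1 |
|                 else if (s(i) == '}') { depth -= 1; if (depth == 0) close = i } |
|                 i += 1 |
|             } |
| |
|             require(close > open, s"Unbalanced brackets in: $s") |
| |
|             val head = s.substring(0, open) |
|             val tails = expandRaw(s.substring(close + 1)) |
|             // A lone '_' stands for "nothing"; any other option is expanded recursively. |
|             val options = splitAlternatives(s.substring(open + 1, close)).flatMap { alt => |
|                 if (alt.trim == "_") List("") else expandRaw(alt) |
|             } |
| |
|             options.flatMap(o => tails.map(t => head + o + t)) |
|         } |
|     } |
| |
|     // Expands all option groups and collapses redundant whitespace. |
|     def expand(s: String): Set[String] = |
|         expandRaw(s).map(_.trim.replaceAll("\\s+", " ")).toSet |
| |
|     def main(args: Array[String]): Unit = |
|         expand("{light|heavy} duty {pickup|_} truck").toSeq.sorted.foreach(println) |
| } |
| </pre> |
| <p> |
| For instance, with this sketch <code>expand("aaa {b|_} c")</code> yields the two synonyms <code>"aaa b c"</code> and <code>"aaa c"</code>. |
| </p> |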
| <p> |
| We can rewrite our transportation semantic element in a more efficient way using macros and option groups. |
| Even though the actual length of the definition hasn't changed much, it now auto-generates many dozens of synonyms |
| that we would otherwise have to write out manually: |
| </p> |
| <pre class="brush: js, highlight: [4,5,14]"> |
| ... |
| "macros": [ |
| { |
| "name": "<TRUCK_TYPE>", |
| "macro": "{ {light|super|heavy|medium} duty|half ton|1/2 ton|3/4 ton|one ton}" |
| } |
| ], |
| "elements": [ |
| { |
| "id": "transport.vehicle", |
| "description": "Transportation vehicle", |
| "synonyms": [ |
| "car", |
| "{<TRUCK_TYPE>|_} {pickup|_} truck", |
| "sedan", |
| "coupe" |
| ] |
| } |
| ] |
| ... |
| </pre> |
| <span id="regex" class="section-sub-title">Regular Expressions <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></span> |
| <p> |
| Any individual synonym word that starts and ends with <code>//</code> (two forward slashes) is |
| considered to be a Java regular expression as defined in <code>java.util.regex.Pattern</code>. Note that a |
| regular expression can only span a single word, i.e. only individual words from the user input will be |
| matched against the given regular expression and no whitespace is allowed within the regular expression. Note |
| also that option group special symbols <code>{</code>, <code>}</code>, |
| <code>|</code> and <code>_</code> have to be escaped in the regular expression using <code>\</code> |
| (backslash). |
| </p> |
| <p> |
| For example, the following synonym: |
| </p> |
| <pre class="brush: js"> |
| "synonyms": [ |
| "{foo|//bar.+//}" |
| ] |
| </pre> |
| <p> |
| will match the word <code>foo</code> or any other word that starts with <code>bar</code> followed by at least one more character, as long as |
| that word doesn't contain whitespace. |
| </p> |
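| <p> |
| How a single synonym word wrapped in <code>//</code> could be checked against a single input word is sketched below using |
| <code>java.util.regex.Pattern</code>. The sketch rests on assumptions: <code>RegexSynonymSketch</code>, <code>regexOf</code> |
| and <code>matches</code> are hypothetical names, a full match of the word is assumed, and plain words are compared literally |
| instead of by their stemmed forms. |
| </p> |
| <pre class="brush: scala"> |
| import java.util.regex.Pattern |
| |
| object RegexSynonymSketch { |
|     // Returns the compiled pattern if the word is wrapped in "//", otherwise None. |
|     def regexOf(word: String): Option[Pattern] = |
|         if (word.length > 4 && word.startsWith("//") && word.endsWith("//")) |
|             Some(Pattern.compile(word.substring(2, word.length - 2))) |
|         else |
|             None |
| |
|     // Matches one input word against one synonym word. |
|     def matches(synonymWord: String, inputWord: String): Boolean = |
|         regexOf(synonymWord) match { |
|             case Some(p) => p.matcher(inputWord).matches() // The whole word must match. |
|             case None => synonymWord == inputWord |
|         } |
| |
|     def main(args: Array[String]): Unit = { |
|         println(matches("//bar.+//", "barista")) // true |
|         println(matches("//bar.+//", "bar"))     // false: ".+" needs at least one more character |
|         println(matches("foo", "foo"))           // true |
|     } |
| } |
| </pre> |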
| <div class="bq info"> |
| <b>Regular Expressions Performance</b> |
| <p> |
| It's important to note that regular expressions can significantly affect the performance of |
| NLPCraft processing if used carelessly. Use them with caution and test the performance |
| of your model to ensure it meets your requirements. |
| </p> |
| </div> |
| </section> |
| |
| <section id="parser-semantic-examples"> |
| <h3 class="sub-section-title">Examples</h3> |
| |
| <p> |
| The following example shows how to build a model programmatically. |
| </p> |
| |
| <pre class="brush: scala, highlight: [3, 5, 10]"> |
| val mdl = new NCModel( |
| NCModelConfig("test.id", "Test Model", "1.0"), |
| new NCPipelineBuilder().withSemantic( |
| "en", |
| Map( |
| "<OF>" -> "{of|for|per}", |
| "<CUR>" -> "{current|present|now|local}", |
| "<TIME>" -> "{time <OF> day|day time|date|time|moment|datetime|hour|o'clock|clock|date time|date and time|time and date}", |
| ), |
| List( |
| new NCSemanticElement(): |
| override def getType: String = "time" |
| override def getSynonyms: Set[String] = Set("{<CUR>|_} <TIME>", "what <TIME> {is it now|now|is it|_}" ) |
| ) |
| ).build |
| ): |
| // Add your callbacks definition or references on them here. |
| </pre> |
| <ul> |
| <li> |
| <code>Line 5</code> shows the <code>macros</code> parameter definition. |
| </li> |
| <li> |
| <code>Line 10</code> shows the parameter that takes the list of {% scaladoc nlp/parsers/NCSemanticElement NCSemanticElement %} instances. |
| </li> |
| <li> |
| Note that using the {% scaladoc NCPipelineBuilder#withSemantic-fffff4b0 withSemantic() %} |
| method shown on <code>line 3</code> is optional. |
| You can add {% scaladoc nlp/parsers/NCSemanticEntityParser NCSemanticEntityParser %} |
| as a regular {% scaladoc NCEntityParser NCEntityParser %} |
| when you define your {% scaladoc NCPipeline NCPipeline %}. |
| </li> |
| </ul> |
| |
| <p> |
| The following example uses a YAML representation of the semantic elements. |
| </p> |
| |
| <pre class="brush: js, highlight: []"> |
| macros: |
| "<OF>": "{of|for|per}" |
| "<CUR>": "{current|present|now|local}" |
| "<TIME>": "{time <OF> day|day time|date|time|moment|datetime|hour|o'clock|clock|date time|date and time|time and date}" |
| elements: |
| - type: "x:time" |
| description: "Date and/or time token indicator." |
| synonyms: |
| - "{<CUR>|_} <TIME>" |
| - "what <TIME> {is it now|now|is it|_}" |
| </pre> |
| <ul> |
| <li> |
| The same macros and the same element as in the previous example are defined here in the |
| <code>time_model.yaml</code> YAML file. |
| </li> |
| </ul> |
| <pre class="brush: scala, highlight: [3]"> |
| val mdl = new NCModel( |
| NCModelConfig("test.id", "Test Model", "1.0"), |
| new NCPipelineBuilder().withSemantic("en", "time_model.yaml").build |
| ): |
| // Add your callbacks definition or references on them here. |
| </pre> |
| <ul> |
| <li> |
| <code>Line 3</code> creates a semantic model whose elements are defined in the <code>time_model.yaml</code> YAML file. |
| </li> |
| </ul> |
| |
| <p> |
| If you want to use {% scaladoc nlp/parsers/NCSemanticEntityParser NCSemanticEntityParser %} |
| with not English language, you have to provide custom |
| {% scaladoc nlp/parsers/NCSemanticStemmer NCSemanticStemmer %} and |
| {% scaladoc NCTokenParser NCTokenParser %} |
| implementations for required language. Look at the <a href="examples/light_switch_fr.html">Light Switch FR</a> |
| for more details. |
| </p> |
| |
| <pre class="brush: scala, highlight: [4, 7, 8]"> |
| package demo |
| |
| import opennlp.tools.stemmer.snowball.SnowballStemmer |
| import demo.nlp.token.parser.NCFrTokenParser |
| import org.apache.nlpcraft.nlp.parsers.* |
| |
| class NCFrSemanticEntityParser(src: String) extends NCSemanticEntityParser( |
| new NCSemanticStemmer: |
| private val stemmer = new SnowballStemmer(SnowballStemmer.ALGORITHM.FRENCH) |
| override def stem(txt: String): String = stemmer.synchronized { stemmer.stem(txt.toLowerCase).toString } |
| , |
| new NCFrTokenParser(), |
| mdlSrcOpt = Option(src) |
| ) |
| </pre> |
| <ul> |
| <li> |
| <code>Line 4</code> imports <code>NCFrTokenParser</code>, |
| a custom {% scaladoc NCTokenParser NCTokenParser %} |
| implementation for the French language, described in <a href="examples/light_switch_fr.html">Light Switch FR</a>. |
| </li> |
| <li> |
| <code>Line 8</code> defines a custom {% scaladoc nlp/parsers/NCSemanticStemmer NCSemanticStemmer %} |
| implementation for the French language. |
| </li> |
| <li> |
| As you can see, <code>NCFrSemanticEntityParser</code> is a very simple extension of the |
| {% scaladoc nlp/parsers/NCSemanticEntityParser NCSemanticEntityParser %} |
| base class (see <code>line 7</code>). |
| </li> |
| </ul> |
| </section> |
| <section id="parser-semantic-extending"> |
| <h3 class="sub-section-title">Languages Extending</h3> |
| |
| <p> |
| If you want to use |
| {% scaladoc nlp/parsers/NCSemanticEntityParser NCSemanticEntityParser %} |
| with a language other than English, you have to provide custom |
| {% scaladoc nlp/parsers/NCSemanticStemmer NCSemanticStemmer %} and |
| {% scaladoc NCTokenParser NCTokenParser %} |
| implementations for that language. |
| See the <a href="examples/light_switch_fr.html">Light Switch FR</a> example for more details. |
| </p> |
| </section> |
| </section> |
| </div> |
| <div class="col-md-2 third-column"> |
| <ul class="side-nav"> |
| <li class="side-nav-title">On This Page</li> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#parser-opennlp">OpenNLP Based Parser</a></li> |
| <li><a href="#parser-stanford">Stanford NLP Based Parser</a></li> |
| <li><a href="#parser-nlp">NLP Parser</a></li> |
| <li><a href="#parser-semantic">Semantic Parser</a></li> |
| <li><a href="#parser-semantic-ne">Semantic Parser Named Entities</a></li> |
| <li><a href="#parser-semantic-elements">Semantic Parser Elements</a></li> |
| <li><a href="#parser-semantic-examples">Semantic Parser Examples</a></li> |
| <li><a href="#parser-semantic-extending">Semantic Parser Languages Extending</a></li> |
| {% include quick-links.html %} |
| </ul> |
| </div> |