blob: fc10bf35a2a6aebdee798e13624daf02a1cee937 [file] [log] [blame]
---
active_crumb: Composable Named Entities
layout: blog
blog_title: Composable Named Entities
author_name: Aaron Radzinski
author_avatar: images/lion.jpg
author_twitter_id: aaron_radzinski
publish_date: January 20, 2021
---
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<section>
<div class="bq info">
This blog is an English adaptation of Sergey Kamov's <a target=habr href="https://habr.com/ru/post/530878/">blog</a> written in Russian.
</div>
{% include latest-ver-blog-warn.html %}
<p>
Most of the NLP tasks start with the basic challenge - how to find or detect something in the text. Whether
you are designing a search engine, conversational interface or some sort of classificator you will likely
start with a problem of how to detect <a target="_new" href="https://www.wikiwand.com/en/Named_entity">named entities</a>
in the input text. These named entities can be universal
such as dates, countries, cities as well as domain specific for your model. It is also important to note that
we are talking about a class of NLP tasks where you actually know what you are looking for.
</p>
<figure>
<img class="img-fluid" src="/images/how_to_find_something_fig1.png" alt="">
<figcaption><b>Fig 1.</b> Named Entities</figcaption>
</figure>
<p>
The software component responsible for finding the named entities is called a named entity recognition
(NER) component. Its goal is to find a certain entity in the input text and optionally extract additional
information about this entity. For example, consider the sentence "Give me <b>twenty two</b> face masks". A numeric
NER component will find numeric entity “twenty two” and will extract normalized integer value “22” from it
which can be then used further.
</p>
<p>
NER components are usually based on neural networks that are trained on extensive and well prepared models
(corpus) as well as on simpler rule-based or synonym matching algorithms that are better suited for domain
specific applications. Note that most universal NER components tend to use some variation of neural networks
based algorithms.
</p>
<p>
Rest of this blog will concentrate on a brief review of existing popular products and open source projects
that provide NER components. We will also look at what Apache NLPCraft project brings to the table on this
topic as well. Note that this review is far from exhaustive analysis of these libraries but rather a quick
overview of their pros and cons as of the end of 2020.
</p>
</section>
<section>
<h2 class="section-title">NER Providers <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
Let's take a look at several well-known NLP libraries that provide built-in NER components.
</p>
<h2 class="section-sub-title">Apache OpenNLP <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
<img class="img-title" src="/images/opennlp-logo.png" height="48px" alt="">
</p>
<p>
<a target="_blank" href="https://opennlp.apache.org/">Apache OpenNlp</a> is an Apache licensed open source library that provides a typical set of built-in NER
components for English to detect dates, times, geographic locations, organizations, numeric values,
and known persons. Other languages such as Spanish and Dutch are supported with some limitations.
</p>
<p>
Apache OpenNLP is a Java library and its models are available separately due to ASF
release process limitations.
</p>
<p>
<b>Pros:</b><br/>
Apache license, mature and well tested implementation.
</p>
<p>
<b>Cons:</b><br/>
Built-in models don’t see much of the development, if any at all, in the last several years.
It appears that the main usage pattern of Apache OpenNLP is to build and train your own models from scratch.
</p>
<h2 class="section-sub-title">Stanford NLP <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
<img class="img-title" src="/images/corenlp-logo.png" height="64px" alt="">
</p>
<p>
<a target="_parent" href="https://nlp.stanford.edu/">Stanford NLP</a> is a popular and actively developed, mature NLP library that provides a wide range of
functionality. For English it supports the following named entities: person, location, organization,
misc, money, number, ordinal, percent, date, time, duration, set. Furthermore, built in regular expressions
based NER component allows to recognize the following additional named entities: email, url, city,
state_or_province, country, nationality, religion, (job) title, ideology, criminal_charge, cause_of_death,
handle. More information <a target="_blank" href="https://stanfordnlp.github.io/CoreNLP/ner.html#description">here</a>.
</p>
<p>
There’s a limited support for German, Spanish and Mandarin languages. <a target="_blank" href="https://corenlp.run/">Live demo</a> allows you to test out
various capabilities of Stanford NLP.
</p>
<p>
Stanford NLP is a Java library. Models are available in Maven along with the project itself.
I could not find a detailed description of NER components for languages other than English. <a target=_blank href="https://medium.com/sicara/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486">Here</a>
and <a target=_blank href="https://medium.com/@klintcho/training-a-swedish-ner-model-for-stanford-corenlp-part-2-20a0cfd801dd#.vnow3swam">here</a> you can find instructions on how to train your own NER components for other languages.
</p>
<p>
<b>Pros:</b><br/>
Maturity of the project. Live and actively developed project with very good recognition quality
(I use the word “good” very subjectively as we won’t go into formal qualitative metrics of each
project here).
</p>
<p>
<b>Cons:</b><br/>
The biggest gripe is the usage of <a target="_blank" href="https://www.wikiwand.com/ru/GNU_General_Public_License">GNU GPL</a> license which is all but shun away these days due its viral
nature and business unfriendliness. In other words - it is not free and you have to buy a commercial
license if you intend to use it in any serious way. Documentation is adequate at best and can be a
frustrating experience (like many academically driven software projects).
</p>
<h2 class="section-sub-title">Google Language API <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<img class="img-title" src="/images/google-cloud-logo-small.png" height="56px" alt="">
<p>
<a target="_blank" href="https://cloud.google.com/natural-language">Google Language API</a> supports the
following named entities for the English language: person, location, organization, event, work_of_art,
consumer_good, other, phone_number, address, date, number, price.
</p>
<p>
Google Language API is available as REST API with the native client libraries for Java, C#, Python, Go, etc.
</p>
<p>
<b>Pros:</b><br/>
Large set of NER components from a trusted NLP-based company like Google. Scalability and availability of
modern SaaS platform developed by Google...
</p>
<p>
<b>Cons:</b><br/>
REST API inherently limits the performance of the final solution - making it almost impossible to be used
in any “real-time” applications. Free only for a small number of transactions, paid after that. Not open source.
</p>
<h2 class="section-sub-title">spacy <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<img id="spacy" class="img-title" src="/images/spacy-logo.png" height="48px" alt="">
<p>
<a target="_blank" href="https://spacy.io">spaCy</a> is a Python library that provides one of the best, if not the best, collection of NER components.
<a target="_blank" href="https://spacy.io/api/annotation#named-entities">Here</a> you can see a full list of supported NERs.
</p>
<p>
<b>Pros:</b><br/>
Actively developed and mature project. Open source with MIT license. One of the best
documentation among similar projects. One of the most popular NLP libraries among a few dozens of available
libraries for the Python community.
</p>
<p>
<b>Cons:</b><br/>
Python - which is rarely used for production level applications. Slow, often unacceptably slow,
performance (due to Python as well). Lack of 1st grade support for language other than English.
</p>
</section>
<section>
<h2 class="section-title">Additional Capabilities of Apache NLPCraft <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
Let’s take a look at what Apache NLPCraft brings different or additional to the table.
</p>
<p>
When it comes to NER components, Apache NLPCraft provides the following:
</p>
<ul>
<li>Built-in NER components for date, geographical locations, numerics, sorting, limiting, and few others with all of them supporting the extraction of the normalized values and extensive metadata.</li>
<li>Integration with external NER components from Apache OpenNLP, Stanford NLP, Google Language API and spacy.</li>
<li>Support for “composable <span class="amp">&amp;</span> reusable named entities” where users can create new detectable named entities out of existing ones.</li>
</ul>
<p>
While built-in NER components and integration with 3rd party ones is rather a “pedestrian”
capabilities (and you can read about them <a href="/integrations.html">here</a>) - the “composable <span class="amp">&amp;</span> reusable named entities” is something that is unique for Apache NLPCraft.
Let’s look at it in more detail.
</p>
</section>
<section>
<h2 class="section-title">Reusable <span class="amp">&amp;</span> Composable Named Entities <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2>
<p>
Apache NLPCraft is the first project that provides direct support for composable named entities - named entities
that are defined in terms of other (constituent or part) entities.
Let’s illustrate this by an example.
</p>
<p>
Let’s imagine you are building an NLP-based answering application utilizing intent-based matching (Alexa,
Google DialogFlow, Apache NLPCraft, etc.). In this application we want to answer questions about geographical
locations but <b>only the USA</b>.
</p>
<p>
The one of the ways to accomplish this task is to use any NER providers, for example, <code>nlpcraft:city</code> from
Apache NLPCraft, and build your intents using it. Then, when a particular intent is selected and its callback is called you can check the <code>country</code>
metadata field of the detected named entity. If it does not equal the <code>USA</code> you need to exit (break) from
the intent's callback and continue trying other intents, if any were matched as well.
</p>
<p>
Well, that’s not so easy in real life:
</p>
<ul>
<li>
First of all, your intent-based NLP library must support such a back-and-forth between intent’s callback
and intent matching logic. And very few indeed do…
</li>
<li>
You are spreading the matching logic between declarative intent definition (YAML file) and a
programmable intent’s callback (Java code) which generally leads to a very hard to maintain implementation.
</li>
</ul>
<p>
Okay... you can create your own brand new NER component from scratch that would detect only geographical
locations in the US. However, this will surely take more than a few minutes.
</p>
<p>
Yet another approach, if supported by your intent-based NLP library, is to enhance the intent definition itself
to match only USA geographical locations. At this time, however, I’m not aware of any other NLP libraries
supporting this other than Apache NLPCraft. Furthermore, you are complicating your intents that generally should be
as simple and maintainable as possible.
</p>
<p>
That’s where <b>composable named entities</b> come to the rescue. Apache NLPCraft allows you to define a new named entity
using existing ones - user-defined, built-in or external - named entities (more documentation on this can be found
<a href="/data-model.html#dsl">here</a>). Following up on our example application:
</p>
<pre class="brush: js, highlight: 3, 6">
"elements": [
{
"id": "custom:city:usa",
"description": "Wrapper for USA cities",
"synonyms": [
"^^id == 'nlpcraft:city' && lowercase(~city:country) == 'usa')^^"
]
}
]
</pre>
<p>
In this model snippet, we are defining a new named entity <code>custom:city:usa</code> (line 3) that is based on
existing <code>nlpcraft:city</code> (line 6) that is also filtered for USA country. Once you have this new named entity
defined you can use it to define the intent that will only match cities in the USA.
</p>
<p>
Another example:
</p>
<pre class="brush: js, highlight: [9, 12]">
"macros": [
{
"name": "&lt;AIRPORT&gt;",
"macro": "{airport|aerodrome|airdrome|air station}"
}
],
"elements": [
{
"id": "custom:airport:usa",
"description": "Wrapper for USA airports",
"synonyms": [
"&lt;AIRPORT&gt; {of|for|_} ^^id == 'nlpcraft:city' && lowercase(~city:country) == 'usa')^^"
]
}
]
</pre>
<p>
In this example, we defined a new named entity <code>custom:airport:usa</code>. In its definition we not only
filter cities for the USA but also added a prefix that would indicate that this is an airport (learn more about
NLPCraft IDL syntax <a href="https://nlpcraft.apache.org/intent-matching.html">here</a>).
</p>
<p>
Composable named entities can be nested but not recursive. All the normalized metadata of the constituent
(part) entities - of any nesting depths - is accessible to the named entity, e.g. metadata
from <code>nlpcraft:city</code> is accessible in <code>custom:airport:usa</code> entity.
You can also define a new composed named entity based on your own named entities. This way you are
essentially <b>mixing in</b> new entities instead of creating something from scratch every time.
</p>
<p>
In the end, composable entities allow you to:
</p>
<ul>
<li>Simplify intents by concentrating matching logic in reusable <span class="amp">&amp;</span> composable named entities.</li>
<li>Create new named entities without any coding or expensive model training.</li>
<li>Reuse existing named entities to build new ones.</li>
</ul>
</section>