| --- |
| active_crumb: Composable Named Entities |
| layout: blog |
| blog_title: Composable Named Entities |
| author_name: Aaron Radzinski |
| author_avatar: images/lion.jpg |
| author_twitter_id: aaron_radzinski |
| publish_date: January 20, 2021 |
| --- |
| |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| |
| <section> |
| <div class="bq info"> |
| This blog is an English adaptation of Sergey Kamov's <a target=habr href="https://habr.com/ru/post/530878/">blog</a> written in Russian. |
| </div> |
| {% include latest-ver-blog-warn.html %} |
| <p> |
| Most NLP tasks start with a basic challenge: how to find or detect something in the text. Whether |
| you are designing a search engine, a conversational interface, or some sort of classifier, you will likely |
| start with the problem of how to detect <a target="_new" href="https://www.wikiwand.com/en/Named_entity">named entities</a> |
| in the input text. These named entities can be universal, |
| such as dates, countries, and cities, or specific to the domain of your model. It is also important to note that |
| we are talking about a class of NLP tasks where you actually know what you are looking for. |
| </p> |
| <figure> |
| <img class="img-fluid" src="/images/how_to_find_something_fig1.png" alt=""> |
| <figcaption><b>Fig 1.</b> Named Entities</figcaption> |
| </figure> |
| <p> |
| The software component responsible for finding named entities is called a named entity recognition |
| (NER) component. Its goal is to find a certain entity in the input text and optionally extract additional |
| information about this entity. For example, consider the sentence "Give me <b>twenty two</b> face masks". A numeric |
| NER component will find the numeric entity “twenty two” and extract the normalized integer value “22” from it, |
| which can then be used in further processing. |
| </p> |
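| <p> |
| To make the “twenty two” example concrete, here is a minimal Python sketch of a rule-based numeric NER step. It is only an illustration with a deliberately tiny vocabulary: the names <code>NumericEntity</code> and <code>find_numbers</code> are hypothetical and do not come from any of the libraries discussed in this blog, and real components handle far more forms (hundreds, ordinals, digits, other languages). |
| </p> |

```python
import re
from dataclasses import dataclass
from typing import List

# Tiny vocabulary for illustration only (assumption: English, values 0..99).
SMALL = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


@dataclass
class NumericEntity:
    text: str   # words as matched in the lowercased input
    value: int  # normalized integer value


def find_numbers(sentence: str) -> List[NumericEntity]:
    """Scan the sentence left to right and collect spelled-out numbers."""
    words = re.findall(r"[a-z]+", sentence.lower())
    found = []
    i = 0
    while i < len(words):
        word = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if word in TENS and nxt in SMALL and 1 <= SMALL[nxt] <= 9:
            # Compound form such as "twenty two": tens word plus a unit word.
            found.append(NumericEntity(f"{word} {nxt}", TENS[word] + SMALL[nxt]))
            i += 2
        elif word in TENS:
            found.append(NumericEntity(word, TENS[word]))
            i += 1
        elif word in SMALL:
            found.append(NumericEntity(word, SMALL[word]))
            i += 1
        else:
            i += 1
    return found


# prints [NumericEntity(text='twenty two', value=22)]
print(find_numbers("Give me twenty two face masks"))
```

| <p> |
| Returning both the matched text and the normalized value lets downstream logic work with the integer directly instead of re-parsing the words. |
| </p> |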
| <p> |
| NER components are usually based either on neural networks trained on extensive, well-prepared corpora |
| or on simpler rule-based or synonym-matching algorithms that are better suited for domain-specific |
| applications. Note that most universal NER components tend to use some variation of neural-network-based |
| algorithms. |
| </p> |
| <p> |
| The rest of this blog concentrates on a brief review of existing popular products and open source projects |
| that provide NER components. We will also look at what the Apache NLPCraft project brings to the table on this |
| topic. Note that this review is far from an exhaustive analysis of these libraries; it is rather a quick |
| overview of their pros and cons as of the end of 2020. |
| </p> |
| </section> |
| <section> |
| <h2 class="section-title">NER Providers <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| Let's take a look at several well-known NLP libraries that provide built-in NER components. |
| </p> |
| <h2 class="section-sub-title">Apache OpenNLP <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| <img class="img-title" src="/images/opennlp-logo.png" height="48px" alt=""> |
| </p> |
| <p> |
| <a target="_blank" href="https://opennlp.apache.org/">Apache OpenNLP</a> is an Apache-licensed open source library that provides a typical set of built-in NER |
| components for English to detect dates, times, geographic locations, organizations, numeric values, |
| and known persons. Other languages such as Spanish and Dutch are supported with some limitations. |
| </p> |
| <p> |
| Apache OpenNLP is a Java library and its models are available separately due to ASF |
| release process limitations. |
| </p> |
| <p> |
| <b>Pros:</b><br/> |
| Apache license, mature and well tested implementation. |
| </p> |
| <p> |
| <b>Cons:</b><br/> |
| The built-in models have seen little, if any, development in the last several years. |
| It appears that the main usage pattern of Apache OpenNLP is to build and train your own models from scratch. |
| </p> |
| <h2 class="section-sub-title">Stanford NLP <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| <img class="img-title" src="/images/corenlp-logo.png" height="64px" alt=""> |
| </p> |
| <p> |
| <a target="_parent" href="https://nlp.stanford.edu/">Stanford NLP</a> is a popular, actively developed, and mature NLP library that provides a wide range of |
| functionality. For English it supports the following named entities: person, location, organization, |
| misc, money, number, ordinal, percent, date, time, duration, set. Furthermore, a built-in regular-expression-based |
| NER component recognizes the following additional named entities: email, url, city, |
| state_or_province, country, nationality, religion, (job) title, ideology, criminal_charge, cause_of_death, |
| handle. More information is available <a target="_blank" href="https://stanfordnlp.github.io/CoreNLP/ner.html#description">here</a>. |
| </p> |
| <p> |
| There is limited support for German, Spanish, and Mandarin. A <a target="_blank" href="https://corenlp.run/">live demo</a> allows you to test out |
| various capabilities of Stanford NLP. |
| </p> |
| <p> |
| Stanford NLP is a Java library. Models are available in Maven along with the project itself. |
| I could not find a detailed description of NER components for languages other than English. <a target=_blank href="https://medium.com/sicara/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486">Here</a> |
| and <a target=_blank href="https://medium.com/@klintcho/training-a-swedish-ner-model-for-stanford-corenlp-part-2-20a0cfd801dd#.vnow3swam">here</a> you can find instructions on how to train your own NER components for other languages. |
| </p> |
| <p> |
| <b>Pros:</b><br/> |
| A mature, live, and actively developed project with very good recognition quality |
| (I use the word “good” very subjectively as we won’t go into formal quality metrics of each |
| project here). |
| </p> |
| <p> |
| <b>Cons:</b><br/> |
| The biggest gripe is the use of the <a target="_blank" href="https://www.wikiwand.com/ru/GNU_General_Public_License">GNU GPL</a> license, which is all but shunned these days due to its viral |
| nature and business unfriendliness. In other words, it is not free for closed-source use, and you have to buy a commercial |
| license if you intend to ship it as part of proprietary software. Documentation is adequate at best and can be a |
| frustrating experience (as with many academically driven software projects). |
| </p> |
| <h2 class="section-sub-title">Google Language API <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <img class="img-title" src="/images/google-cloud-logo-small.png" height="56px" alt=""> |
| <p> |
| <a target="_blank" href="https://cloud.google.com/natural-language">Google Language API</a> supports the |
| following named entities for the English language: person, location, organization, event, work_of_art, |
| consumer_good, other, phone_number, address, date, number, price. |
| </p> |
| <p> |
| Google Language API is available as a REST API with native client libraries for Java, C#, Python, Go, etc. |
| </p> |
| <p> |
| <b>Pros:</b><br/> |
| A large set of NER components from a trusted NLP-focused company like Google. The scalability and availability of |
| a modern SaaS platform developed by Google. |
| </p> |
| <p> |
| <b>Cons:</b><br/> |
| The REST API inherently limits the performance of the final solution, making it almost impossible to use |
| in any “real-time” application. Free only for a small number of transactions, paid after that. Not open source. |
| </p> |
| <h2 class="section-sub-title">spaCy <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <img id="spacy" class="img-title" src="/images/spacy-logo.png" height="48px" alt=""> |
| <p> |
| <a target="_blank" href="https://spacy.io">spaCy</a> is a Python library that provides one of the best, if not the best, collection of NER components. |
| <a target="_blank" href="https://spacy.io/api/annotation#named-entities">Here</a> you can see a full list of supported NERs. |
| </p> |
| <p> |
| <b>Pros:</b><br/> |
| Actively developed and mature project. Open source with an MIT license. Some of the best |
| documentation among similar projects. One of the most popular NLP libraries among the few dozen |
| available to the Python community. |
| </p> |
| <p> |
| <b>Cons:</b><br/> |
| Python, which is rarely used for production-level applications. Slow, often unacceptably slow, |
| performance (also due to Python). Lack of first-class support for languages other than English. |
| </p> |
| </section> |
| <section> |
| <h2 class="section-title">Additional Capabilities of Apache NLPCraft <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| Let’s take a look at what Apache NLPCraft brings to the table that is different or additional. |
| </p> |
| <p> |
| When it comes to NER components, Apache NLPCraft provides the following: |
| </p> |
| <ul> |
| <li>Built-in NER components for dates, geographical locations, numerics, sorting, limiting, and a few others, all of which support the extraction of normalized values and extensive metadata.</li> |
| <li>Integration with external NER components from Apache OpenNLP, Stanford NLP, Google Language API and spaCy.</li> |
| <li>Support for “composable <span class="amp">&</span> reusable named entities” where users can create new detectable named entities out of existing ones.</li> |
| </ul> |
| <p> |
| While built-in NER components and integration with 3rd party ones are rather “pedestrian” |
| capabilities (and you can read about them <a href="/integrations.html">here</a>), support for “composable <span class="amp">&</span> reusable named entities” is something that is unique to Apache NLPCraft. |
| Let’s look at it in more detail. |
| </p> |
| </section> |
| <section> |
| <h2 class="section-title">Reusable <span class="amp">&</span> Composable Named Entities <a href="#"><i class="top-link fas fa-fw fa-angle-double-up"></i></a></h2> |
| <p> |
| Apache NLPCraft is the first project that provides direct support for composable named entities - named entities |
| that are defined in terms of other (constituent or part) entities. |
| Let’s illustrate this with an example. |
| </p> |
| <p> |
| Let’s imagine you are building an NLP-based answering application utilizing intent-based matching (Alexa, |
| Google DialogFlow, Apache NLPCraft, etc.). In this application we want to answer questions about geographical |
| locations but <b>only in the USA</b>. |
| </p> |
| <p> |
| One way to accomplish this task is to use any NER provider, for example, <code>nlpcraft:city</code> from |
| Apache NLPCraft, and build your intents using it. Then, when a particular intent is selected and its callback is called, you can check the <code>country</code> |
| metadata field of the detected named entity. If it does not equal <code>USA</code>, you need to exit (break) from |
| the intent's callback and continue trying other intents, if any were matched as well. |
| </p> |
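| <p> |
| The callback-side check described above can be sketched roughly as follows. This is a self-contained Python illustration and not NLPCraft API: <code>Entity</code>, <code>usa_city_callback</code>, <code>fallback_callback</code> and <code>dispatch</code> are hypothetical names, and the <code>city:country</code> metadata key is an assumption borrowed from the model syntax shown later in this blog. |
| </p> |

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Entity:
    id: str                                   # e.g. "nlpcraft:city"
    text: str                                 # matched text
    meta: Dict[str, str] = field(default_factory=dict)


# Hypothetical convention: a callback returns an answer, or None to tell
# the matcher "this is not mine, try the next matched intent".
Callback = Callable[[Entity], Optional[str]]


def usa_city_callback(ent: Entity) -> Optional[str]:
    # The USA filter lives here, in imperative code, not in the intent.
    if ent.meta.get("city:country", "").lower() != "usa":
        return None
    return f"Answering a question about {ent.text}, USA"


def fallback_callback(ent: Entity) -> Optional[str]:
    return f"Sorry, {ent.text} is outside the USA"


def dispatch(ent: Entity, callbacks: List[Callback]) -> Optional[str]:
    # The matcher must support this back-and-forth: keep trying callbacks
    # until one of them accepts the entity.
    for callback in callbacks:
        answer = callback(ent)
        if answer is not None:
            return answer
    return None
```

| <p> |
| Note how the country rule is invisible in the intent definition itself; anyone reading the declarative part of the model would not know that non-USA cities are rejected. |
| </p> |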
| <p> |
| Well, that’s not so easy in real life: |
| </p> |
| <ul> |
| <li> |
| First of all, your intent-based NLP library must support such a back-and-forth between the intent’s callback |
| and the intent matching logic. Very few do. |
| </li> |
| <li> |
| You are spreading the matching logic between a declarative intent definition (YAML file) and a |
| programmable intent callback (Java code), which generally leads to an implementation that is very hard to maintain. |
| </li> |
| </ul> |
| <p> |
| Okay... you can create your own brand new NER component from scratch that would detect only geographical |
| locations in the US. However, this will surely take more than a few minutes. |
| </p> |
| <p> |
| Yet another approach, if supported by your intent-based NLP library, is to enhance the intent definition itself |
| to match only USA geographical locations. At this time, however, I’m not aware of any NLP library |
| supporting this other than Apache NLPCraft. Furthermore, you are complicating your intents, which generally should be |
| as simple and maintainable as possible. |
| </p> |
| <p> |
| That’s where <b>composable named entities</b> come to the rescue. Apache NLPCraft allows you to define a new named entity |
| using existing named entities, whether user-defined, built-in or external (more documentation on this can be found |
| <a href="/data-model.html#dsl">here</a>). Following up on our example application: |
| </p> |
| <pre class="brush: js, highlight: [3, 6]"> |
| "elements": [ |
| { |
| "id": "custom:city:usa", |
| "description": "Wrapper for USA cities", |
| "synonyms": [ |
| "^^id == 'nlpcraft:city' &amp;&amp; lowercase(~city:country) == 'usa'^^" |
| ] |
| } |
| ] |
| </pre> |
| <p> |
| In this model snippet, we define a new named entity <code>custom:city:usa</code> (line 3) that is based on the |
| existing <code>nlpcraft:city</code> entity (line 6), filtered to cities whose country is the USA. Once you have this new named entity |
| defined, you can use it to define an intent that will only match cities in the USA. |
| </p> |
| <p> |
| Another example: |
| </p> |
| <pre class="brush: js, highlight: [9, 12]"> |
| "macros": [ |
| { |
| "name": "&lt;AIRPORT&gt;", |
| "macro": "{airport|aerodrome|airdrome|air station}" |
| } |
| ], |
| "elements": [ |
| { |
| "id": "custom:airport:usa", |
| "description": "Wrapper for USA airports", |
| "synonyms": [ |
| "&lt;AIRPORT&gt; {of|for|_} ^^id == 'nlpcraft:city' &amp;&amp; lowercase(~city:country) == 'usa'^^" |
| ] |
| } |
| ] |
| </pre> |
| <p> |
| In this example, we define a new named entity <code>custom:airport:usa</code>. Its definition not only |
| filters cities for the USA but also adds a prefix indicating that this is an airport (learn more about |
| NLPCraft IDL syntax <a href="https://nlpcraft.apache.org/intent-matching.html">here</a>). |
| </p> |
| <p> |
| Composable named entities can be nested but not recursive. All the normalized metadata of the constituent |
| (part) entities, at any nesting depth, is accessible to the composed named entity, e.g. metadata |
| from <code>nlpcraft:city</code> is accessible in the <code>custom:airport:usa</code> entity. |
| You can also define a new composed named entity based on your own named entities. This way you are |
| essentially <b>mixing in</b> new entities instead of creating something from scratch every time. |
| </p> |
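| <p> |
| The nesting and metadata rules above can be modeled in a few lines of Python. This is a conceptual sketch only and not how Apache NLPCraft implements composition: <code>Entity</code>, <code>compose</code> and <code>deep_meta</code> are hypothetical names, and the <code>custom:city:usa:ny</code> entity and <code>city:region</code> metadata key are made up for illustration. |
| </p> |

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Entity:
    id: str
    text: str
    meta: Dict[str, str] = field(default_factory=dict)
    parts: List["Entity"] = field(default_factory=list)  # constituent entities

    def deep_meta(self) -> Dict[str, str]:
        """Merge metadata of all constituents, at any nesting depth."""
        merged: Dict[str, str] = {}
        for part in self.parts:
            merged.update(part.deep_meta())
        merged.update(self.meta)  # own metadata wins over inherited keys
        return merged


Detector = Callable[[Entity], Optional[Entity]]


def compose(new_id: str, base_id: str,
            predicate: Callable[[Entity], bool]) -> Detector:
    """Define entity `new_id` as `base_id` filtered by `predicate`."""
    def detect(ent: Entity) -> Optional[Entity]:
        if ent.id == base_id and predicate(ent):
            # Wrap the base entity so its metadata stays reachable.
            return Entity(new_id, ent.text, parts=[ent])
        return None
    return detect


# Mirrors the "custom:city:usa" element from the model snippet above.
usa_city = compose(
    "custom:city:usa", "nlpcraft:city",
    lambda e: e.deep_meta().get("city:country", "").lower() == "usa",
)

# Nested composition: a hypothetical entity built on top of a composed one.
ny_city = compose(
    "custom:city:usa:ny", "custom:city:usa",
    lambda e: e.deep_meta().get("city:region", "") == "New York",
)
```

| <p> |
| In this sketch a nested entity never re-states the USA filter; it inherits it by being defined on top of <code>custom:city:usa</code>, which is the “mixing in” idea in miniature. |
| </p> |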
| <p> |
| In the end, composable entities allow you to: |
| </p> |
| <ul> |
| <li>Simplify intents by concentrating matching logic in reusable <span class="amp">&</span> composable named entities.</li> |
| <li>Create new named entities without any coding or expensive model training.</li> |
| <li>Reuse existing named entities to build new ones.</li> |
| </ul> |
| </section> |
| |
| |
| |