This engine implements entity linking based on Lucene FST (Finite State Transducer) technology. This allows the engine to perform the label-based Entity lookup fully in memory. Only Entity-specific information (URI, labels, types and ranking) for tagged Entities needs to be loaded from disc (or retrieved from an in-memory cache). As a result, this engine can outperform query-based entity linking engines by a factor of ten or more.
This presentation by ???? ????? provides a good overview of FSTs and how they are implemented and used in Solr/Lucene.
This Engine does not use the Lucene FST directly, but is based on the OpenSextant SolrTextTagger, which already provides naive text tagging functionality. This video of a presentation by David Smiley provides a lot of detail on how this all works.
To give users an idea of how efficiently FSTs can hold information, these are the statistics for the FST models required for Entity Linking against Freebase:
This means that the memory requirements for in-memory Entity Linking against Freebase are lower than those for English NLP processing using Stanford NLP or Freeling.
This engine currently depends on an unreleased version 1.2 of the SolrTextTagger module. Because of that, users need to get the source from GitHub and run mvn install to add it to their local repository.
Currently it is highly recommended to use Rupert Westenthaler's fork of the SolrTextTagger, as it already includes a pull request that adds support for multi-valued fields.
After completing this, the engine can be built normally and used with Apache Stanbol.
The Solr index is configured by using the enhancer.engines.linking.solrfst.solrcore configuration property of the Engine. This property needs to point to a Solr index that runs embedded in the same JVM as Apache Stanbol. The Stanbol Commons Solr modules provide two components that allow embedded Solr indexes to be configured:
Solr indexes used by this engine also need to conform to the requirements of the SolrTextTagger module. That means that fields used for FST linking MUST use field analyzers that produce consecutive positions (i.e. the position increment of each term must always be 1). As a consequence, typical field analyzers as used for searches will not work.
The SolrTextTagger README provides an example of a field analyzer configuration that does work. To make things easier, this engine includes this XML file, which contains a schema.xml fragment with FST tagging compatible configurations for most languages supported by Solr.
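The consecutive-position requirement can be illustrated with a small sketch. It is not part of the engine or of the SolrTextTagger; the function name and the sample increments are hypothetical and only show what "position increment of each term must always be 1" means for a token stream.

```python
def has_consecutive_positions(position_increments):
    """Return True if every token advances the position by exactly 1.

    `position_increments` is a hypothetical list holding the position
    increment that an analyzer reported for each emitted token.
    """
    return all(increment == 1 for increment in position_increments)


# Every token advances the position by one: usable for FST tagging.
print(has_consecutive_positions([1, 1, 1]))

# A removed stop word can leave a gap (increment 2), and a synonym can be
# stacked on the previous token (increment 0): neither is usable.
print(has_consecutive_positions([1, 2]))
print(has_consecutive_positions([1, 0, 1]))
```

Analyzers that drop or stack tokens, as search-oriented configurations often do, would fail this check.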
This part of the configuration specifies the layout of the Solr index used by the engine, that is, how Entity information is stored in the Solr index.
The Field Name Encoding configuration enhancer.engines.linking.solrfst.fieldEncoding specifies how Solr fields for multiple languages are encoded. As an example, a vocabulary with labels in multiple languages might use “en_label” for the English labels and “de_label” for the German labels. In this case users should set this property to UnderscorePrefix and simply use “label” when configuring the FST field name.
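As a minimal sketch of what the UnderscorePrefix encoding in the example above amounts to, the following hypothetical helper (not an API of the engine) combines a language code and the configured field name into the Solr field name:

```python
def encode_underscore_prefix(lang, field):
    """Hypothetical helper: language code, underscore, then the field name."""
    return f"{lang}_{field}"


# With "label" configured as the FST field name:
print(encode_underscore_prefix("en", "label"))
print(encode_underscore_prefix("de", "label"))
```

The two calls produce the field names used in the example, "en_label" and "de_label".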
The Field Name Encodings work well with Solr dynamic field configurations, which allow language-specific FieldType specifications to be mapped to prefixes and suffixes such as
This is the full list of supported Field encodings:
field and store values is done. This means that the FST Configuration MUST define the exact field names in the Solr index for every configured language.
The FST Tagging Configuration enhancer.engines.linking.solrfst.fstconfig defines several things:
indexed="true" stored="true".
This configuration is line based (multi valued) and uses the following generic syntax:
{language};{param}={value};{param1}={value1};
!{language}
{language} is either the name of the language (e.g. ‘en’), ‘*’ for all languages, or '' (empty string) for defining default parameter values without including all languages. Lines that start with ‘!’ explicitly exclude a language. Those lines do not allow parameters.
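The line syntax described above can be sketched as a small parser. This is an illustration only, not the engine's implementation: the function name and the returned structure are assumptions, and how the engine merges defaults or resolves ‘*’ is deliberately left out.

```python
def parse_fst_config(lines):
    """Parse FST tagging configuration lines (illustrative sketch).

    Returns a tuple (defaults, languages, excluded):
      defaults  - parameters from lines with an empty language
      languages - mapping of language (or '*') to its parameters
      excluded  - languages excluded by lines starting with '!'
    """
    defaults = {}
    languages = {}
    excluded = set()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("!"):
            # Exclusion lines carry a language only, no parameters.
            excluded.add(line[1:].strip())
            continue
        parts = [part.strip() for part in line.split(";")]
        lang = parts[0]
        params = {}
        for part in parts[1:]:
            if not part:
                continue  # tolerate a trailing ';'
            key, sep, value = part.partition("=")
            if not sep:
                raise ValueError(f"expected param=value, got: {part!r}")
            params[key.strip()] = value.strip()
        if lang == "":
            defaults.update(params)
        else:
            languages.setdefault(lang, {}).update(params)
    return defaults, languages, excluded


defaults, languages, excluded = parse_fst_config([
    ";field=fise:fstTagging;stored=rdfs:label;generate=true",
    "en",
    "de;generate=false",
    "!fr",
])
print(defaults)
print(languages)
print(excluded)
```

Keeping defaults, per-language parameters and exclusions separate leaves the merge policy to the caller, since the text above does not spell it out.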
The following parameters are supported by the Engine:
stored is assumed to be equal to field.
fst/{fst}.{lang}.fst. By default the configured field name is used (with non-alphanumeric characters replaced by ‘_’). If runtime creation is enabled, those files will be created if not present.
enhancer.engines.linking.solrfst.fstThreadPoolSize parameter. Because of this the default is false.
A more advanced configuration might look like:
;field=fise:fstTagging;stored=rdfs:label;generate=true
en
de
es
fr
it
This would set the index field to “fise:fstTagging”, the stored field to “rdfs:label” and allow runtime generation. It would also enable processing of English, German, Spanish, French and Italian texts. A similar configuration that would build FST models for all languages would look as follows:
*;field=fise:fstTagging;stored=rdfs:label;generate=true
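For the field used in these examples, the default FST file name (the configured field name with non-alphanumeric characters replaced by ‘_’, following the {fst}.{lang}.fst pattern) can be sketched as follows. The helper is hypothetical, and it assumes that "alphanumeric" means ASCII letters and digits; the engine may treat other characters differently.

```python
import re


def default_fst_file_name(field, lang):
    """Hypothetical helper: derive the default FST file name for a field.

    Assumption: every character outside A-Z, a-z and 0-9 is replaced by '_'.
    """
    safe_field = re.sub(r"[^A-Za-z0-9]", "_", field)
    return f"{safe_field}.{lang}.fst"


print(default_fst_file_name("fise:fstTagging", "en"))
```

For the field “fise:fstTagging” and English this gives fise_fstTagging.en.fst.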
fise:confidence value if labels of several Entities do match the text.
The enhancer.engines.linking.solrfst.fstThreadPoolSize parameter can be used to configure the size of the thread pool used for the runtime generation of FST models. The default size of the thread pool is 1. Threads use the lowest possible priority to reduce the performance impact on enhancements as much as possible.
When configuring the size of the thread pool, users need to be aware that the generation of FST models needs a lot more memory than the resulting model. Having too many parallel threads might therefore require increasing the memory settings of the JVM. On typical machines FST creation threads will consume 100% CPU. That means that the number of threads should be set to the number of CPU cores that can be spared for FST generation.
NOTE that the generate parameter of the FST Tagging Configuration needs to be set to true to enable runtime generation.
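The sizing advice above can be turned into a rough calculation. The helper below is hypothetical and is only a rule-of-thumb sketch: it subtracts the cores that should stay available for enhancement requests from the total and never returns less than the default of 1. It does not account for the additional memory each generation thread needs.

```python
import os


def fst_thread_pool_size(reserved_cores, total_cores=None):
    """Suggest a value for the FST thread pool size (hypothetical helper).

    reserved_cores - cores to keep free for regular enhancement work
    total_cores    - defaults to the number of cores reported by the OS
    """
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores - reserved_cores)


# On an 8 core machine where 6 cores should stay free for enhancements:
print(fst_thread_pool_size(6, total_cores=8))
```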
While FST tagging is done fully in memory, the FST linking engine needs to read information about matching Entities from the Solr index. This requires disc IO and is typically the part of the process that consumes the most time. The Entity Cache tries to prevent such disc-level IO by caching SolrDocuments containing only the fields required for the linking process (labels, types and, if available, entity rankings). To further reduce memory requirements, only labels in languages requested by processed ContentItems are stored in the cache. The cache uses LRU semantics and is based on the Solr cache implementation.
The size of the cache can be configured by using the enhancer.engines.linking.solrfst.entityCacheSize parameter. The default size is ~65k entities. Increasing the maximum size of the cache will improve performance. For small and medium sized vocabularies the cache can be configured in a way that all entities are cached in memory.
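The LRU behaviour of such a cache can be sketched in a few lines. This is a generic illustration and not the Solr cache implementation the engine is based on; the class name, its methods and the sample documents are hypothetical.

```python
from collections import OrderedDict


class LruEntityCache:
    """Minimal LRU cache sketch keyed by entity URI (illustrative only)."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._entries = OrderedDict()

    def get(self, uri):
        """Return the cached document or None; a hit marks it as recently used."""
        if uri not in self._entries:
            return None
        self._entries.move_to_end(uri)
        return self._entries[uri]

    def put(self, uri, document):
        """Store a document and evict the least recently used one if full."""
        self._entries[uri] = document
        self._entries.move_to_end(uri)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)


cache = LruEntityCache(max_size=2)
cache.put("urn:example:a", {"label": "A"})
cache.put("urn:example:b", {"label": "B"})
cache.get("urn:example:a")                  # "a" is now the most recently used
cache.put("urn:example:c", {"label": "C"})  # evicts "b"
print(cache.get("urn:example:b"))
print(cache.get("urn:example:a"))
```

If the maximum size is at least the number of entities in the vocabulary, nothing is ever evicted, which corresponds to the "all entities are cached in memory" case mentioned above.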
During the development of this Engine the SolrTextTagger was extended with a feature that allows looking up only some of the tokens in the text (see this Pull Request for details). This feature is used to integrate the Stanbol NLP Processing API with the SolrTextTagger. This means that NLP processing results (such as POS tags, Chunks and Named Entities) can be used to tell the SolrTextTagger which tokens to look up in the Vocabulary.
For now this engine uses the exact same Text Processing configuration as the Entity Linking Engine. Please see the linked section of the EntityLinkingEngine documentation for details.
The Entity Linking Configuration of this Engine is very similar to the one for the EntityLinking engine. The configuration uses the exact same keys, but it does not support all properties and some have a slightly different meaning. In the following only the differences are described. For everything else please refer to the linked section of the documentation of the EntityLinking engine.
enhancer.engines.linking.solrfst.typeField. See the [Additional Entity Information] section for details.
fise:confidence values less than 0.5.
In addition the following properties are IGNORED as they are not relevant for the FST Linking Engine:
The generate
The FST Model
Making existing Entityhub SolrYard indexes Compatible with FST linking:
Build process and Testing related:
Feature related
Other
As the first version of the FST Linking Engine is still in active development, there are some known issues: