
OpenNLP.Similarity Component

It is a project under Apache OpenNLP which subjects the results of parsing, part-of-speech tagging and rhetoric parsing to machine learning. It is used in search, content generation & enrichment, chat bots and other text processing domains where relevance assessment is a key task.

What is OpenNLP.Similarity?

OpenNLP.Similarity is an NLP engine which solves a number of text processing and search tasks based on OpenNLP and Stanford NLP parsers. It is designed to be used by a non-linguist software engineer to build linguistically-enabled:

OpenNLP.Similarity provides a series of techniques to support the overall content pipeline, from text collection to cleaning, classification, personalization and distribution. The technology and implementation of the content pipeline developed at eBay is described here.


  1. Do git clone to set up the environment, including resources. Besides what you get from git, the /resources directory requires some additional work:

  2. Download the main jar.

  3. Put all necessary jars in the /lib folder. The larger jars are not on git, so please download them from the Stanford NLP site.

  4. Set up the src/test/resources directory:
  • needs to be unzipped
  • OpenNLP models need to be downloaded into the directory ‘models’ from here

As a result, the following folders should be in /resources: As obtained from git:

  5. Try running the tests, which will give you a hint on how to integrate OpenNLP.Similarity functionality into your application. You can start with the Matcher test and observe how long paragraphs can be linguistically matched (you can compare this with a plain intersection of keywords).

  6. Look at the example POMs to see how to better integrate it into your existing project.

Creating a simple project

Create a project from

Engines and Systems of OpenNLP.Similarity

Main relevance assessment function

It takes two texts and returns the cardinality of the maximum common subgraph of the representations of these texts. This measure is supposed to be much more accurate than keyword statistics, compositional semantic models, or word2vec because linguistic structure is taken into account, not just co-occurrences of keywords. Matching class in the matching package has

List<List<ParseTreeChunk>> assessRelevance(String para1, String para2)

function, which returns the list of common phrases between these paragraphs.

To avoid re-parsing the same strings and improve the speed, use

List<List<ParseTreeChunk>> assessRelevanceCache(String para1, String para2)

It operates on the level of sentences (giving maximal common subtree) and paragraphs (giving maximal common sub-parse thicket). Maximal common sub-parse thicket is also represented as a list of common phrases.
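The caching idea behind assessRelevanceCache can be sketched in plain Java as memoization keyed on the pair of input strings. This is only an illustration under stated assumptions, not the library's implementation: RelevanceCache and the function passed to it are hypothetical names, and the real method returns List<List<ParseTreeChunk>> rather than an arbitrary result type.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical helper, not part of OpenNLP.Similarity: memoizes an expensive
// two-argument text function so each pair of strings is processed only once.
final class RelevanceCache<R> {
    private final Map<String, R> cache = new HashMap<>();
    private final BiFunction<String, String, R> expensive;
    private int misses = 0;

    RelevanceCache(BiFunction<String, String, R> expensive) {
        this.expensive = expensive;
    }

    R assess(String para1, String para2) {
        // The length prefix keeps ("ab", "c") and ("a", "bc") from sharing a key.
        String key = para1.length() + ":" + para1 + para2;
        R cached = cache.get(key);
        if (cached == null) {
            // Cache miss: run the expensive function and remember its result.
            cached = expensive.apply(para1, para2);
            cache.put(key, cached);
            misses++;
        }
        return cached;
    }

    // Number of times the expensive function was actually invoked.
    int misses() {
        return misses;
    }
}
```

A cache like this pays off when the same question is matched against many answers, or when the same answers are matched repeatedly, because parsing dominates the cost.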

Search engine

The following set of functionalities is available to enable search with linguistic features. It is desirable when the query is long (more than 4 keywords), logically complex, ambiguous or

SOLR request handlers are available here

The taxonomy builder is here. Examples of pre-built taxonomies are available in this directory. Please pay attention to the taxonomies built for languages other than English. A music taxonomy is an example of the seed data for taxonomy building, and this taxonomy hashmap dump is a good example of what can be automatically constructed. A paper on taxonomy learning is here.

Search results re-ranker

Re-ranking scores the similarity between each answer in a given orderedListOfAnswers and the question:

List<Pair<String,Double>> pairList = new ArrayList<Pair<String,Double>>();

for (String ans: orderedListOfAnswers) {
        // find the phrases common to the question and this answer (cached to avoid re-parsing)
        List<List<ParseTreeChunk>> similarityResult = m.assessRelevanceCache(question, ans);
        // turn the common phrases into a single relevance score
        double score = parseTreeChunkListScorer.getParseTreeChunkListScoreAggregPhraseType(similarityResult);
        Pair<String,Double> p = new Pair<String, Double>(ans, score);
        pairList.add(p);
}

// sort by score in ascending order (lowest score first)
Collections.sort(pairList, Comparator.comparing(p -> p.getSecond()));

pairList is then ranked according to the linguistic relevance score. This score can be combined with other sources such as popularity, geo-proximity and others.
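One way to combine the linguistic score with another signal is a weighted sum followed by a descending sort. The sketch below is an assumption made for illustration and not code from the project: CombinedRanker, Candidate, the popularity field and the weighting scheme are all hypothetical, and both inputs are assumed to be scaled to comparable ranges.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: blends a linguistic relevance score with a popularity
// signal and orders candidates so the highest combined score comes first.
final class CombinedRanker {
    static final class Candidate {
        final String answer;
        final double linguisticScore; // e.g. the output of the chunk-list scorer
        final double popularity;      // assumed to be scaled like linguisticScore

        Candidate(String answer, double linguisticScore, double popularity) {
            this.answer = answer;
            this.linguisticScore = linguisticScore;
            this.popularity = popularity;
        }
    }

    // linguisticWeight is expected to lie in [0, 1]; the rest goes to popularity.
    static double combined(Candidate c, double linguisticWeight) {
        return linguisticWeight * c.linguisticScore
                + (1.0 - linguisticWeight) * c.popularity;
    }

    static List<String> rank(List<Candidate> candidates, double linguisticWeight) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator
                .comparingDouble((Candidate c) -> combined(c, linguisticWeight))
                .reversed());
        List<String> answers = new ArrayList<>();
        for (Candidate c : sorted) {
            answers.add(c.answer);
        }
        return answers;
    }
}
```

A linear blend is the simplest choice; the weight would normally be tuned on queries with known good answers.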

Content generator

It takes a topic, builds a taxonomy for it and forms a table of contents. It then mines the web for documents for each table of contents item, finds relevant sentences and paragraphs and merges them into a document package. The resultant document has a TOC, sections, figures & captions and also a reference section. We attempt to reproduce how humans cut-and-paste content from the web while writing on a topic. Content generation has a demo and to run it from IDE start here. Examples of written documents are here.

Another content generation option is about opinion data. Reviews are mined, cross-bred and made “original” for search engines. This and general content generation is done for SEO purposes. The review builder composes fake reviews, which in turn should be recognized by a Fake Review detector.

Text classifier / feature detector in text

The classifier code is the same but the model files vary for the applications below:

Document classification into six major classes {finance, business, legal, computing, engineering, health} is available via a nearest neighbor model. A Lucene training model (1G file) is obtained from a Wikipedia corpus. This classifier can be trained for arbitrary classes once the respective Wiki pages are selected and the respective Lucene index is built. Once proper training documents are selected from Wikipedia with adequate coverage, the accuracy is usually higher than can be achieved by word2vec classification models.
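The nearest neighbor idea can be shown with a toy in-memory classifier. This sketch is an assumption made for illustration: it replaces the Lucene index with a map of labeled texts and uses Jaccard similarity over word sets, which is not necessarily the similarity the actual classifier relies on. NearestNeighborSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy 1-nearest-neighbor text classifier (hypothetical, in-memory only).
final class NearestNeighborSketch {
    // Training text mapped to its class label.
    private final Map<String, String> labeledDocs = new LinkedHashMap<>();

    void train(String text, String label) {
        labeledDocs.put(text, label);
    }

    // Lower-cases the text and splits it on runs of non-word characters.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
    }

    // Jaccard similarity: size of the intersection over size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Returns the label of the most similar training text, or null if untrained.
    String classify(String text) {
        Set<String> query = tokens(text);
        String bestLabel = null;
        double bestSimilarity = -1.0;
        for (Map.Entry<String, String> doc : labeledDocs.entrySet()) {
            double similarity = jaccard(query, tokens(doc.getKey()));
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                bestLabel = doc.getValue();
            }
        }
        return bestLabel;
    }
}
```

Training for a new class then amounts to adding labeled documents for it, which mirrors selecting Wiki pages and rebuilding the index.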

A general-purpose deterministic inductive learner implements J.S. Mill's methods of induction and abduction (deduction is also partially implemented).

Inductive learning implemented as a base for syntactic tree-based learning is similar to the family of approaches such as Explanation-based Learning and Inductive Logic Programming.

Tree-kernel learning

Tree-kernel learning is integrated to allow the application of SVM learning to sentence-level and paragraph-level linguistic data, including discourse. Unlike learning in a numerical space, each dimension in tree-kernel learning is an occurrence of a particular subtree. Similarity is not a numerical distance but a count of common subtrees. A set of parse trees for the individual sentences of a paragraph is called a parse thicket. Its graph representation is encoded as a tree using parentheses, as in model*.txt. To do model building and predictions, C modules are run in this directory, so the proper choice needs to be made among {svm_classify.linux, svm_classify.max, svm_classify.exe, svm_learn.*}. Also, proper run permissions need to be set for these files.
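To make "a count of common subtrees" concrete, here is a small sketch that counts the complete subtrees shared by two parenthesized parse trees. It is a simplification under stated assumptions: tree kernels generally also count partial tree fragments, the C modules mentioned above are not involved, the two inputs are assumed to use the same bracketing and spacing conventions, and SubtreeCountSketch is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: similarity as the number of shared complete subtrees.
final class SubtreeCountSketch {
    // Maps every bracketed subtree, e.g. "(NP (DT the) (NN cat))", to its frequency.
    static Map<String, Integer> subtrees(String tree) {
        String t = tree.trim().replaceAll("\\s+", " ");
        Map<String, Integer> counts = new HashMap<>();
        Deque<Integer> openPositions = new ArrayDeque<>();
        for (int i = 0; i < t.length(); i++) {
            char ch = t.charAt(i);
            if (ch == '(') {
                openPositions.push(i);
            } else if (ch == ')') {
                if (openPositions.isEmpty()) {
                    throw new IllegalArgumentException("unbalanced ')' at " + i);
                }
                // The text between matching brackets is one complete subtree.
                counts.merge(t.substring(openPositions.pop(), i + 1), 1, Integer::sum);
            }
        }
        if (!openPositions.isEmpty()) {
            throw new IllegalArgumentException("unbalanced '('");
        }
        return counts;
    }

    // Sums, over the subtrees present in both trees, the product of their frequencies.
    static int commonSubtrees(String a, String b) {
        Map<String, Integer> countsA = subtrees(a);
        Map<String, Integer> countsB = subtrees(b);
        int total = 0;
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            total += e.getValue() * countsB.getOrDefault(e.getKey(), 0);
        }
        return total;
    }
}
```

Two sentences that share a noun phrase but differ in the verb still share the noun phrase subtree and everything beneath it, which is exactly what a keyword comparison cannot express.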

Concept learning

Concept learning is a branch of deterministic learning which is applied to attribute-value pairs and, unlike statistical and deep learning, offers useful explainability. It is fairly useful for data exploration and visualization since all interesting relations can be visualized. Concept learning covers inductive and abductive learning and also some cases of deduction. Explore this package for the concept learning-related features.

Filtering results for Speech Recognition based on semantic meaningfulness

It takes results from a speech-to-text system and subjects them to filtering. Those recognized candidate words which do not make sense together are filtered out, based on the frequency of co-occurrences found on the web.
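The filtering principle can be sketched as follows. Everything here is an assumption made for illustration: the co-occurrence counts are passed in as a plain map standing in for web frequencies, a candidate is scored by its weakest pair of adjacent words, and CooccurrenceFilterSketch, pairKey, weakestLink and filter are hypothetical names rather than the component's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: drops recognition candidates whose adjacent words
// rarely occur together according to a supplied co-occurrence table.
final class CooccurrenceFilterSketch {
    // Order-independent, lower-cased key for a pair of words.
    static String pairKey(String w1, String w2) {
        String a = w1.toLowerCase();
        String b = w2.toLowerCase();
        return a.compareTo(b) <= 0 ? a + " " + b : b + " " + a;
    }

    // Smallest co-occurrence count over all adjacent word pairs of the candidate.
    // A single-word candidate has no pairs and yields Long.MAX_VALUE.
    static long weakestLink(String candidate, Map<String, Long> counts) {
        String[] words = candidate.trim().split("\\s+");
        long min = Long.MAX_VALUE;
        for (int i = 0; i + 1 < words.length; i++) {
            long count = counts.getOrDefault(pairKey(words[i], words[i + 1]), 0L);
            min = Math.min(min, count);
        }
        return min;
    }

    // Keeps candidates whose weakest adjacent pair occurs at least minCount times.
    static List<String> filter(List<String> candidates, Map<String, Long> counts, long minCount) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (weakestLink(candidate, counts) >= minCount) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Scoring by the weakest pair means one implausible word combination is enough to reject a candidate; averaging over pairs would be a more lenient alternative.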

Related Research

Here's the link to the book on question-answering

and research papers.

Also the recent book related to reasoning and linguistics in humans & machines

Configuring OpenNLP.Similarity component

The VerbNet model is included by default, so that the hand-coded meanings of verbs are used when similarity between verb phrases is computed.

To include the word2vec model, download it and make sure the following path is valid: resourceDir + "/w2v/GoogleNews-vectors-negative300.bin.gz"
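A quick way to confirm that the path is valid before starting the component is to check for the file directly. This is a minimal sketch, assuming resourceDir is the same directory string the component is configured with; Word2VecPathCheck is a hypothetical helper and not part of OpenNLP.Similarity.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: checks that the word2vec model sits where it is expected.
final class Word2VecPathCheck {
    // Builds the expected model location from the resource directory.
    static Path modelPath(String resourceDir) {
        return Paths.get(resourceDir + "/w2v/GoogleNews-vectors-negative300.bin.gz");
    }

    // True only if the path exists and points to a regular file.
    static boolean modelAvailable(String resourceDir) {
        return Files.isRegularFile(modelPath(resourceDir));
    }
}
```

Calling modelAvailable at startup and logging the result of modelPath makes a misplaced download easy to spot.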