| Apache OpenNLP ${pom.version} | |
| =============================== | |
| Building from the Source Distribution | |
| ------------------------------------- | |
| At least Maven 3.0.0 is required for building. | |
| To build everything go into the opennlp directory and run the following command: | |
| mvn clean install | |
| The results of the build will be placed in: | |
| opennlp-distr/target/apache-opennlp-[version]-bin.tar.gz (or .zip) | |
| What is in the Similarity component of Apache OpenNLP ${pom.version} | |
| --------------------------------------- | |
| SIMILARITY COMPONENT of OpenNLP | |
| 1. Introduction | |
| This component does text relevance assessment. It takes two portions of text (phrases, sentences, paragraphs) and returns a similarity score. | |
| The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and each search result (snippet). | |
| This component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. The sample applications of this component | |
| include content generation and filtering of meaningless speech recognition results. | |
| Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree). | |
| The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts ( | |
| www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018, | |
| www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448). | |
| The objective of the Similarity component is to give an application engineer a tool for text relevance which can be used as a black box, with no need to understand | |
| computational linguistics or machine learning. | |
| 2. Installation | |
| Please refer to the OpenNLP installation instructions. | |
| 3. First use case of Similarity component: search | |
| To start with this component, please refer to SearchResultsProcessorTest.java in package opennlp.tools.similarity.apps | |
| public void testSearchOrder() runs a web search using the Bing API and improves search relevance. | |
| Look at the code of | |
| public List<HitBase> runSearch(String query) | |
| and then at | |
| private BingResponse calculateMatchScoreResortHits(BingResponse resp, String searchQuery) | |
| which gets search results from Bing and re-ranks them based on the computed similarity score. | |
| The main entry point to the Similarity component is | |
| SentencePairMatchResult matchRes = sm.assessRelevance(snapshot, searchQuery); | |
| where we pass the search query and the snapshot and obtain the similarity assessment structure, which includes the similarity score. | |
| To run this test you need to obtain a search API key from Bing at www.bing.com/developers/s/APIBasics.html and specify it in public class BingQueryRunner in | |
| protected static final String APP_ID. | |
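The re-ranking step described in this section can be sketched as follows. This is a minimal illustration and not the component's own code: the class RerankSketch, the Scorer interface, and the wordOverlap scorer are hypothetical stand-ins (in the real application the score would come from sm.assessRelevance), and the sketch assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RerankSketch {

    /** Hypothetical stand-in for a relevance scorer such as assessRelevance. */
    public interface Scorer {
        double score(String snippet, String query);
    }

    /** Toy scorer (assumption): number of distinct lower-cased words shared by both texts. */
    public static double wordOverlap(String a, String b) {
        Set<String> left = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> right = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        left.retainAll(right);   // keep only the words present in both texts
        left.remove("");         // split() may yield an empty leading token
        return left.size();
    }

    /** Returns a copy of the snippets ordered by decreasing score against the query. */
    public static List<String> rerank(List<String> snippets, String query, Scorer scorer) {
        List<String> ranked = new ArrayList<>(snippets);
        ranked.sort(Comparator.comparingDouble((String s) -> scorer.score(s, query)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList(
            "Albert Einstein College of Medicine",
            "Albert Einstein invented relativity theory in 1905");
        for (String s : rerank(snippets, "who invented relativity theory", RerankSketch::wordOverlap)) {
            System.out.println(s);
        }
    }
}
```

Swapping the toy scorer for a call into the Similarity component would keep the sorting logic unchanged; only the Scorer implementation differs.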
| 4. Solving a unique problem: content generation | |
| To demonstrate how the Similarity component can tackle a problem which is hard to solve without a linguistics-based technology, | |
| we introduce a content generation component: | |
| RelatedSentenceFinder.java | |
| The entry point here is the function call | |
| hits = f.generateContentAbout("Albert Einstein"); | |
| which writes a biography of Albert Einstein by finding sentences on the web about various kinds of his activities (such as 'born', 'graduate', 'invented' etc.). | |
| The key here is to compute similarity between a seed expression like "Albert Einstein invented relativity theory" and a search result like | |
| "Albert Einstein College of Medicine | Medical Education | Biomedical ... | |
| www.einstein.yu.edu/Albert Einstein College of Medicine is one of the nation's premier institutions for medical education, ..." | |
| and filter out irrelevant search results. | |
| This is done in the function | |
| public HitBase augmentWithMinedSentencesAndVerifyRelevance(HitBase item, String originalSentence, List<String> sentsAll) | |
| which calls the Similarity component: | |
| SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " + title, originalSentence); | |
| You can consult the results in gen.txt, which contains a generated essay on Einstein's biography. | |
| These are examples of generated articles, given the article title: | |
| http://www.allvoices.com/contributed-news/9423860/content/81937916-ichie-sings-jazz-blues-contemporary-tunes | |
| http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area | |
| 5. Solving a high-importance problem: filtering out meaningless speech recognition results. | |
| Speech recognition SDKs usually produce a number of phrases as results, such as | |
| "remember to buy milk tomorrow from trader joes", | |
| "remember to buy milk tomorrow from 3 to jones" | |
| One can see that the former is meaningful, and the latter is meaningless (although similar in terms of how it is pronounced). | |
| We use web mining and the Similarity component to detect the meaningful option. A mistake caused by a query understanding system | |
| (such as Siri for iPhone) trying to interpret a meaningless request can be costly. | |
| SpeechRecognitionResultsProcessor.java does the job: | |
| public List<SentenceMeaningfullnessScore> runSearchAndScoreMeaningfulness(List<String> sents) | |
| re-ranks the phrases in order of decreasing meaningfulness. | |
| 6. Similarity component internals | |
| In the package opennlp.tools.textsimilarity.chunker2matcher, | |
| ParserChunker2MatcherProcessor.java parses two portions of text and matches the resulting parse trees to assess similarity between | |
| these portions of text. | |
| To run ParserChunker2MatcherProcessor, | |
| private static String MODEL_DIR = "resources/models"; | |
| needs to be specified. | |
| The key function | |
| public SentencePairMatchResult assessRelevance(String para1, String para2) | |
| takes two portions of text and does similarity assessment by finding the set of all maximal common subtrees | |
| of the sets of parse trees for each portion of text. | |
| It splits paragraphs into sentences, parses them, obtains chunking information and produces grouped phrases (noun, verb, prepositional etc.): | |
| public synchronized List<List<ParseTreeChunk>> formGroupedPhrasesFromChunksForPara(String para) | |
| and then attempts to find common subtrees: | |
| in ParseTreeMatcherDeterministic.java | |
| List<List<ParseTreeChunk>> res = md.matchTwoSentencesGroupedChunksDeterministic(sent1GrpLst, sent2GrpLst) | |
| Phrase matching functionality is in package opennlp.tools.textsimilarity, in | |
| ParseTreeMatcherDeterministic.java. | |
| The key matching function takes two phrases, aligns them and finds the set of maximal common sub-phrases: | |
| public List<ParseTreeChunk> generalizeTwoGroupedPhrasesDeterministic | |
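To make the idea of aligning two phrases and keeping their common part concrete, here is a simplified sketch. It is not the algorithm in ParseTreeMatcherDeterministic.java: it swaps in a plain longest-common-subsequence alignment over part-of-speech tags, and the class name, the "word/POS" token format, and the "*" placeholder for differing words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommonSubPhraseSketch {

    // Tokens are assumed to be well-formed "word/POS" strings, e.g. "camera/NN".
    private static String word(String token) {
        return token.substring(0, token.lastIndexOf('/'));
    }

    private static String pos(String token) {
        return token.substring(token.lastIndexOf('/') + 1);
    }

    /**
     * Aligns two phrases by POS tag (longest common subsequence) and returns the
     * aligned tokens. A word is kept when both phrases agree on it, otherwise it
     * is replaced by "*".
     */
    public static List<String> generalize(List<String> p1, List<String> p2) {
        int n = p1.size();
        int m = p2.size();
        // len[i][j] = LCS length of p1[i..] and p2[j..], compared by POS tag
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                if (pos(p1.get(i)).equals(pos(p2.get(j)))) {
                    len[i][j] = len[i + 1][j + 1] + 1;
                } else {
                    len[i][j] = Math.max(len[i + 1][j], len[i][j + 1]);
                }
            }
        }
        // Walk the table from the start to recover one optimal alignment.
        List<String> common = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            String a = p1.get(i);
            String b = p2.get(j);
            if (pos(a).equals(pos(b))) {
                String w = word(a).equalsIgnoreCase(word(b)) ? word(a) : "*";
                common.add(w + "/" + pos(a));
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        List<String> p1 = Arrays.asList("digital/JJ", "camera/NN", "with/IN", "zoom/NN");
        List<String> p2 = Arrays.asList("compact/JJ", "camera/NN", "for/IN", "beginners/NNS");
        System.out.println(generalize(p1, p2)); // prints [*/JJ, camera/NN, */IN]
    }
}
```

The length of the returned list gives a crude similarity signal for a pair of phrases; the real component works on grouped parse tree chunks rather than flat token lists.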
| 7. Package structure | |
| opennlp.tools.similarity.apps : 3 main applications | |
| opennlp.tools.similarity.apps.utils: utilities for above applications | |
| opennlp.tools.textsimilarity.chunker2matcher: parser which converts text into a form for matching parse trees | |
| opennlp.tools.textsimilarity: parse tree matching functionality | |
| Requirements | |
| ------------ | |
| Java 1.5 is required to run OpenNLP. | |
| Maven 3.0.0 is required for building it. | |
| Known OSGi Issues | |
| ----------------- | |
| In an OSGi environment the following things are not supported: | |
| - The coreference resolution component | |
| - The ability to load a user provided feature generator class | |
| Note | |
| ---- | |
| The current API still contains many deprecated methods. These | |
| will be removed in one of our next releases; please | |
| migrate to our new API. | |