| Apache OpenNLP ${pom.version} |
| =============================== |
| |
| |
| Building from the Source Distribution |
| ------------------------------------- |
| |
| At least Maven 3.0.0 is required for building. |
| |
| To build everything go into the opennlp directory and run the following command: |
| mvn clean install |
| |
| The results of the build will be placed in: |
| opennlp-distr/target/apache-opennlp-[version]-bin.tar-gz (or .zip) |
| |
| What is in Similarity component in Apache OpenNLP ${pom.version} |
| --------------------------------------- |
| SIMILARITY COMPONENT of OpenNLP |
| |
| 1. Introduction |
| This component does text relevance assessment. It takes two portions of texts (phrases, sentences, paragraphs) and returns a similarity score. |
| Similarity component can be used on top of search to improve relevance, computing similarity score between a question and all search results (snippets). |
| Also, this component is useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. Such applications as content generation |
| and filtering meaningless speech recognition results are included in the sample applications of this component. |
| Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree). |
| The similarity score is calculated as the size of all maximal common subtrees for sentences from a pair of texts ( |
| www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018, |
| www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448). |
| The objective of Similarity component is to give an application engineer as tool for text relevance which can be used as a black box, no need to understand |
| computational linguistics or machine learning. |
| |
| 2. Installation |
| Please refer to OpenNLP installation instructions |
| |
| 3. First use case of Similarity component: search |
| |
| To start with this component, please refer to SearchResultsProcessorTest.java in package opennlp.tools.similarity.apps |
| public void testSearchOrder() runs web search using Bing API and improves search relevance. |
| Look at the code of |
| public List<HitBase> runSearch(String query) |
| and then at |
| private BingResponse calculateMatchScoreResortHits(BingResponse resp, String searchQuery) |
| which gets search results from Bing and re-ranks them based on computed similarity score. |
| |
| The main entry to Similarity component is |
| SentencePairMatchResult matchRes = sm.assessRelevance(snapshot, searchQuery); |
| where we pass the search query and the snapshot and obtain the similarity assessment structure which includes the similarity score. |
| |
| To run this test you need to obtain search API key from Bing at www.bing.com/developers/s/APIBasics.html and specify it in public class BingQueryRunner in |
| protected static final String APP_ID. |
| |
| 4. Solving a unique problem: content generation |
| To demonstrate the usability of Similarity component to tackle a problem which is hard to solve without a linguistic-based technology, |
| we introduce a content generation component: |
| RelatedSentenceFinder.java |
| |
| The entry point here is the function call |
| hits = f.generateContentAbout("Albert Einstein"); |
| which writes a biography of Albert Einstein by finding sentences on the web about various kinds of his activities (such as 'born', 'graduate', 'invented' etc.). |
| The key here is to compute similarity between the seed expression like "Albert Einstein invented relativity theory" and search result like |
| "Albert Einstein College of Medicine | Medical Education | Biomedical ... |
| www.einstein.yu.edu/Albert Einstein College of Medicine is one of the nation's premier institutions for medical education, ..." |
| and filter out irrelevant search results. |
| |
| This is done in function |
| public HitBase augmentWithMinedSentencesAndVerifyRelevance(HitBase item, String originalSentence, |
| List<String> sentsAll) |
| |
| SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " + title, originalSentence); |
| You can consult the results in gen.txt, where an essay on Einstein bio is written. |
| |
| These are examples of generated articles, given the article title |
| http://www.allvoices.com/contributed-news/9423860/content/81937916-ichie-sings-jazz-blues-contemporary-tunes |
| http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area |
| |
| 5. Solving a high-importance problem: filtering out meaningless speech recognition results. |
| Speech recognitions SDKs usually produce a number of phrases as results, such as |
| "remember to buy milk tomorrow from trader joes", |
| "remember to buy milk tomorrow from 3 to jones" |
| One can see that the former is meaningful, and the latter is meaningless (although similar in terms of how it is pronounced). |
| We use web mining and Similarity component to detect a meaningful option (a mistake caused by trying to interpret meaningless |
| request by a query understanding system such as Siri for iPhone can be costly). |
| |
| SpeechRecognitionResultsProcessor.java does the job: |
| public List<SentenceMeaningfullnessScore> runSearchAndScoreMeaningfulness(List<String> sents) |
| re-ranks the phrases in the order of decrease of meaningfulness. |
| |
| 6. Similarity component internals |
| in the package opennlp.tools.textsimilarity.chunker2matcher |
| ParserChunker2MatcherProcessor.java does parsing of two portions of text and matching the resultant parse trees to assess similarity between |
| these portions of text. |
| To run ParserChunker2MatcherProcessor |
| private static String MODEL_DIR = "resources/models"; |
| needs to be specified |
| |
| The key function |
| public SentencePairMatchResult assessRelevance(String para1, String para2) |
| takes two portions of text and does similarity assessment by finding the set of all maximum common subtrees |
| of the set of parse trees for each portion of text |
| |
| It splits paragraphs into sentences, parses them, obtained chunking information and produces grouped phrases (noun, evrn, prepositional etc.): |
| public synchronized List<List<ParseTreeChunk>> formGroupedPhrasesFromChunksForPara(String para) |
| |
| and then attempts to find common subtrees: |
| in ParseTreeMatcherDeterministic.java |
| List<List<ParseTreeChunk>> res = md.matchTwoSentencesGroupedChunksDeterministic(sent1GrpLst, sent2GrpLst) |
| |
| Phrase matching functionality is in package opennlp.tools.textsimilarity; |
| ParseTreeMatcherDeterministic.java: |
| Here's the key matching function which takes two phrases, aligns them and finds a set of maximum common sub-phrase |
| public List<ParseTreeChunk> generalizeTwoGroupedPhrasesDeterministic |
| |
| 7. Package structure |
| opennlp.tools.similarity.apps : 3 main applications |
| opennlp.tools.similarity.apps.utils: utilities for above applications |
| |
| opennlp.tools.textsimilarity.chunker2matcher: parser which converts text into a form for matching parse trees |
| opennlp.tools.textsimilarity: parse tree matching functionality |
| |
| |
| |
| |
| Requirements |
| ------------ |
| Java 1.5 is required to run OpenNLP |
| Maven 3.0.0 is required for building it |
| |
| Known OSGi Issues |
| ------------ |
| In an OSGi environment the following things are not supported: |
| - The coreference resolution component |
| - The ability to load a user provided feature generator class |
| |
| Note |
| ---- |
| The current API contains still many deprecated methods, these |
| will be removed in one of our next releases, please |
| migrate to our new API. |