OpenNLP OPENNLP-497 create maven script, release notes

commit: ec9aa616c0c2ec779a032b9befe832ca5ce561a7 [log] [tgz]
author: Boris Galitsky <bgalitsky@apache.org> Mon Apr 16 10:57:16 2012 +0000
committer: Boris Galitsky <bgalitsky@apache.org> Mon Apr 16 10:57:16 2012 +0000
tree: b1a4edcb0382ec7d4f352615890edb6f04b581f2
parent: 20d048b22fbb3f7ad4dbc84e653afb0a60357842 [diff]
diff --git a/opennlp-similarity/README b/opennlp-similarity/README
new file mode 100644
index 0000000..b535487
--- /dev/null
+++ b/opennlp-similarity/README

@@ -0,0 +1,138 @@
+Apache OpenNLP ${pom.version}

+===============================

+

+

+Building from the Source Distribution

+-------------------------------------

+

+At least Maven 3.0.0 is required for building.

+

+To build everything go into the opennlp directory and run the following command:

+    mvn clean install

+   

+The results of the build will be placed in:

+    opennlp-distr/target/apache-opennlp-[version]-bin.tar.gz (or .zip)

+

+What is in Similarity component in Apache OpenNLP ${pom.version}

+---------------------------------------

+SIMILARITY COMPONENT of OpenNLP

+

+1. Introduction

+This component does text relevance assessment. It takes two portions of text (phrases, sentences, paragraphs) and returns a similarity score.

+The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and each search result (snippet).

+This component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. The sample applications

+of this component include content generation and filtering of meaningless speech recognition results.

+   Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree). 

+The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts (

+www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018,

+www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448).

+   The objective of the Similarity component is to give an application engineer a tool for text relevance which can be used as a black box, with no need to understand

+ computational linguistics or machine learning.

+ 

+ 2. Installation

+ Please refer to OpenNLP installation instructions

+ 

+ 3. First use case of Similarity component: search

+ 

+ To start with this component, please refer to SearchResultsProcessorTest.java in package opennlp.tools.similarity.apps

+   public void testSearchOrder() runs web search using Bing API and improves search relevance.

+   Look at the code of 

+      public List<HitBase> runSearch(String query) 

+   and then at 

+      private	BingResponse calculateMatchScoreResortHits(BingResponse resp, String searchQuery)

+   which gets search results from Bing and re-ranks them based on computed similarity score.

+ 

+   The main entry to Similarity component is 

+    SentencePairMatchResult matchRes = sm.assessRelevance(snapshot, searchQuery);

+    where we pass the search query and the snapshot and obtain the similarity assessment structure which includes the similarity score.

+   

+   To run this test you need to obtain a search API key from Bing at www.bing.com/developers/s/APIBasics.html and specify it in public class BingQueryRunner in

+  protected static final String APP_ID. 

+  

+  4. Solving a unique problem: content generation

+  To demonstrate the usability of Similarity component to tackle a problem which is hard to solve without a linguistic-based technology, 

+  we introduce a content generation component:

+   RelatedSentenceFinder.java

+   

+   The entry point here is the function call

+   hits = f.generateContentAbout("Albert Einstein");

+   which writes a biography of Albert Einstein by finding sentences on the web about various kinds of his activities (such as 'born', 'graduate', 'invented' etc.).

+   The key here is to compute similarity between the seed expression like "Albert Einstein invented relativity theory" and search result like 

+   "Albert Einstein College of Medicine | Medical Education | Biomedical ...

+    www.einstein.yu.edu/Albert Einstein College of Medicine is one of the nation's premier institutions for medical education, ..."

+    and filter out irrelevant search results.

+   

+   This is done in function 

+   public HitBase augmentWithMinedSentencesAndVerifyRelevance(HitBase item, String originalSentence,

+			List<String> sentsAll)

+			

+   	  SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " + title, originalSentence);

+   You can consult the results in gen.txt, where an essay on Einstein bio is written.

+   

+   These are examples of generated articles, given the article title

+     http://www.allvoices.com/contributed-news/9423860/content/81937916-ichie-sings-jazz-blues-contemporary-tunes

+     http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area

+     

+  5. Solving a high-importance problem: filtering out meaningless speech recognition results.

+  Speech recognition SDKs usually produce a number of phrases as results, such as

+  			 "remember to buy milk tomorrow from trader joes",

+			 "remember to buy milk tomorrow from 3 to jones"

+  One can see that the former is meaningful, and the latter is meaningless (although similar in terms of how it is pronounced).

+  We use web mining and the Similarity component to detect the meaningful option (a mistake caused by a query understanding system,

+  such as Siri for iPhone, trying to interpret a meaningless request can be costly).

+ 

+  SpeechRecognitionResultsProcessor.java does the job:

+  public List<SentenceMeaningfullnessScore> runSearchAndScoreMeaningfulness(List<String> sents)

+  re-ranks the phrases in order of decreasing meaningfulness.

+  

+  6. Similarity component internals

+  in the package   opennlp.tools.textsimilarity.chunker2matcher

+  ParserChunker2MatcherProcessor.java does parsing of two portions of text and matching the resultant parse trees to assess similarity between 

+  these portions of text.

+  To run ParserChunker2MatcherProcessor

+     private static String MODEL_DIR = "resources/models";

+  needs to be specified

+  

+  The key function

+  public SentencePairMatchResult assessRelevance(String para1, String para2)

+  takes two portions of text and does similarity assessment by finding the set of all maximal common subtrees

+  of the sets of parse trees for each portion of text.

+  

+  It splits paragraphs into sentences, parses them, obtains chunking information and produces grouped phrases (noun, verb, prepositional etc.):

+  public synchronized List<List<ParseTreeChunk>> formGroupedPhrasesFromChunksForPara(String para)

+  

+  and then attempts to find common subtrees:

+  in ParseTreeMatcherDeterministic.java

+		List<List<ParseTreeChunk>> res = md.matchTwoSentencesGroupedChunksDeterministic(sent1GrpLst, sent2GrpLst)

+  

+  Phrase matching functionality is in package opennlp.tools.textsimilarity;

+  ParseTreeMatcherDeterministic.java:

+  Here's the key matching function, which takes two phrases, aligns them and finds the set of maximal common sub-phrases:

+  public List<ParseTreeChunk> generalizeTwoGroupedPhrasesDeterministic

+  

+  7. Package structure

+  	opennlp.tools.similarity.apps : 3 main applications

+	opennlp.tools.similarity.apps.utils: utilities for above applications

+	

+	opennlp.tools.textsimilarity.chunker2matcher: parser which converts text into a form for matching parse trees

+	opennlp.tools.textsimilarity: parse tree matching functionality

+	

+

+

+

+Requirements

+------------

+Java 1.5 is required to run OpenNLP

+Maven 3.0.0 is required for building it

+

+Known OSGi Issues

+-----------------

+In an OSGi environment the following things are not supported:

+- The coreference resolution component

+- The ability to load a user provided feature generator class

+

+Note

+----

+The current API still contains many deprecated methods. These

+will be removed in one of our next releases; please

+migrate to our new API.
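
Section 3 of the README added above describes re-ranking search hits by a similarity score computed between the query and each snippet. The loop can be sketched roughly as follows. This is only an illustration under stated assumptions: `toyScore` is a stand-in that counts the longest run of consecutive shared words and is not the parse-tree matcher behind `assessRelevance`, and the names `RerankSketch`, `Hit` and `rerank` are hypothetical and do not appear in opennlp-similarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RerankSketch {

  /** Hypothetical container for one scored hit; not the HitBase class named in the README. */
  static final class Hit {
    final String snippet;
    final int score;

    Hit(String snippet, int score) {
      this.snippet = snippet;
      this.score = score;
    }
  }

  /**
   * Toy stand-in for a similarity score: the length, in words, of the longest
   * run of consecutive words that occurs in both texts (case-insensitive).
   * The real component compares syntactic parse trees instead.
   */
  static int toyScore(String a, String b) {
    String[] x = a.toLowerCase().split("\\s+");
    String[] y = b.toLowerCase().split("\\s+");
    int[][] run = new int[x.length + 1][y.length + 1];
    int best = 0;
    for (int i = 1; i <= x.length; i++) {
      for (int j = 1; j <= y.length; j++) {
        if (x[i - 1].equals(y[j - 1])) {
          // Extend the common run that ended at the previous word pair.
          run[i][j] = run[i - 1][j - 1] + 1;
          best = Math.max(best, run[i][j]);
        }
      }
    }
    return best;
  }

  /** Scores every snippet against the query and returns the hits, highest score first. */
  static List<Hit> rerank(String query, List<String> snippets) {
    List<Hit> hits = new ArrayList<Hit>();
    for (String s : snippets) {
      hits.add(new Hit(s, toyScore(query, s)));
    }
    Collections.sort(hits, new Comparator<Hit>() {
      public int compare(Hit left, Hit right) {
        return right.score - left.score;
      }
    });
    return hits;
  }

  public static void main(String[] args) {
    List<Hit> hits = rerank(
        "Albert Einstein invented relativity theory",
        Arrays.asList(
            "Albert Einstein College of Medicine is one of the premier institutions",
            "Historians note that Albert Einstein invented relativity theory"));
    for (Hit h : hits) {
      System.out.println(h.score + "\t" + h.snippet);
    }
  }
}
```

The same shape applies to the speech recognition use case in section 5: score each candidate phrase and keep the order produced by the sort. Swapping `toyScore` for a call into the actual matcher would be the only change, assuming its result exposes a numeric score.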


diff --git a/opennlp-similarity/RELEASE_NOTES.html b/opennlp-similarity/RELEASE_NOTES.html
new file mode 100644
index 0000000..7706367
--- /dev/null
+++ b/opennlp-similarity/RELEASE_NOTES.html

@@ -0,0 +1,77 @@
+<!--

+    ***************************************************************

+    * Licensed to the Apache Software Foundation (ASF) under one

+    * or more contributor license agreements.  See the NOTICE file

+    * distributed with this work for additional information

+    * regarding copyright ownership.  The ASF licenses this file

+    * to you under the Apache License, Version 2.0 (the

+    * "License"); you may not use this file except in compliance

+    * with the License.  You may obtain a copy of the License at

+    *

+    *   http://www.apache.org/licenses/LICENSE-2.0

+    * 

+    * Unless required by applicable law or agreed to in writing,

+    * software distributed under the License is distributed on an

+    * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY

+    * KIND, either express or implied.  See the License for the

+    * specific language governing permissions and limitations

+    * under the License.

+    ***************************************************************

+--> 

+

+<html> 

+<head> 

+  <title>Apache OpenNLP ${pom.version} Release Notes</title> 

+</head> 

+<body> 

+<h1>Apache OpenNLP ${pom.version} Release Notes</h1> 

+ 

+<h2>Contents</h2> 

+<p> 

+<a href="#what.is.opennlp">What is Similarity component of Apache OpenNLP?</a><br/> 

+<a href="#major.changes">This Release</a><br/> 

+<a href="#get.involved">How to Get Involved</a><br/> 

+<a href="#report.issues">How to Report Issues</a><br/> 

+<a href="#list.issues">List of JIRA Issues Fixed in this Release</a><br/> 

+</p>  

+   

+<h2><a name="what.is.opennlp">What is Similarity component of Apache OpenNLP?</a></h2>

+<p>

+This component does text relevance assessment. It takes two portions of text (phrases, sentences, paragraphs) and returns a similarity score.

+The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and each search result (snippet).

+This component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. The sample applications

+of this component include content generation and filtering of meaningless speech recognition results.

+   Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree). 

+The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts (

+www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018,

+www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448).

+   The objective of the Similarity component is to give an application engineer a tool for text relevance which can be used as a black box, with no need to understand

+ computational linguistics or machine learning.

+</p>

+

+<h2><a name="major.changes">This Release</a></h2> 

+<p> 

+Please see the <a href="README">README</a> for this information.

+</p> 

+  

+<h2><a name="get.involved">How to Get Involved</a></h2> 

+<p> 

+The Apache OpenNLP project really needs and appreciates any contributions, 

+including documentation help, source code and feedback.  If you are interested

+in contributing, please visit <a href="http://opennlp.apache.org/">http://opennlp.apache.org/</a>

+</p>

+  

+<h2><a name="report.issues">How to Report Issues</a></h2> 

+<p> 

+The Apache OpenNLP project uses JIRA for issue tracking.  Please report any 

+issues you find at 

+<a href="http://issues.apache.org/jira/browse/opennlp">http://issues.apache.org/jira/browse/opennlp</a> 

+</p> 

+  

+<h2><a name="list.issues">List of JIRA Issues Fixed in this Release</a></h2>

+<p>

+Click <a href="issuesFixed/jira-report.html">issuesFixed/jira-report.html</a> for the list of

+issues fixed in this release.

+</p>

+</body> 

+</html>
\ No newline at end of file

diff --git a/opennlp-similarity/pom.xml b/opennlp-similarity/pom.xml
index 4d0bb87..d2fdf4d 100644
--- a/opennlp-similarity/pom.xml
+++ b/opennlp-similarity/pom.xml

@@ -71,15 +71,15 @@
 			<version>0.7</version>
 		</dependency>
 		<dependency>
-			<groupId>xstream.codehaus.org</groupId>
-			<artifactId>xstream</artifactId>
-			<version>1.4.2</version>
-		</dependency>
-		<dependency>
 			<groupId>net.sf.opencsv</groupId>
 			<artifactId>opencsv</artifactId>
 			<version>2.0</version>
 		</dependency>
+		<dependency>
+			<groupId>org.apache.lucene</groupId>
+			<artifactId>lucene-core</artifactId>
+			<version>3.5.0</version>
+		</dependency>
 	</dependencies>
 	
 	<build>

diff --git a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java
index 3c6fc59..3b47619 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java

@@ -28,7 +28,7 @@
 import opennlp.tools.similarity.apps.utils.FileHandler;

 import opennlp.tools.textsimilarity.chunker2matcher.ParserChunker2MatcherProcessor;

 

-import com.thoughtworks.xstream.XStream;

+//import com.thoughtworks.xstream.XStream;

 

 /**

  * This class can be used to generate scores based on the overlapping between a

@@ -106,7 +106,7 @@
    * 

    * @param taxonomyPath

    * @param taxonomyXML_Path

-   * */

+   * 

 

   public void convertDatToXML(String taxonomyXML_Path, TaxonomySerializer taxo) {

     XStream xStream = new XStream();

@@ -128,7 +128,7 @@
     matcher.taxo = (TaxonomySerializer) xStream.fromXML(fileHandler

         .readFromTextFile("src/test/resources/taxo_English.xml"));

   }

-

+*/

   public void close() {

     sm.close();

   }


diff --git a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java
index adb5321..d15cc17 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java

@@ -34,7 +34,7 @@
 import java.util.ArrayList;

 import java.util.Iterator;

 import java.util.List;

-import org.apache.log4j.Logger;

+import java.util.logging.Logger;

 

 /**

  * This class responsible to save data to files as well as read out! It is

@@ -52,7 +52,7 @@
       out.write(data + "\n");

       out.close();

     } catch (IOException e) {

-      LOG.error(e);

+      LOG.severe(e.toString());

       e.printStackTrace();

     }

   }

@@ -99,7 +99,7 @@
       outputStream = new ObjectOutputStream(new FileOutputStream(filepath));

       outputStream.writeObject(obj);

     } catch (IOException e) {

-      LOG.error(e);

+      LOG.severe(e.toString());

     }

   }

 

@@ -114,13 +114,13 @@
       }

     } catch (EOFException ex) { // This exception will be caught when EOF is

                                 // reached

-      LOG.error("End of file reached.", ex);

+      LOG.severe("End of file reached.\n" + ex.toString());

     } catch (ClassNotFoundException ex) {

-      LOG.error(ex);

+      LOG.severe(ex.toString());

     } catch (FileNotFoundException ex) {

-      LOG.error(ex);

+      LOG.severe(ex.toString());

     } catch (IOException ex) {

-      LOG.error(ex);

+      LOG.severe(ex.toString());

     } finally {

       // Close the ObjectInputStream

       try {

@@ -128,7 +128,7 @@
           inputStream.close();

         }

       } catch (IOException ex) {

-        LOG.error(ex);

+        LOG.severe(ex.toString());

       }

     }

     return null;

@@ -187,7 +187,7 @@
         input.close();

       }

     } catch (IOException ex) {

-      LOG.error("fileName: " + filePath, ex);

+      LOG.severe("fileName: " + filePath +"\n " + ex);

     }

     return contents.toString();

   }

@@ -224,7 +224,7 @@
         input.close();

       }

     } catch (IOException ex) {

-      LOG.error(ex);

+      LOG.severe(ex.toString());

     }

     return lines;

   }

@@ -244,7 +244,7 @@
       try {

         file.mkdirs();

       } catch (Exception e) {

-        LOG.error("Directory already exists or the file-system is read only", e);

+        LOG.severe("Directory already exists or the file-system is read only");

       }

     }

   }

@@ -300,7 +300,7 @@
         try {

           deleteFile(dirName + fileNameList.get(i));

         } catch (IllegalArgumentException e) {

-          LOG.error("No way to delete file: " + dirName + fileNameList.get(i),

+          LOG.severe("No way to delete file: " + dirName + fileNameList.get(i) + "\n"+

               e);

         }

       }
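
The replacement calls in this file pass `ex.toString()` to `LOG.severe`, which records the exception class and message but drops the stack trace that log4j's `LOG.error(message, ex)` carried. `java.util.logging.Logger` provides `log(Level, String, Throwable)` for attaching the exception to the record. A minimal sketch of the difference follows; the class `JulThrowableSketch` and its in-memory handler are hypothetical and exist only to make the two records inspectable.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class JulThrowableSketch {

  private static final Logger LOG =
      Logger.getLogger(JulThrowableSketch.class.getName());

  /** Keeps published records in memory so they can be examined afterwards. */
  static final class CapturingHandler extends Handler {
    final List<LogRecord> records = new ArrayList<LogRecord>();

    @Override
    public void publish(LogRecord record) {
      records.add(record);
    }

    @Override
    public void flush() {
    }

    @Override
    public void close() {
    }
  }

  /** Logs the same exception both ways and returns the two captured records. */
  static List<LogRecord> demo() {
    CapturingHandler handler = new CapturingHandler();
    LOG.setUseParentHandlers(false); // keep the demo off the console
    LOG.addHandler(handler);
    try {
      IOException ex = new IOException("disk full");
      // Style used in the diff: only the string form of the exception is kept.
      LOG.severe(ex.toString());
      // Alternative: the Throwable travels with the record, stack trace included.
      LOG.log(Level.SEVERE, "Could not write file", ex);
    } finally {
      LOG.removeHandler(handler);
    }
    return handler.records;
  }

  public static void main(String[] args) {
    for (LogRecord r : demo()) {
      System.out.println(r.getMessage() + " | thrown=" + r.getThrown());
    }
  }
}
```

With the second form, formatters such as `SimpleFormatter` can print the stack trace, which is usually what a reader of the log needs when diagnosing an I/O failure.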


diff --git a/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java b/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java
index cb55caa..fd58841 100644
--- a/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java
+++ b/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java

@@ -28,7 +28,7 @@
     System.out.println(ad.lemma_AssocWords);

     assertTrue(ad.lemma_AssocWords.size() > 0);

   }

-

+/*

   public void testTaxonomyBuild() {

     TaxonomyExtenderViaMebMining self = new TaxonomyExtenderViaMebMining();

     self.extendTaxonomy("src/test/resources/taxonomies/irs_dom.ari", "tax",

@@ -36,7 +36,7 @@
     self.close();

     assertTrue(self.getAssocWords_ExtendedAssocWords().size() > 0);

   }

-

+*/

   public void testTaxonomyMatch() {

     TaxoQuerySnapshotMatcher matcher = new TaxoQuerySnapshotMatcher(

         "src/test/resources/taxonomies/irs_domTaxo.dat");