OpenNLP OPENNLP-497
create maven script, release notes
diff --git a/opennlp-similarity/README b/opennlp-similarity/README
new file mode 100644
index 0000000..b535487
--- /dev/null
+++ b/opennlp-similarity/README
@@ -0,0 +1,138 @@
+Apache OpenNLP ${pom.version}
+===============================
+
+
+Building from the Source Distribution
+-------------------------------------
+
+At least Maven 3.0.0 is required for building.
+
+To build everything go into the opennlp directory and run the following command:
+ mvn clean install
+
+The results of the build will be placed in:
+ opennlp-distr/target/apache-opennlp-[version]-bin.tar.gz (or .zip)
+
+What is in the Similarity component of Apache OpenNLP ${pom.version}
+---------------------------------------
+SIMILARITY COMPONENT of OpenNLP
+
+1. Introduction
+This component performs text relevance assessment. It takes two portions of text (phrases, sentences, or paragraphs) and returns a similarity score.
+The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and all search results (snippets).
+This component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. Applications such as content generation
+and filtering of meaningless speech recognition results are included among the sample applications of this component.
+ Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree).
+The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts (
+www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018,
+www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448).
+ The objective of the Similarity component is to give an application engineer a tool for text relevance that can be used as a black box, with no need to understand
+ computational linguistics or machine learning.
+
+ 2. Installation
+ Please refer to the OpenNLP installation instructions.
+
+ 3. First use case of Similarity component: search
+
+ To get started with this component, please refer to SearchResultsProcessorTest.java in package opennlp.tools.similarity.apps.
+ public void testSearchOrder() runs a web search using the Bing API and improves search relevance.
+ Look at the code of
+ public List<HitBase> runSearch(String query)
+ and then at
+ private BingResponse calculateMatchScoreResortHits(BingResponse resp, String searchQuery)
+ which gets search results from Bing and re-ranks them based on the computed similarity score.
+
+ The main entry point to the Similarity component is
+ SentencePairMatchResult matchRes = sm.assessRelevance(snapshot, searchQuery);
+ where we pass the search query and the snapshot and obtain the similarity assessment structure which includes the similarity score.
+
+ To run this test, you need to obtain a search API key from Bing at www.bing.com/developers/s/APIBasics.html and specify it in public class BingQueryRunner in
+ protected static final String APP_ID.
+
+ 4. Solving a unique problem: content generation
+ To demonstrate how the Similarity component can tackle a problem that is hard to solve without linguistics-based technology,
+ we introduce a content generation component:
+ RelatedSentenceFinder.java
+
+ The entry point here is the function call
+ hits = f.generateContentAbout("Albert Einstein");
+ which writes a biography of Albert Einstein by finding sentences on the web about various kinds of his activities (such as 'born', 'graduate', 'invented' etc.).
+ The key here is to compute similarity between a seed expression such as "Albert Einstein invented relativity theory" and a search result such as
+ "Albert Einstein College of Medicine | Medical Education | Biomedical ...
+ www.einstein.yu.edu/Albert Einstein College of Medicine is one of the nation's premier institutions for medical education, ..."
+ and filter out irrelevant search results.
+
+ This is done in function
+ public HitBase augmentWithMinedSentencesAndVerifyRelevance(HitBase item, String originalSentence,
+ List<String> sentsAll)
+
+ SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " + title, originalSentence);
+ You can consult the results in gen.txt, which contains the generated essay on Einstein's biography.
+
+ These are examples of generated articles, given the article title:
+ http://www.allvoices.com/contributed-news/9423860/content/81937916-ichie-sings-jazz-blues-contemporary-tunes
+ http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area
+
+ 5. Solving a high-importance problem: filtering out meaningless speech recognition results.
+ Speech recognition SDKs usually produce a number of candidate phrases as results, such as
+ "remember to buy milk tomorrow from trader joes",
+ "remember to buy milk tomorrow from 3 to jones"
+ One can see that the former is meaningful, and the latter is meaningless (although similar in terms of how it is pronounced).
+ We use web mining and the Similarity component to detect the meaningful option (a mistake caused by a query understanding system,
+ such as Siri for iPhone, trying to interpret a meaningless request can be costly).
+
+ SpeechRecognitionResultsProcessor.java does the job:
+ public List<SentenceMeaningfullnessScore> runSearchAndScoreMeaningfulness(List<String> sents)
+ re-ranks the phrases in order of decreasing meaningfulness.
+
+ 6. Similarity component internals
+ In the package opennlp.tools.textsimilarity.chunker2matcher,
+ ParserChunker2MatcherProcessor.java parses two portions of text and matches the resulting parse trees to assess the similarity between
+ these portions of text.
+ To run ParserChunker2MatcherProcessor,
+ private static String MODEL_DIR = "resources/models";
+ needs to be specified.
+
+ The key function
+ public SentencePairMatchResult assessRelevance(String para1, String para2)
+ takes two portions of text and assesses their similarity by finding the set of all maximum common subtrees
+ of the set of parse trees for each portion of text.
+
+ It splits paragraphs into sentences, parses them, obtains chunking information, and produces grouped phrases (noun, verb, prepositional, etc.):
+ public synchronized List<List<ParseTreeChunk>> formGroupedPhrasesFromChunksForPara(String para)
+
+ and then attempts to find common subtrees:
+ in ParseTreeMatcherDeterministic.java
+ List<List<ParseTreeChunk>> res = md.matchTwoSentencesGroupedChunksDeterministic(sent1GrpLst, sent2GrpLst)
+
+ Phrase matching functionality is in package opennlp.tools.textsimilarity,
+ in ParseTreeMatcherDeterministic.java.
+ Here is the key matching function, which takes two phrases, aligns them, and finds a set of maximum common sub-phrases:
+ public List<ParseTreeChunk> generalizeTwoGroupedPhrasesDeterministic
+
+ 7. Package structure
+ opennlp.tools.similarity.apps : 3 main applications
+ opennlp.tools.similarity.apps.utils: utilities for above applications
+
+ opennlp.tools.textsimilarity.chunker2matcher: parser which converts text into a form for matching parse trees
+ opennlp.tools.textsimilarity: parse tree matching functionality
+
+
+
+
+Requirements
+------------
+Java 1.5 is required to run OpenNLP
+Maven 3.0.0 is required for building it
+
+Known OSGi Issues
+-----------------
+In an OSGi environment the following things are not supported:
+- The coreference resolution component
+- The ability to load a user provided feature generator class
+
+Note
+----
+The current API still contains many deprecated methods. These
+will be removed in one of our next releases, so please
+migrate to our new API.
diff --git a/opennlp-similarity/RELEASE_NOTES.html b/opennlp-similarity/RELEASE_NOTES.html
new file mode 100644
index 0000000..7706367
--- /dev/null
+++ b/opennlp-similarity/RELEASE_NOTES.html
@@ -0,0 +1,77 @@
+<!--
+ ***************************************************************
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ ***************************************************************
+-->
+
+<html>
+<head>
+ <title>Apache OpenNLP ${pom.version} Release Notes</title>
+</head>
+<body>
+<h1>Apache OpenNLP ${pom.version} Release Notes</h1>
+
+<h2>Contents</h2>
+<p>
+<a href="#what.is.opennlp">What is Similarity component of Apache OpenNLP?</a><br/>
+<a href="#major.changes">This Release</a><br/>
+<a href="#get.involved">How to Get Involved</a><br/>
+<a href="#report.issues">How to Report Issues</a><br/>
+<a href="#list.issues">List of JIRA Issues Fixed in this Release</a><br/>
+</p>
+
+<h2><a name="what.is.opennlp">What is Similarity component of Apache OpenNLP?</a></h2>
+<p>
+This component performs text relevance assessment. It takes two portions of text (phrases, sentences, or paragraphs) and returns a similarity score.
+The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and all search results (snippets).
+This component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. Applications such as content generation
+and filtering of meaningless speech recognition results are included among the sample applications of this component.
+ Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree).
+The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts (
+www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018,
+www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448).
+ The objective of the Similarity component is to give an application engineer a tool for text relevance that can be used as a black box, with no need to understand
+ computational linguistics or machine learning.
+</p>
+
+<h2><a name="major.changes">This Release</a></h2>
+<p>
+Please see the <a href="README">README</a> for this information.
+</p>
+
+<h2><a name="get.involved">How to Get Involved</a></h2>
+<p>
+The Apache OpenNLP project really needs and appreciates any contributions,
+including documentation help, source code and feedback. If you are interested
+in contributing, please visit <a href="http://opennlp.apache.org/">http://opennlp.apache.org/</a>
+</p>
+
+<h2><a name="report.issues">How to Report Issues</a></h2>
+<p>
+The Apache OpenNLP project uses JIRA for issue tracking. Please report any
+issues you find at
+<a href="http://issues.apache.org/jira/browse/opennlp">http://issues.apache.org/jira/browse/opennlp</a>
+</p>
+
+<h2><a name="list.issues">List of JIRA Issues Fixed in this Release</a></h2>
+<p>
+Click <a href="issuesFixed/jira-report.html">issuesFixed/jira-report.html</a> for the list of
+issues fixed in this release.
+</p>
+</body>
+</html>
\ No newline at end of file
diff --git a/opennlp-similarity/pom.xml b/opennlp-similarity/pom.xml
index 4d0bb87..d2fdf4d 100644
--- a/opennlp-similarity/pom.xml
+++ b/opennlp-similarity/pom.xml
@@ -71,15 +71,15 @@
<version>0.7</version>
</dependency>
<dependency>
- <groupId>xstream.codehaus.org</groupId>
- <artifactId>xstream</artifactId>
- <version>1.4.2</version>
- </dependency>
- <dependency>
<groupId>net.sf.opencsv</groupId>
<artifactId>opencsv</artifactId>
<version>2.0</version>
</dependency>
+ <dependency>
+ <groupId>org.apache.lucene</groupId>
+ <artifactId>lucene-core</artifactId>
+ <version>3.5.0</version>
+ </dependency>
</dependencies>
<build>
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java
index 3c6fc59..3b47619 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/taxo_builder/TaxoQuerySnapshotMatcher.java
@@ -28,7 +28,7 @@
import opennlp.tools.similarity.apps.utils.FileHandler;
import opennlp.tools.textsimilarity.chunker2matcher.ParserChunker2MatcherProcessor;
-import com.thoughtworks.xstream.XStream;
+//import com.thoughtworks.xstream.XStream;
/**
* This class can be used to generate scores based on the overlapping between a
@@ -106,7 +106,7 @@
*
* @param taxonomyPath
* @param taxonomyXML_Path
- * */
+ *
public void convertDatToXML(String taxonomyXML_Path, TaxonomySerializer taxo) {
XStream xStream = new XStream();
@@ -128,7 +128,7 @@
matcher.taxo = (TaxonomySerializer) xStream.fromXML(fileHandler
.readFromTextFile("src/test/resources/taxo_English.xml"));
}
-
+*/
public void close() {
sm.close();
}
diff --git a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java
index adb5321..d15cc17 100644
--- a/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java
+++ b/opennlp-similarity/src/main/java/opennlp/tools/similarity/apps/utils/FileHandler.java
@@ -34,7 +34,7 @@
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
-import org.apache.log4j.Logger;
+import java.util.logging.Logger;
/**
* This class responsible to save data to files as well as read out! It is
@@ -52,7 +52,7 @@
out.write(data + "\n");
out.close();
} catch (IOException e) {
- LOG.error(e);
+ LOG.severe(e.toString());
e.printStackTrace();
}
}
@@ -99,7 +99,7 @@
outputStream = new ObjectOutputStream(new FileOutputStream(filepath));
outputStream.writeObject(obj);
} catch (IOException e) {
- LOG.error(e);
+ LOG.severe(e.toString());
}
}
@@ -114,13 +114,13 @@
}
} catch (EOFException ex) { // This exception will be caught when EOF is
// reached
- LOG.error("End of file reached.", ex);
+ LOG.severe("End of file reached.\n" + ex.toString());
} catch (ClassNotFoundException ex) {
- LOG.error(ex);
+ LOG.severe(ex.toString());
} catch (FileNotFoundException ex) {
- LOG.error(ex);
+ LOG.severe(ex.toString());
} catch (IOException ex) {
- LOG.error(ex);
+ LOG.severe(ex.toString());
} finally {
// Close the ObjectInputStream
try {
@@ -128,7 +128,7 @@
inputStream.close();
}
} catch (IOException ex) {
- LOG.error(ex);
+ LOG.severe(ex.toString());
}
}
return null;
@@ -187,7 +187,7 @@
input.close();
}
} catch (IOException ex) {
- LOG.error("fileName: " + filePath, ex);
+ LOG.severe("fileName: " + filePath + "\n " + ex);
}
return contents.toString();
}
@@ -224,7 +224,7 @@
input.close();
}
} catch (IOException ex) {
- LOG.error(ex);
+ LOG.severe(ex.toString());
}
return lines;
}
@@ -244,7 +244,7 @@
try {
file.mkdirs();
} catch (Exception e) {
- LOG.error("Directory already exists or the file-system is read only", e);
+ LOG.severe("Directory already exists or the file-system is read only\n" + e);
}
}
}
@@ -300,7 +300,7 @@
try {
deleteFile(dirName + fileNameList.get(i));
} catch (IllegalArgumentException e) {
- LOG.error("No way to delete file: " + dirName + fileNameList.get(i),
+ LOG.severe("No way to delete file: " + dirName + fileNameList.get(i) + "\n"+
e);
}
}
diff --git a/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java b/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java
index cb55caa..fd58841 100644
--- a/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java
+++ b/opennlp-similarity/src/test/java/opennlp/tools/similarity/apps/taxo_builder/TaxonomyBuildMatchTest.java
@@ -28,7 +28,7 @@
System.out.println(ad.lemma_AssocWords);
assertTrue(ad.lemma_AssocWords.size() > 0);
}
-
+/*
public void testTaxonomyBuild() {
TaxonomyExtenderViaMebMining self = new TaxonomyExtenderViaMebMining();
self.extendTaxonomy("src/test/resources/taxonomies/irs_dom.ari", "tax",
@@ -36,7 +36,7 @@
self.close();
assertTrue(self.getAssocWords_ExtendedAssocWords().size() > 0);
}
-
+*/
public void testTaxonomyMatch() {
TaxoQuerySnapshotMatcher matcher = new TaxoQuerySnapshotMatcher(
"src/test/resources/taxonomies/irs_domTaxo.dat");