| <html><head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title>Apache OpenNLP Developer Documentation</title><link rel="stylesheet" href="css/opennlp-docs.css" type="text/css"><meta name="generator" content="DocBook XSL-NS Stylesheets V1.75.2"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div lang="en" class="book" title="Apache OpenNLP Developer Documentation"><div class="titlepage"><div><div><h1 class="title"><a name="d4e1"></a>Apache OpenNLP Developer Documentation</h1></div><div><div class="authorgroup"> |
| <h3 class="corpauthor">Written and maintained by the Apache OpenNLP Development |
| Community</h3> |
| </div></div><div><p class="releaseinfo"> |
| Version 2.1.0 |
| </p></div><div><p class="copyright">Copyright © 2011, 2024 The Apache Software Foundation</p></div><div><div class="legalnotice" title="Legal Notice"><a name="d4e7"></a> |
| <p title="License and Disclaimer"> |
| <b>License and Disclaimer. </b> |
| |
| The ASF licenses this documentation |
| to you under the Apache License, |
| Version 2.0 (the |
| "License"); you may not use this documentation |
| except in compliance |
| with the License. You may obtain a copy of the |
| License at |
| |
| </p><div class="blockquote"><blockquote class="blockquote"> |
| <p> |
| <a class="ulink" href="http://www.apache.org/licenses/LICENSE-2.0" target="_top">http://www.apache.org/licenses/LICENSE-2.0</a> |
| </p> |
| </blockquote></div><p title="License and Disclaimer"> |
| |
| Unless required by applicable law or agreed to in writing, |
| this documentation and its contents are distributed under the License |
| on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| </p> |
| </div></div></div><hr></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="chapter"><a href="#opennlp">1. Introduction</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.description">Description</a></span></dt><dt><span class="section"><a href="#intro.general.library.structure">General Library Structure</a></span></dt><dt><span class="section"><a href="#intro.api">Application Program Interface (API). Generic Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.cli.description">Description</a></span></dt><dt><span class="section"><a href="#intro.cli.toolslist">List of tools</a></span></dt><dt><span class="section"><a href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a href="#intro.cli.generic">Generic Example</a></span></dt></dl></dd><dt><span class="section"><a href="#intro.models">OpenNLP Models</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.models.native">OpenNLP Models</a></span></dt><dt><span class="section"><a href="#intro.models.onnx">ONNX Models</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.langdetect">2. 
Language Detector</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.langdetect.classifying">Classifying</a></span></dt><dt><span class="section"><a href="#tools.langdetect.classifying.cmdline">Language Detector Tool</a></span></dt><dt><span class="section"><a href="#tools.langdetect.classifying.api">Language Detector API</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.langdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.leipzig">Training with Leipzig</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.api">Training API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.sentdetect">3. Sentence Detector</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection">Sentence Detection</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection.cmdline">Sentence Detection Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.detection.api">Sentence Detection API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.training">Sentence Detector Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.tokenizer">4. 
Tokenizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.cmdline">Tokenizer Tools</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.training">Tokenizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.detokenizing.dict">Detokenizer Dictionary</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.namefind">5. 
Name Finder</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition">Named Entity Recognition</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition.cmdline">Name Finder Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.recognition.api">Name Finder API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.namefind.training">Name Finder Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen">Custom Feature Generation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.featuregen.api">Feature Generation defined by API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen.xml">Feature Generation defined by XML Descriptor</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.eval.tool">Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.annotation_guides">Named Entity Annotation Guidelines</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.doccat">6. 
Document Categorizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.classifying">Classifying</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.classifying.cmdline">Document Categorizer Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.classifying.api">Document Categorizer API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.doccat.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.training.api">Training API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.postagger">7. Part-of-Speech Tagger</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging">Tagging</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging.cmdline">POS Tagger Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.tagging.api">POS Tagger API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.tagdict">Tag Dictionary</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.lemmatizer">8. 
Lemmatizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.lemmatizer.tagging.cmdline">Lemmatizer Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.tagging.api">Lemmatizer API</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training">Lemmatizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.lemmatizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.lemmatizer.evaluation">Lemmatizer Evaluation</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.chunker">9. Chunker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.chunking">Chunking</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.chunking.cmdline">Chunker Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.chunking.api">Chunking API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.training">Chunker Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.chunker.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.evaluation">Chunker Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.evaluation.tool">Chunker Evaluation Tool</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.parser">10. 
Parser</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.parsing">Parsing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.parsing.cmdline">Parser Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.parsing.api">Parsing API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.training">Parser Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.evaluation">Parser Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.evaluation.tool">Parser Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.evaluation.api">Evaluation API</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#tools.coref">11. Coreference Resolution</a></span></dt><dt><span class="chapter"><a href="#tools.extension">12. Extending OpenNLP</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.extension.writing">Writing an extension</a></span></dt><dt><span class="section"><a href="#tools.extension.osgi">Running in an OSGi container</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.corpora">13. 
Corpora</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll">CONLL</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000">CONLL 2000</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.evaluation">Evaluating</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2002">CONLL 2002</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2002.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.training.spanish">Training with Spanish data</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2003">CONLL 2003</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2003.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.training.english">Training with English data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.evaluation.english">Evaluating with English data</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.corpora.arvores-deitadas">Arvores Deitadas</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.converting">Converting the data (optional)</a></span></dt><dt><span 
class="section"><a href="#tools.corpora.arvores-deitadas.evaluation">Training and Evaluation</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.ontonotes">OntoNotes Release 4.0</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.ontonotes.namefinder">Name Finder Training</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.brat">Brat Format Support</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.brat.webtool">Sentences and Tokens</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.evaluation">Evaluation</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.cross-validation">Cross Validation</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#opennlp.ml">14. Machine Learning</a></span></dt><dd><dl><dt><span class="section"><a href="#opennlp.ml.maxent">Maximum Entropy</a></span></dt><dd><dl><dt><span class="section"><a href="#opennlp.ml.maxent.impl">Implementation</a></span></dt></dl></dd></dl></dd><dt><span class="chapter"><a href="#org.apche.opennlp.uima">15. UIMA Integration</a></span></dt><dd><dl><dt><span class="section"><a href="#org.apche.opennlp.running-pear-sample">Running the pear sample in CVD</a></span></dt><dt><span class="section"><a href="#org.apche.opennlp.further-help">Further Help</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.morfologik-addon">16. Morfologik Addon</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.morfologik-addon.api">Morfologik Integration</a></span></dt><dt><span class="section"><a href="#tools.morfologik-addon.cmdline">Morfologik CLI Tools</a></span></dt></dl></dd><dt><span class="chapter"><a href="#tools.cli">17. 
The Command Line Interface</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.doccat">Doccat</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.doccat.Doccat">Doccat</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatTrainer">DoccatTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatEvaluator">DoccatEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatCrossValidator">DoccatCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatConverter">DoccatConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.langdetect">Langdetect</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetector">LanguageDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorTrainer">LanguageDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorConverter">LanguageDetectorConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorCrossValidator">LanguageDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorEvaluator">LanguageDetectorEvaluator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.dictionary">Dictionary</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.dictionary.DictionaryBuilder">DictionaryBuilder</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.tokenizer">Tokenizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.tokenizer.SimpleTokenizer">SimpleTokenizer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerME">TokenizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerTrainer">TokenizerTrainer</a></span></dt><dt><span class="section"><a 
href="#tools.cli.tokenizer.TokenizerMEEvaluator">TokenizerMEEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerCrossValidator">TokenizerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerConverter">TokenizerConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.DictionaryDetokenizer">DictionaryDetokenizer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.sentdetect">Sentdetect</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetector">SentenceDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorTrainer">SentenceDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorEvaluator">SentenceDetectorEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorCrossValidator">SentenceDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorConverter">SentenceDetectorConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.namefind">Namefind</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinder">TokenNameFinder</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderTrainer">TokenNameFinderTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderEvaluator">TokenNameFinderEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderCrossValidator">TokenNameFinderCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderConverter">TokenNameFinderConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.CensusDictionaryCreator">CensusDictionaryCreator</a></span></dt></dl></dd><dt><span class="section"><a 
href="#tools.cli.postag">Postag</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.postag.POSTagger">POSTagger</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerTrainer">POSTaggerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerEvaluator">POSTaggerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerCrossValidator">POSTaggerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerConverter">POSTaggerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.lemmatizer">Lemmatizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerME">LemmatizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerTrainerME">LemmatizerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerEvaluator">LemmatizerEvaluator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.chunker">Chunker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.chunker.ChunkerME">ChunkerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerTrainerME">ChunkerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerEvaluator">ChunkerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerCrossValidator">ChunkerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerConverter">ChunkerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.parser">Parser</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.parser.Parser">Parser</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserTrainer">ParserTrainer</a></span></dt><dt><span class="section"><a 
href="#tools.cli.parser.ParserEvaluator">ParserEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserConverter">ParserConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.BuildModelUpdater">BuildModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.CheckModelUpdater">CheckModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.TaggerModelReplacer">TaggerModelReplacer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.entitylinker">Entitylinker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.entitylinker.EntityLinker">EntityLinker</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.languagemodel">Languagemodel</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.languagemodel.NGramLanguageModel">NGramLanguageModel</a></span></dt></dl></dd></dl></dd></dl></div><div class="list-of-tables"><p><b>List of Tables</b></p><dl><dt>2.1. <a href="#d4e96">Normalizers</a></dt><dt>5.1. <a href="#d4e361">Feature Generators</a></dt></dl></div> |
| |
| |
| |
| |
| <div class="chapter" title="Chapter 1. Introduction"><div class="titlepage"><div><div><h2 class="title"><a name="opennlp"></a>Chapter 1. Introduction</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#intro.description">Description</a></span></dt><dt><span class="section"><a href="#intro.general.library.structure">General Library Structure</a></span></dt><dt><span class="section"><a href="#intro.api">Application Program Interface (API). Generic Example</a></span></dt><dt><span class="section"><a href="#intro.cli">Command line interface (CLI)</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.cli.description">Description</a></span></dt><dt><span class="section"><a href="#intro.cli.toolslist">List of tools</a></span></dt><dt><span class="section"><a href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a href="#intro.cli.generic">Generic Example</a></span></dt></dl></dd><dt><span class="section"><a href="#intro.models">OpenNLP Models</a></span></dt><dd><dl><dt><span class="section"><a href="#intro.models.native">OpenNLP Models</a></span></dt><dt><span class="section"><a href="#intro.models.onnx">ONNX Models</a></span></dt></dl></dd></dl></div> |
| |
| <div class="section" title="Description"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.description"></a>Description</h2></div></div></div> |
| |
| <p> |
| The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. |
| It supports the most common NLP tasks, such as tokenization, sentence segmentation, |
| part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. |
| These tasks are usually required to build more advanced text processing services. |
| OpenNLP also includes maximum entropy and perceptron based machine learning. |
| </p> |
| |
<p>
The goal of the OpenNLP project is to create a mature toolkit for the above-mentioned tasks.
An additional goal is to provide a large number of pre-built models for a variety of languages, as
well as the annotated text resources from which those models are derived.
</p>
| </div> |
| |
| <div class="section" title="General Library Structure"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.general.library.structure"></a>General Library Structure</h2></div></div></div> |
| |
<p>The Apache OpenNLP library contains several components that together enable one to build
		a full natural language processing pipeline: sentence detector, tokenizer,
		name finder, document categorizer, part-of-speech tagger, chunker, parser, and
		coreference resolver. Each component provides facilities to execute the
		respective natural language processing task, to train a model, and often also to evaluate a
		model. Each of these facilities is accessible via its application program
		interface (API). In addition, a command line interface (CLI) is provided for convenient
		experimentation and training.
	</p>
| </div> |
| |
| <div class="section" title="Application Program Interface (API). Generic Example"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.api"></a>Application Program Interface (API). Generic Example</h2></div></div></div> |
| |
<p>
		OpenNLP components have similar APIs. Normally, to execute a task,
		one provides a model and an input.
	</p>
	<p>
		A model is usually loaded by passing an InputStream over the model file to a
		constructor of the model class:
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"lang-model-name.bin"</i></b>)) { |
| SomeModel model = <b class="hl-keyword">new</b> SomeModel(modelIn); |
| } |
| |
| </pre><p> |
| </p> |
| <p> |
		After the model is loaded, the tool itself can be instantiated.
| </p><pre class="programlisting"> |
| |
| ToolName toolName = <b class="hl-keyword">new</b> ToolName(model); |
| </pre><p> |
| After the tool is instantiated, the processing task can be executed. The input and the |
| output formats are specific to the tool, but often the output is an array of String, |
| and the input is a String or an array of String. |
| </p><pre class="programlisting"> |
| |
String[] output = toolName.executeTask(<b class="hl-string"><i style="color:red">"This is a sample text."</i></b>);
| </pre><p> |
| </p> |
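<p>
	As a concrete instance of this generic pattern (a sketch: the model file name
	<code>en-token.bin</code> is an assumption, standing in for any pre-trained
	English tokenizer model on disk), the learnable tokenizer can be loaded and
	invoked as follows:
	</p><pre class="programlisting">

<b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-token.bin"</i></b>)) {
  <i class="hl-comment" style="color: silver">// The model class constructor reads and validates the serialized model.</i>
  TokenizerModel model = <b class="hl-keyword">new</b> TokenizerModel(modelIn);
  <i class="hl-comment" style="color: silver">// The tool is instantiated with the loaded model ...</i>
  TokenizerME tokenizer = <b class="hl-keyword">new</b> TokenizerME(model);
  <i class="hl-comment" style="color: silver">// ... and the task is executed; here input and output are both tool-specific.</i>
  String[] tokens = tokenizer.tokenize(<b class="hl-string"><i style="color:red">"This is a sample text."</i></b>);
}
</pre><p>
	</p>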
| </div> |
| |
| <div class="section" title="Command line interface (CLI)"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.cli"></a>Command line interface (CLI)</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#intro.cli.description">Description</a></span></dt><dt><span class="section"><a href="#intro.cli.toolslist">List of tools</a></span></dt><dt><span class="section"><a href="#intro.cli.setup">Setting up</a></span></dt><dt><span class="section"><a href="#intro.cli.generic">Generic Example</a></span></dt></dl></div> |
| |
| <div class="section" title="Description"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.description"></a>Description</h3></div></div></div> |
| |
<p>
		OpenNLP provides a command line script that serves as a single entry point to all
		included tools. The script is located in the bin directory of the OpenNLP binary
		distribution. Two versions are included: opennlp.bat for Windows and opennlp for
		Linux or compatible systems.
	</p>
| </div> |
| |
| <div class="section" title="List of tools"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.toolslist"></a>List of tools</h3></div></div></div> |
| |
<p>
		The list of command line tools for Apache OpenNLP 2.1.0,
		as well as a description of their arguments, is available in <a class="xref" href="#tools.cli" title="Chapter 17. The Command Line Interface">Chapter 17, <i>The Command Line Interface</i></a>.
	</p>
| </div> |
| |
| <div class="section" title="Setting up"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.setup"></a>Setting up</h3></div></div></div> |
| |
<p>
		The OpenNLP script uses the JAVA_CMD and JAVA_HOME environment variables to determine
		which command to use to execute the Java virtual machine.
	</p>
	<p>
		The OpenNLP script uses the OPENNLP_HOME variable to determine the location of the
		binary distribution of OpenNLP. It is recommended to point this variable to the binary
		distribution of the current OpenNLP version and to update the PATH variable to include
		$OPENNLP_HOME/bin or %OPENNLP_HOME%\bin.
	</p>
	<p>
		This configuration allows OpenNLP to be invoked conveniently. The examples below
		assume this configuration has been done.
	</p>
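<p>
	For example, on Linux or compatible systems the variables might be set as
	follows (the installation path shown is an assumption; substitute the
	directory where the binary distribution was actually unpacked):
	</p><pre class="screen">

$ export OPENNLP_HOME=/usr/local/apache-opennlp-2.1.0
$ export PATH=$PATH:$OPENNLP_HOME/bin
</pre><p>
	</p>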
| </div> |
| |
| <div class="section" title="Generic Example"><div class="titlepage"><div><div><h3 class="title"><a name="intro.cli.generic"></a>Generic Example</h3></div></div></div> |
| |
| |
| <p> |
| Apache OpenNLP provides a common command line script to access all its tools: |
| </p><pre class="screen"> |
| |
| $ opennlp |
| </pre><p> |
		Invoked without arguments, the script prints the current version of the library and lists all available tools:
| </p><pre class="screen"> |
| |
| OpenNLP <VERSION>. Usage: opennlp TOOL |
| where TOOL is one of: |
| Doccat learnable document categorizer |
| DoccatTrainer trainer for the learnable document categorizer |
| DoccatConverter converts leipzig data format to native OpenNLP format |
| DictionaryBuilder builds a new dictionary |
| SimpleTokenizer character class tokenizer |
| TokenizerME learnable tokenizer |
| TokenizerTrainer trainer for the learnable tokenizer |
| TokenizerMEEvaluator evaluator for the learnable tokenizer |
| TokenizerCrossValidator K-fold cross validator for the learnable tokenizer |
| TokenizerConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format |
| DictionaryDetokenizer |
| SentenceDetector learnable sentence detector |
| SentenceDetectorTrainer trainer for the learnable sentence detector |
| SentenceDetectorEvaluator evaluator for the learnable sentence detector |
| SentenceDetectorCrossValidator K-fold cross validator for the learnable sentence detector |
| SentenceDetectorConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format |
| TokenNameFinder learnable name finder |
| TokenNameFinderTrainer trainer for the learnable name finder |
| TokenNameFinderEvaluator Measures the performance of the NameFinder model with the reference data |
| TokenNameFinderCrossValidator K-fold cross validator for the learnable Name Finder |
| TokenNameFinderConverter converts foreign data formats (bionlp2004,conll03,conll02,ad) to native OpenNLP format |
| CensusDictionaryCreator Converts 1990 US Census names into a dictionary |
| POSTagger learnable part of speech tagger |
| POSTaggerTrainer trains a model for the part-of-speech tagger |
| POSTaggerEvaluator Measures the performance of the POS tagger model with the reference data |
| POSTaggerCrossValidator K-fold cross validator for the learnable POS tagger |
| POSTaggerConverter converts conllx data format to native OpenNLP format |
| ChunkerME learnable chunker |
| ChunkerTrainerME trainer for the learnable chunker |
| ChunkerEvaluator Measures the performance of the Chunker model with the reference data |
| ChunkerCrossValidator K-fold cross validator for the chunker |
| ChunkerConverter converts ad data format to native OpenNLP format |
| Parser performs full syntactic parsing |
| ParserTrainer trains the learnable parser |
| ParserEvaluator Measures the performance of the Parser model with the reference data |
| BuildModelUpdater trains and updates the build model in a parser model |
| CheckModelUpdater trains and updates the check model in a parser model |
| TaggerModelReplacer replaces the tagger model in a parser model |
| All tools print help when invoked with help parameter |
| Example: opennlp SimpleTokenizer help |
| |
| </pre><p> |
| </p> |
| <p>OpenNLP tools share a similar command line structure and options. To discover a tool's |
| options, run it with no parameters: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolName |
| </pre><p> |
| The tool will output two blocks of help. |
| </p> |
| <p> |
| The first block describes the general structure of this tool command line: |
| </p><pre class="screen"> |
| |
| Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ... -model modelFile ... |
| </pre><p> |
| The general structure of this tool's command line includes the mandatory tool name |
| (TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]), |
| the optional parameters ([-abbDict path] ...), and the mandatory parameters |
| (-model modelFile ...). |
| </p> |
| <p> |
| The format parameters enable direct processing of non-native data without conversion. |
| Each format may have its own parameters, which are displayed if the tool is |
| executed with no parameters or with the help parameter: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerTrainer.conllx help |
| </pre><p> |
| </p><pre class="screen"> |
| |
| Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt isAlphaNumOpt] ... |
| |
| Arguments description: |
| -abbDict path |
| abbreviation dictionary in XML format. |
| ... |
| </pre><p> |
| To switch the tool to a specific format, add a dot and the format name after |
| the tool name: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerTrainer.conllx -model en-pos.bin ... |
| </pre><p> |
| </p> |
| <p> |
| The second block of the help message describes the individual arguments: |
| </p><pre class="screen"> |
| |
| Arguments description: |
| -type maxent|perceptron|perceptron_sequence |
| The type of the token name finder model. One of maxent|perceptron|perceptron_sequence. |
| -dict dictionaryPath |
| The XML tag dictionary file |
| ... |
| </pre><p> |
| </p> |
| <p> |
| Most processing tools require at least a model to be provided: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolName lang-model-name.bin |
| </pre><p> |
| When a tool is executed this way, the model is loaded and the tool waits for |
| input from standard input. This input is processed and the result is printed to standard |
| output. |
| </p> |
| <p>Alternatively, and in fact most commonly, console input and output redirection |
| is used to provide the input and output files: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolName lang-model-name.bin < input.txt > output.txt |
| </pre><p> |
| </p> |
| <p> |
| Most model training tools require a model name first, |
| optionally some training options (such as the model type or the number of iterations), |
| and then the data. |
| </p> |
| <p> |
| A model name is just a file name. |
| </p> |
| <p> |
| Training options often include the number of iterations, the cutoff, |
| an abbreviations dictionary, and so on. Sometimes these options can be provided |
| via a training options file. In that case the command line options are ignored and the |
| ones from the file are used. |
| </p> |
| <p> |
| For the data, one has to specify its location (a file name) and often the |
| language and encoding. |
| </p> |
| <p> |
| A generic example of a command line to launch a tool trainer might be: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8 |
| </pre><p> |
| or with a format: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train \ |
| -types per -encoding UTF-8 |
| </pre><p> |
| </p> |
| <p>Most model evaluation tools are similar to those for task execution, and |
| need to be provided first with a model name, optionally some evaluation options (such |
| as whether to print misclassified samples), and then the test data. A generic |
| example of a command line to launch an evaluation tool might be: |
| </p><pre class="screen"> |
| |
| $ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8 |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="OpenNLP Models"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="intro.models"></a>OpenNLP Models</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#intro.models.native">OpenNLP Models</a></span></dt><dt><span class="section"><a href="#intro.models.onnx">ONNX Models</a></span></dt></dl></div> |
| |
| <div class="section" title="OpenNLP Models"><div class="titlepage"><div><div><h3 class="title"><a name="intro.models.native"></a>OpenNLP Models</h3></div></div></div> |
| |
| <p> |
| OpenNLP supports training NLP models that can then be used by OpenNLP itself. In this |
| documentation we refer to these models as "OpenNLP models." All NLP |
| components of OpenNLP support this type of model. The sections below |
| describe how to train and use these models. <a class="ulink" href="https://opennlp.apache.org/models.html" target="_top">Pre-trained |
| models</a> are available for some languages and some of the OpenNLP components. |
| </p> |
| </div> |
| <div class="section" title="ONNX Models"><div class="titlepage"><div><div><h3 class="title"><a name="intro.models.onnx"></a>ONNX Models</h3></div></div></div> |
| |
| <p> |
| OpenNLP supports ONNX models via the ONNX Runtime for the <a class="link" href="#tools.namefind" title="Chapter 5. Name Finder">Name Finder</a> |
| and <a class="link" href="#tools.doccat" title="Chapter 6. Document Categorizer">Document Categorizer</a>. This allows models trained by other frameworks |
| such as PyTorch and Tensorflow to be used by OpenNLP. The documentation for |
| each of the OpenNLP components that supports ONNX models describes how to |
| use ONNX models for inference. Note that OpenNLP does not support training |
| models that can be used by the ONNX Runtime - ONNX models must be created |
| outside of OpenNLP using other tools. |
| </p> |
| </div> |
| </div> |
| |
| </div> |
| <div class="chapter" title="Chapter 2. Language Detector"><div class="titlepage"><div><div><h2 class="title"><a name="tools.langdetect"></a>Chapter 2. Language Detector</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.langdetect.classifying">Classifying</a></span></dt><dt><span class="section"><a href="#tools.langdetect.classifying.cmdline">Language Detector Tool</a></span></dt><dt><span class="section"><a href="#tools.langdetect.classifying.api">Language Detector API</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.langdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.leipzig">Training with Leipzig</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.api">Training API</a></span></dt></dl></dd></dl></div> |
| |
| <div class="section" title="Classifying"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.langdetect.classifying"></a>Classifying</h2></div></div></div> |
| |
| <p> |
| The OpenNLP Language Detector classifies a document into ISO-639-3 languages according to the model's capabilities. |
| A model can be trained with the Maxent, Perceptron, or Naive Bayes algorithms. By default the text is normalized and |
| the context generator extracts n-grams of size 1, 2 and 3. The n-gram sizes, the normalization and the |
| context generator can be customized by extending the LanguageDetectorFactory. |
| |
| </p> |
| <p> |
| The default normalizers are: |
| |
| </p><div class="table"><a name="d4e96"></a><p class="title"><b>Table 2.1. Normalizers</b></p><div class="table-contents"> |
| |
| <table summary="Normalizers" border="1"><colgroup><col><col></colgroup><thead><tr><th>Normalizer</th><th>Description</th></tr></thead><tbody><tr><td>EmojiCharSequenceNormalizer</td><td>Replaces emojis with a blank space.</td></tr><tr><td>UrlCharSequenceNormalizer</td><td>Replaces URLs and e-mail addresses with a blank space.</td></tr><tr><td>TwitterCharSequenceNormalizer</td><td>Replaces hashtags and Twitter user names with blank spaces.</td></tr><tr><td>NumberCharSequenceNormalizer</td><td>Replaces number sequences with blank spaces.</td></tr><tr><td>ShrinkCharSequenceNormalizer</td><td>Shrinks characters that repeat three or more times to only two repetitions.</td></tr></tbody></table> |
| </div></div><p><br class="table-break"> |
| </p> |
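<p>As a rough illustration of what two of these normalizers do, the shrink and number behaviors can be approximated with plain regular expressions. This is a hedged sketch of the behavior described in the table, not the actual OpenNLP implementation, and the method names are made up for this example:</p>

```java
// Regex-based sketches of two default normalizers; the real OpenNLP
// classes may behave differently in edge cases.
public class NormalizerSketch {

    // Shrinks a character repeated three or more times down to two
    // repetitions, e.g. "cooool" -> "cool".
    static String shrink(String text) {
        return text.replaceAll("(.)\\1{2,}", "$1$1");
    }

    // Replaces digit sequences with a blank space.
    static String normalizeNumbers(String text) {
        return text.replaceAll("\\d+", " ");
    }

    public static void main(String[] args) {
        System.out.println(shrink("cooool")); // cool
        System.out.println(normalizeNumbers("room 42"));
    }
}
```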
| </div> |
| |
| <div class="section" title="Language Detector Tool"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.langdetect.classifying.cmdline"></a>Language Detector Tool</h2></div></div></div> |
| |
| <p> |
| The easiest way to try out the language detector is the command line tool. The tool is only |
| intended for demonstration and testing. The following command shows how to use the language detector tool. |
| </p><pre class="screen"> |
| |
| $ bin/opennlp LanguageDetector model |
| </pre><p> |
| The input is read from standard input and output is written to standard output, unless they are redirected |
| or piped. |
| </p> |
| </div> |
| <div class="section" title="Language Detector API"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.langdetect.classifying.api"></a>Language Detector API</h2></div></div></div> |
| |
| <p> |
| To perform classification you will need a machine learning model - |
| these are encapsulated in the LanguageDetectorModel class of OpenNLP tools. |
| </p> |
| <p> |
| First you need to grab the bytes from the serialized model on an InputStream - |
| we'll leave it to you to do that, since you were the one who serialized it to begin with. Now for the easy part: |
| </p><pre class="programlisting"> |
| |
| InputStream is = ... |
| LanguageDetectorModel m = <b class="hl-keyword">new</b> LanguageDetectorModel(is); |
| </pre><p> |
| With the LanguageDetectorModel in hand we are just about there: |
| </p><pre class="programlisting"> |
| |
| String inputText = ... |
| LanguageDetector myCategorizer = <b class="hl-keyword">new</b> LanguageDetectorME(m); |
| |
| <i class="hl-comment" style="color: silver">// Get the most probable language</i> |
| Language bestLanguage = myCategorizer.predictLanguage(inputText); |
| System.out.println(<b class="hl-string"><i style="color:red">"Best language: "</i></b> + bestLanguage.getLang()); |
| System.out.println(<b class="hl-string"><i style="color:red">"Best language confidence: "</i></b> + bestLanguage.getConfidence()); |
| |
| <i class="hl-comment" style="color: silver">// Get an array with the most probable languages</i> |
| Language[] languages = myCategorizer.predictLanguages(inputText); |
| </pre><p> |
| |
| Note that both the API and the CLI consider the complete text when choosing the most probable languages. |
| To handle mixed-language text, one can analyze smaller chunks of text to find language regions. |
| </p> |
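<p>To analyze smaller chunks as suggested above, the text can be split into fixed-size word windows and each window classified separately. The helper below is a hypothetical sketch using only the standard library; the window size and the whitespace-based splitting are assumptions, not part of the OpenNLP API:</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: split a text into fixed-size word windows so that
// each window can be classified with predictLanguage(...) on its own.
public class TextChunker {

    static List<String> chunk(String text, int wordsPerChunk) {
        String[] words = text.trim().split("\\s+");
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < words.length; i += wordsPerChunk) {
            int end = Math.min(i + wordsPerChunk, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // each chunk would then be passed to myCategorizer.predictLanguage(chunk)
        for (String chunk : chunk("This is English text und das ist Deutsch", 4)) {
            System.out.println(chunk);
        }
    }
}
```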
| </div> |
| <div class="section" title="Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.langdetect.training"></a>Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.langdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.leipzig">Training with Leipzig</a></span></dt><dt><span class="section"><a href="#tools.langdetect.training.api">Training API</a></span></dt></dl></div> |
| |
| <p> |
| The Language Detector can be trained on annotated training material. The data |
| can be in the OpenNLP Language Detector training format: one document per line, |
| containing the ISO-639-3 language code and the text, separated by a tab. Other formats may also be |
| available. |
| The following sample shows training data in the required format. |
| </p><pre class="screen"> |
| |
| spa A la fecha tres calles bonaerenses recuerdan su nombre (en Ituzaingó, Merlo y Campana). A la fecha, unas 50 \ |
| naves y 20 aviones se han perdido en esa área particular del océano Atlántico. |
| deu Alle Jahre wieder: Millionen Spanier haben am Dienstag die Auslosung in der größten Lotterie der Welt verfolgt.\ |
| Alle Jahre wieder: So gelingt der stressfreie Geschenke-Umtausch Artikel per E-Mail empfehlen So gelingt der \ |
| stressfre ie Geschenke-Umtausch Nicht immer liegt am Ende das unter dem Weihnachtsbaum, was man sich gewünscht hat. |
| srp Већина становника боравила је кућама од блата или шаторима, како би радили на својим удаљеним пољима у долини \ |
| Јордана и напасали своје стадо оваца и коза. Већина становника говори оба језика. |
| lav Egija Tri-Active procedūru īpaši iesaka izmantot siltākajos gadalaikos, jo ziemā aukstums var šķist arī \ |
| nepatīkams. Valdība vienojās, ka izmaiņas nodokļu politikā tiek konceptuāli atbalstītas, tomēr deva \ |
| nedēļu laika Ekonomikas ministrijai, Finanšu ministrijai un Labklājības ministrijai, lai ar vienotu \ |
| pozīciju atgrieztos pie jautājuma izskatīšanas. |
| </pre><p> |
| Note: The line breaks marked with a backslash are just inserted for formatting purposes and must not be |
| included in the training data. |
| </p> |
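<p>The training format above is simple enough to parse by hand: each line is split at the first tab into a language code and the document text. The sketch below is only an illustration; when training via the API, OpenNLP's own LanguageDetectorSampleStream performs this parsing for you:</p>

```java
// Minimal parser for the one-document-per-line training format:
// an ISO-639-3 language code and the text, separated by a tab.
public class LanguageSampleLine {

    final String lang;
    final String text;

    LanguageSampleLine(String line) {
        // split on the first tab only; the text itself may contain tabs
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Missing tab separator: " + line);
        }
        this.lang = line.substring(0, tab);
        this.text = line.substring(tab + 1);
    }

    public static void main(String[] args) {
        LanguageSampleLine sample =
            new LanguageSampleLine("deu\tAlle Jahre wieder ...");
        System.out.println(sample.lang); // deu
    }
}
```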
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.langdetect.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| The following command will train the language detector and write the model to langdetect.bin: |
| </p><pre class="screen"> |
| |
| $ bin/opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params paramsFile] [-factory factoryName] -data sampleData [-encoding charsetName] |
| |
| </pre><p> |
| Note: To customize the language detector, extend the class opennlp.tools.langdetect.LanguageDetectorFactory, |
| add it to the classpath, and pass it in the -factory argument. |
| </p> |
| </div> |
| <div class="section" title="Training with Leipzig"><div class="titlepage"><div><div><h3 class="title"><a name="tools.langdetect.training.leipzig"></a>Training with Leipzig</h3></div></div></div> |
| |
| <p> |
| The Leipzig Corpora collection provides corpora in different languages. Each corpus is a collection |
| of individual sentences collected from the web and from newspapers. The corpora are available as plain text |
| and as MySQL database tables. The OpenNLP integration can only use the plain text version. |
| The individual plain text packages can be downloaded here: |
| <a class="ulink" href="http://corpora.uni-leipzig.de/download.html" target="_top">http://corpora.uni-leipzig.de/download.html</a> |
| </p> |
| <p> |
| This corpus collection is especially well suited for training the Language Detector, and a converter is provided. First, you need to |
| download the files that compose the Leipzig Corpora collection to a folder. The Apache OpenNLP Language |
| Detector supports training, evaluation and cross validation using the Leipzig Corpora. For example, |
| the following command shows how to train a model. |
| |
| </p><pre class="screen"> |
| |
| $ bin/opennlp LanguageDetectorTrainer.leipzig -model modelFile [-params paramsFile] [-factory factoryName] \ |
| -sentencesDir sentencesDir -sentencesPerSample sentencesPerSample -samplesPerLanguage samplesPerLanguage \ |
| [-encoding charsetName] |
| |
| </pre><p> |
| |
| </p> |
| <p> |
| The following sequence of commands shows how to convert the Leipzig Corpora collection in the folder |
| leipzig-train/ to the default Language Detector format, creating groups of 5 sentences as documents |
| and limiting the corpus to 10000 documents per language. It then shuffles the result and selects the first |
| 100000 lines as the training corpus and the last 20000 as the evaluation corpus: |
| </p><pre class="screen"> |
| |
| $ bin/opennlp LanguageDetectorConverter leipzig -sentencesDir leipzig-train/ -sentencesPerSample 5 -samplesPerLanguage 10000 > leipzig.txt |
| $ perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < leipzig.txt > leipzig_shuf.txt |
| $ head -100000 < leipzig_shuf.txt > leipzig.train |
| $ tail -20000 < leipzig_shuf.txt > leipzig.eval |
| |
| </pre><p> |
| </p> |
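<p>The shuffle-and-split step can also be done in plain Java instead of the perl/head/tail pipeline. The sketch below separates the logic into a testable method; the file names and split sizes mirror the example above and are assumptions to adjust for your corpus:</p>

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Plain-Java alternative to the perl/head/tail pipeline above.
public class ShuffleSplit {

    // Shuffles the lines and returns [trainLines, evalLines]; the two
    // splits never overlap, even if the corpus is smaller than requested.
    static List<List<String>> shuffleAndSplit(List<String> lines,
                                              int trainSize, int evalSize, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed)); // fixed seed for reproducibility
        List<String> train = copy.subList(0, Math.min(trainSize, copy.size()));
        List<String> eval = copy.subList(
            Math.max(copy.size() - evalSize, train.size()), copy.size());
        return List.of(train, eval);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("leipzig.txt"));
        List<List<String>> split = shuffleAndSplit(lines, 100_000, 20_000, 42L);
        Files.write(Paths.get("leipzig.train"), split.get(0));
        Files.write(Paths.get("leipzig.eval"), split.get(1));
    }
}
```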
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.langdetect.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The following example shows how to train a model via the API. |
| </p><pre class="programlisting"> |
| |
| InputStreamFactory inputStreamFactory = <b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"corpus.txt"</i></b>)); |
| |
| ObjectStream<String> lineStream = |
| <b class="hl-keyword">new</b> PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_<span class="hl-number">8</span>); |
| ObjectStream<LanguageSample> sampleStream = <b class="hl-keyword">new</b> LanguageDetectorSampleStream(lineStream); |
| |
| TrainingParameters params = ModelUtil.createDefaultTrainingParameters(); |
| params.put(TrainingParameters.ALGORITHM_PARAM, |
| PerceptronTrainer.PERCEPTRON_VALUE); |
| params.put(TrainingParameters.CUTOFF_PARAM, <span class="hl-number">0</span>); |
| |
| LanguageDetectorFactory factory = <b class="hl-keyword">new</b> LanguageDetectorFactory(); |
| |
| LanguageDetectorModel model = LanguageDetectorME.train(sampleStream, params, factory); |
| model.serialize(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"langdetect.bin"</i></b>)); |
| |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 3. Sentence Detector"><div class="titlepage"><div><div><h2 class="title"><a name="tools.sentdetect"></a>Chapter 3. Sentence Detector</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.sentdetect.detection">Sentence Detection</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.detection.cmdline">Sentence Detection Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.detection.api">Sentence Detection API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.training">Sentence Detector Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.sentdetect.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.sentdetect.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| <div class="section" title="Sentence Detection"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.detection"></a>Sentence Detection</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.sentdetect.detection.cmdline">Sentence Detection Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.detection.api">Sentence Detection API</a></span></dt></dl></div> |
| |
| <p> |
| The OpenNLP Sentence Detector can detect whether a punctuation character |
| marks the end of a sentence or not. In this sense a sentence is defined |
| as the longest whitespace-trimmed character sequence between two punctuation |
| marks. The first and last sentences are exceptions to this rule. The first |
| non-whitespace character is assumed to be the beginning of a sentence, and the |
| last non-whitespace character is assumed to be a sentence end. |
| The sample text below should be segmented into its sentences. |
| </p><pre class="screen"> |
| |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is |
| chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years |
| old and former chairman of Consolidated Gold Fields PLC, was named a director of this |
| British industrial conglomerate. |
| </pre><p> |
| After detecting the sentence boundaries each sentence is written in its own line. |
| </p><pre class="screen"> |
| |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. |
| Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. |
| Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, |
| was named a director of this British industrial conglomerate. |
| </pre><p> |
| Usually Sentence Detection is done before the text is tokenized; this is how the pre-trained models on the web site were trained. |
| It is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text. |
| The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence in an article, where the title is mistakenly identified as the first part of the first sentence. |
| Most components in OpenNLP expect input which is segmented into sentences. |
| </p> |
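<p>The whitespace-and-punctuation definition above can be illustrated with a toy rule-based splitter. This is deliberately naive and is not how SentenceDetectorME works (it uses a trained model, not rules); note how the sketch wrongly splits after an abbreviation such as "Mr.":</p>

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the sentence definition: every '.', '!' or '?'
// followed by whitespace (or end of text) ends a sentence, and the
// surrounding whitespace is trimmed. Without a model it cannot tell
// abbreviations from real sentence ends.
public class NaiveSentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean endOfText = (i == text.length() - 1);
            if ((c == '.' || c == '!' || c == '?')
                    && (endOfText || Character.isWhitespace(text.charAt(i + 1)))) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) sentences.add(s);
                start = i + 1;
            }
        }
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) sentences.add(rest);
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("First sentence. Second sentence!")) {
            System.out.println(s);
        }
    }
}
```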
| |
| <div class="section" title="Sentence Detection Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.detection.cmdline"></a>Sentence Detection Tool</h3></div></div></div> |
| |
| <p> |
| The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing. |
| Download the English sentence detector model and start the Sentence Detector Tool with this command: |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetector en-sent.bin |
| </pre><p> |
| Just copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console. |
| Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command. |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt |
| </pre><p> |
| For the English sentence model from the website the input text should not be tokenized. |
| </p> |
| </div> |
| <div class="section" title="Sentence Detection API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.detection.api"></a>Sentence Detection API</h3></div></div></div> |
| |
| <p> |
| The Sentence Detector can be easily integrated into an application via its API. |
| To instantiate the Sentence Detector the sentence model must be loaded first. |
| </p><pre class="programlisting"> |
| |
| |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-sent.bin"</i></b>)) { |
| SentenceModel model = <b class="hl-keyword">new</b> SentenceModel(modelIn); |
| } |
| </pre><p> |
| After the model is loaded the SentenceDetectorME can be instantiated. |
| </p><pre class="programlisting"> |
| |
| SentenceDetectorME sentenceDetector = <b class="hl-keyword">new</b> SentenceDetectorME(model); |
| </pre><p> |
| The Sentence Detector can output an array of Strings, where each String is one sentence. |
| </p><pre class="programlisting"> |
| |
| String sentences[] = sentenceDetector.sentDetect(<b class="hl-string"><i style="color:red">" First sentence. Second sentence. "</i></b>); |
| </pre><p> |
| The result array now contains two entries. The first String is "First sentence." and the |
| second String is "Second sentence." The whitespace before, between and after the input String is removed. |
| The API also offers a method which simply returns the span of the sentence in the input string. |
| </p><pre class="programlisting"> |
| |
| Span sentences[] = sentenceDetector.sentPosDetect(<b class="hl-string"><i style="color:red">" First sentence. Second sentence. "</i></b>); |
| </pre><p> |
| The result array again contains two entries. The first span begins at index 2 and ends at |
| 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span. |
| </p> |
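<p>Since the spans are plain begin/end character offsets, the covered text is just a substring of the input. The sketch below mimics Span.getCoveredText with String.substring, assuming the input string has two leading spaces, as the reported offsets imply:</p>

```java
// The offsets match those reported in the text above: the first span
// covers [2, 17) and the second covers [18, 34).
public class SpanDemo {
    public static void main(String[] args) {
        String input = "  First sentence. Second sentence. ";
        System.out.println(input.substring(2, 17));  // First sentence.
        System.out.println(input.substring(18, 34)); // Second sentence.
    }
}
```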
| </div> |
| </div> |
| <div class="section" title="Sentence Detector Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.training"></a>Sentence Detector Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.sentdetect.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.sentdetect.training.api">Training API</a></span></dt></dl></div> |
| |
| <p></p> |
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models available from the model |
| download page on various corpora. The data must be converted to the OpenNLP Sentence Detector |
| training format. Which is one sentence per line. An empty line indicates a document boundary. |
| In case the document boundary is unknown, its recommended to have an empty line every few ten |
| sentences. Exactly like the output in the sample above. |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetectorTrainer |
| Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ |
| [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \ |
| -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| To train an English sentence detector use the following command: |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8 |
| |
| </pre><p> |
| It should produce the following output: |
| </p><pre class="screen"> |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 4883 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 4883 events to 2945. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 2945 |
| Number of Outcomes: 2 |
| Number of Predicates: 467 |
| ...done. |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: .. loglikelihood=-3384.6376826743144 0.38951464263772273 |
| 2: .. loglikelihood=-2191.9266688597672 0.9397911120212984 |
| 3: .. loglikelihood=-1645.8640771555981 0.9643661683391358 |
| 4: .. loglikelihood=-1340.386303774519 0.9739913987302887 |
| 5: .. loglikelihood=-1148.4141548519624 0.9748105672742167 |
| |
| ...<skipping a bunch of iterations>... |
| |
| 95: .. loglikelihood=-288.25556805874436 0.9834118369854598 |
| 96: .. loglikelihood=-287.2283680343481 0.9834118369854598 |
| 97: .. loglikelihood=-286.2174830344526 0.9834118369854598 |
| 98: .. loglikelihood=-285.222486981048 0.9834118369854598 |
| 99: .. loglikelihood=-284.24296917223916 0.9834118369854598 |
| 100: .. loglikelihood=-283.2785335773966 0.9834118369854598 |
| Wrote sentence detector model. |
| Path: en-sent.bin |
| |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Sentence Detector also offers an API to train a new sentence detection model. |
| Basically three steps are necessary to train it: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>The application must open a sample data stream</p> |
| </li><li class="listitem"> |
| <p>Call the SentenceDetectorME.train method</p> |
| </li><li class="listitem"> |
| <p>Save the SentenceModel to a file or directly use it</p> |
| </li></ul></div><p> |
| The following sample code illustrates these steps: |
| </p><pre class="programlisting"> |
| |
| |
| ObjectStream<String> lineStream = |
| <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-sent.train"</i></b>)), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| SentenceModel model; |
| |
| <b class="hl-keyword">try</b> (ObjectStream<SentenceSample> sampleStream = <b class="hl-keyword">new</b> SentenceSampleStream(lineStream)) { |
| model = SentenceDetectorME.train(<b class="hl-string"><i style="color:red">"eng"</i></b>, sampleStream, |
| <b class="hl-keyword">new</b> SentenceDetectorFactory(<b class="hl-string"><i style="color:red">"eng"</i></b>, true, null, null), TrainingParameters.defaultParams()); |
| } |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| } |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.sentdetect.eval"></a>Evaluation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.sentdetect.eval.tool">Evaluation Tool</a></span></dt></dl></div> |
| |
| <p> |
| </p> |
| <div class="section" title="Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.sentdetect.eval.tool"></a>Evaluation Tool</h3></div></div></div> |
| |
| <p> |
| The following command shows how the evaluator tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8 |
| |
| Loading model ... done |
| Evaluating ... done |
| |
| Precision: 0.9465737514518002 |
| Recall: 0.9095982142857143 |
| F-Measure: 0.9277177006260672 |
| </pre><p> |
| The en-sent.eval file has the same format as the training data. |
| </p> |
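<p>The reported F-Measure is the harmonic mean of precision and recall, so it can be reproduced from the two numbers above. A quick self-contained check in plain Java (an illustration only, not part of the OpenNLP API):</p>

```java
public class FMeasureCheck {

    // F = 2 * P * R / (P + R), the harmonic mean of precision and recall.
    public static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double p = 0.9465737514518002;
        double r = 0.9095982142857143;
        // reproduces the F-Measure printed by the evaluator (~0.9277)
        System.out.println(fMeasure(p, r));
    }
}
```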
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 4. Tokenizer"><div class="titlepage"><div><div><h2 class="title"><a name="tools.tokenizer"></a>Chapter 4. Tokenizer</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.tokenizer.introduction">Tokenization</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.cmdline">Tokenizer Tools</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.training">Tokenizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.tokenizer.detokenizing">Detokenizing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.detokenizing.dict">Detokenizer Dictionary</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| <div class="section" title="Tokenization"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.introduction"></a>Tokenization</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.tokenizer.cmdline">Tokenizer Tools</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.api">Tokenizer API</a></span></dt></dl></div> |
| |
| <p> |
| The OpenNLP Tokenizers segment an input character sequence into |
| tokens. Tokens are usually |
| words, punctuation, numbers, etc. |
| |
| </p><pre class="screen"> |
| |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. |
| Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. |
| Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields |
| PLC, was named a director of this British industrial conglomerate. |
| |
| </pre><p> |
| |
| The following result shows the individual tokens in a whitespace |
| separated representation. |
| |
| </p><pre class="screen"> |
| |
| Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group . |
| Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , |
| was named a nonexecutive director of this British industrial conglomerate . |
| A form of asbestos once used to make Kent cigarette filters has caused a high |
| percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , |
| researchers reported . |
| |
| </pre><p> |
| |
| OpenNLP offers multiple tokenizer implementations: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>Whitespace Tokenizer - A whitespace tokenizer; non-whitespace |
| sequences are identified as tokens</p> |
| </li><li class="listitem"> |
| <p>Simple Tokenizer - A character class tokenizer, sequences of |
| the same character class are tokens</p> |
| </li><li class="listitem"> |
| <p>Learnable Tokenizer - A maximum entropy tokenizer, which detects |
| token boundaries based on a probability model</p> |
| </li></ul></div><p> |
| |
| Most part-of-speech taggers, parsers, and so on work with text |
| tokenized in this manner. It is important to ensure that your |
| tokenizer produces tokens of the type expected by your later text |
| processing components. |
| </p> |
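<p>The difference between the whitespace strategy and the character class strategy can be sketched in a few lines of plain Java. This illustrates the idea only and is not the actual SimpleTokenizer implementation:</p>

```java
import java.util.ArrayList;
import java.util.List;

public class CharClassTokenizer {

    // Character classes used to decide token boundaries.
    private static int charClass(char c) {
        if (Character.isWhitespace(c)) return 0;
        if (Character.isLetter(c)) return 1;
        if (Character.isDigit(c)) return 2;
        return 3; // punctuation and everything else
    }

    // Sequences of characters of the same class form one token,
    // mirroring the behavior described for the Simple Tokenizer.
    // A whitespace tokenizer would instead only split at class 0.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = -1;
        for (char c : text.toCharArray()) {
            int cls = charClass(c);
            if (cls != prevClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 0) { // whitespace never becomes part of a token
                current.append(c);
            }
            prevClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // the period becomes its own token because its class differs
        System.out.println(tokenize("Mr. Vinken, 61 years old."));
    }
}
```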
| |
| <p> |
| With OpenNLP (as with many systems), tokenization is a two-stage |
| process: first, sentence boundaries are identified, then tokens |
| within each sentence are identified. |
| </p> |
| |
| <div class="section" title="Tokenizer Tools"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.cmdline"></a>Tokenizer Tools</h3></div></div></div> |
| |
| <p>The easiest way to try out the tokenizers is via the command line |
| tools. The tools are only intended for demonstration and testing. |
| </p> |
| <p>There are two tools, one for the Simple Tokenizer and one for |
| the learnable tokenizer. A command line tool for the Whitespace |
| Tokenizer does not exist, because the whitespace separated output |
| would be identical to the input.</p> |
| <p> |
| The following command shows how to use the Simple Tokenizer Tool. |
| |
| </p><pre class="screen"> |
| |
| $ opennlp SimpleTokenizer |
| </pre><p> |
| To use the learnable tokenizer, download the English token model from |
| our website. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerME en-token.bin |
| </pre><p> |
| To test the tokenizer copy the sample from above to the console. The |
| whitespace separated tokens will be written back to the |
| console. |
| </p> |
| <p> |
| Usually the input is read from a file and written to a file. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt |
| </pre><p> |
| It can be done in the same way for the Simple Tokenizer. |
| </p> |
| <p> |
| Since most text comes truly raw and doesn't have sentence boundaries |
| and such, it's possible to create a pipe which first performs sentence |
| boundary detection and then tokenization. The following sample |
| illustrates that. |
| </p><pre class="screen"> |
| |
| $ opennlp SentenceDetector sentdetect.model < article.txt | opennlp TokenizerME tokenize.model | more |
| Loading model ... Loading model ... done |
| done |
| Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500. |
| Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 . |
| Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 . |
| Marubeni advanced 11 to 890 . |
| London share prices were bolstered largely by continued gains on Wall Street and technical |
| factors affecting demand for London 's blue-chip stocks . |
| ...etc... |
| </pre><p> |
| Of course this is all on the command line. Many people use the models |
| directly in their Java code by creating SentenceDetector and |
| Tokenizer objects and calling their methods as appropriate. The |
| following section will explain how the Tokenizers can be used |
| directly from Java. |
| </p> |
| </div> |
| |
| <div class="section" title="Tokenizer API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.api"></a>Tokenizer API</h3></div></div></div> |
| |
| <p> |
| The Tokenizers can be integrated into an application by the defined |
| API. |
| The shared instance of the WhitespaceTokenizer can be retrieved from a |
| static field WhitespaceTokenizer.INSTANCE. The shared instance of the |
| SimpleTokenizer can be retrieved in the same way from |
| SimpleTokenizer.INSTANCE. |
| To instantiate the TokenizerME (the learnable tokenizer) a Token Model |
| must be created first. The following code sample shows how a model |
| can be loaded. |
| </p><pre class="programlisting"> |
| |
| |
| TokenizerModel model; |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-token.bin"</i></b>)) { |
| model = <b class="hl-keyword">new</b> TokenizerModel(modelIn); |
| } |
| </pre><p> |
| After the model is loaded the TokenizerME can be instantiated. |
| </p><pre class="programlisting"> |
| |
| Tokenizer tokenizer = <b class="hl-keyword">new</b> TokenizerME(model); |
| </pre><p> |
| The tokenizer offers two tokenize methods, both expect an input |
| String object which contains the untokenized text. If possible it |
| should be a sentence, but depending on the training of the learnable |
| tokenizer this is not required. The first returns an array of |
| Strings, where each String is one token. |
| </p><pre class="programlisting"> |
| |
| String tokens[] = tokenizer.tokenize(<b class="hl-string"><i style="color:red">"An input sample sentence."</i></b>); |
| </pre><p> |
| The output will be an array with these tokens. |
| </p><pre class="programlisting"> |
| |
| "An", "input", "sample", "sentence", "." |
| </pre><p> |
| The second method, tokenizePos, returns an array of Spans. Each Span |
| contains the begin and end character offsets of the token in the input |
| String. |
| </p><pre class="programlisting"> |
| |
| Span tokenSpans[] = tokenizer.tokenizePos(<b class="hl-string"><i style="color:red">"An input sample sentence."</i></b>); |
| </pre><p> |
| The tokenSpans array now contains 5 elements. To get the text for one |
| span, call Span.getCoveredText, which takes a span and the input text. |
| |
| The TokenizerME is able to output the probabilities for the detected |
| tokens. The getTokenProbabilities method must be called directly |
| after one of the tokenize methods was called. |
| </p><pre class="programlisting"> |
| |
| TokenizerME tokenizer = ... |
| |
| String tokens[] = tokenizer.tokenize(...); |
| <b class="hl-keyword">double</b> tokenProbs[] = tokenizer.getTokenProbabilities(); |
| </pre><p> |
| The tokenProbs array now contains one double value per token, the |
| value is between 0 and 1, where 1 is the highest possible probability |
| and 0 the lowest possible probability. |
| </p> |
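<p>Since the Span offsets are plain character indices into the input String, the covered text can conceptually be recovered with a substring operation. A small stand-alone illustration (the offsets below are hand-computed for this sentence):</p>

```java
public class SpanDemo {

    public static void main(String[] args) {
        String text = "An input sample sentence.";
        // begin/end character offsets per token; the end offset is exclusive
        int[][] spans = { {0, 2}, {3, 8}, {9, 15}, {16, 24}, {24, 25} };
        for (int[] span : spans) {
            // conceptually what Span.getCoveredText does with the input text
            System.out.println(text.substring(span[0], span[1]));
        }
    }
}
```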
| </div> |
| </div> |
| |
| <div class="section" title="Tokenizer Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.training"></a>Tokenizer Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.tokenizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.training.api">Training API</a></span></dt></dl></div> |
| |
| |
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models |
| available from the model download page on various corpora. The data |
| can be converted to the OpenNLP Tokenizer training format or used directly. |
| The OpenNLP format contains one sentence per line. Tokens are either separated by |
| whitespace or by a special <SPLIT> tag. Tokens are split automatically on whitespace, |
| and at least one <SPLIT> tag must be present in the training text. |
| |
| The following sample shows the sample from above in the correct format. |
| </p><pre class="screen"> |
| |
| Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>. |
| Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>. |
| Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, |
| was named a nonexecutive director of this British industrial conglomerate<SPLIT>. |
| </pre><p> |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerTrainer |
| Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ |
| [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations num] \ |
| [-cutoff num] -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| To train the English tokenizer use the following command: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt true -lang en -data en-token.train -encoding UTF-8 |
| |
| Indexing events with TwoPass using cutoff of 5 |
| |
| Computing event counts... done. 45 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 45 events to 25. |
| Done indexing in 0,09 s. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 25 |
| Number of Outcomes: 2 |
| Number of Predicates: 18 |
| ...done. |
| Computing model parameters ... |
| Performing 100 iterations. |
| 1: ... loglikelihood=-31.191623125197527 0.8222222222222222 |
| 2: ... loglikelihood=-21.036561339080343 0.8666666666666667 |
| 3: ... loglikelihood=-16.397882721809086 0.9333333333333333 |
| 4: ... loglikelihood=-13.624159882595462 0.9333333333333333 |
| 5: ... loglikelihood=-11.762067054883842 0.9777777777777777 |
| |
| ...<skipping a bunch of iterations>... |
| |
| 95: ... loglikelihood=-2.0234942537226366 1.0 |
| 96: ... loglikelihood=-2.0107265117555935 1.0 |
| 97: ... loglikelihood=-1.998139365828305 1.0 |
| 98: ... loglikelihood=-1.9857283791639697 1.0 |
| 99: ... loglikelihood=-1.9734892753591327 1.0 |
| 100: ... loglikelihood=-1.9614179307958106 1.0 |
| Writing tokenizer model ... done (0,044s) |
| |
| Wrote tokenizer model to |
| Path: en-token.bin |
| </pre><p> |
| </p> |
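<p>A line in this training format can be turned back into its tokens with plain string handling, splitting on whitespace and on the <SPLIT> marker. A rough sketch (an illustration only, not part of the OpenNLP API):</p>

```java
import java.util.ArrayList;
import java.util.List;

public class SplitFormatParser {

    // Tokens are separated either by whitespace or by the <SPLIT> marker.
    public static List<String> parseLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : line.split("\\s+")) {
            for (String token : chunk.split("<SPLIT>")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("Mr. Vinken is chairman of Elsevier N.V.<SPLIT>,"));
    }
}
```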
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Tokenizer offers an API to train a new tokenization model. Basically three steps |
| are necessary to train it: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>The application must open a sample data stream</p> |
| </li><li class="listitem"> |
| <p>Call the TokenizerME.train method</p> |
| </li><li class="listitem"> |
| <p>Save the TokenizerModel to a file or directly use it</p> |
| </li></ul></div><p> |
| The following sample code illustrates these steps: |
| </p><pre class="programlisting"> |
| |
| ObjectStream<String> lineStream = <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-token.train"</i></b>)), |
| StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| TokenizerModel model; |
| |
| <b class="hl-keyword">try</b> (ObjectStream<TokenSample> sampleStream = <b class="hl-keyword">new</b> TokenSampleStream(lineStream)) { |
| model = TokenizerME.train(sampleStream, |
| TokenizerFactory.create(null, <b class="hl-string"><i style="color:red">"eng"</i></b>, null, true, null), TrainingParameters.defaultParams()); |
| } |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| } |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="Detokenizing"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.tokenizer.detokenizing"></a>Detokenizing</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.tokenizer.detokenizing.api">Detokenizing API</a></span></dt><dt><span class="section"><a href="#tools.tokenizer.detokenizing.dict">Detokenizer Dictionary</a></span></dt></dl></div> |
| |
| <p> |
| Detokenizing is simply the opposite of tokenization: the original non-tokenized string should |
| be reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization |
| of training data for the tokenizer. It can also be used to undo the output of such a trained |
| tokenizer. The implementation is strictly rule based and defines how tokens should be attached |
| to a sentence-wise character sequence. |
| </p> |
| <p> |
| The rule dictionary assigns to every token an operation which describes how it should be attached |
| to one continuous character sequence. |
| </p> |
| <p> |
| The following rules can be assigned to a token: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>MERGE_TO_LEFT - Merges the token to the left side.</p> |
| </li><li class="listitem"> |
| <p>MERGE_TO_RIGHT - Merges the token to the right side.</p> |
| </li><li class="listitem"> |
| <p>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence |
| and to the left side on second occurrence.</p> |
| </li></ul></div><p> |
| |
| The following sample illustrates how the detokenizer works with a small |
| rule dictionary (illustration format, not the XML data format): |
| </p><pre class="programlisting"> |
| |
| . MERGE_TO_LEFT |
| " RIGHT_LEFT_MATCHING |
| </pre><p> |
| The dictionary should be used to de-tokenize the following whitespace tokenized sentence: |
| </p><pre class="programlisting"> |
| |
| He said " This is a test " . |
| </pre><p> |
| The tokens would get these tags based on the dictionary: |
| </p><pre class="programlisting"> |
| |
| He -> NO_OPERATION |
| said -> NO_OPERATION |
| " -> MERGE_TO_RIGHT |
| This -> NO_OPERATION |
| is -> NO_OPERATION |
| a -> NO_OPERATION |
| test -> NO_OPERATION |
| " -> MERGE_TO_LEFT |
| . -> MERGE_TO_LEFT |
| </pre><p> |
| That will result in the following character sequence: |
| </p><pre class="programlisting"> |
| |
| He said "This is a test". |
| </pre><p> |
| </p> |
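<p>The tagging walk-through above can be reproduced with straightforward bookkeeping: a MERGE_TO_LEFT operation suppresses the space before a token, and MERGE_TO_RIGHT suppresses the space after it. A self-contained sketch of the idea (this is not the OpenNLP implementation, and it omits the state handling needed to resolve RIGHT_LEFT_MATCHING into left/right merges):</p>

```java
import java.util.List;

public class MergeRules {

    public enum Op { NO_OPERATION, MERGE_TO_LEFT, MERGE_TO_RIGHT }

    // Build the detokenized string from parallel token and operation lists.
    public static String apply(List<String> tokens, List<Op> ops) {
        StringBuilder sb = new StringBuilder();
        boolean suppressSpace = true; // no space before the first token
        for (int i = 0; i < tokens.size(); i++) {
            if (!suppressSpace && ops.get(i) != Op.MERGE_TO_LEFT) {
                sb.append(' ');
            }
            sb.append(tokens.get(i));
            suppressSpace = ops.get(i) == Op.MERGE_TO_RIGHT;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // the tag assignment from the walk-through above
        List<String> tokens = List.of("He", "said", "\"", "This", "is", "a", "test", "\"", ".");
        List<Op> ops = List.of(Op.NO_OPERATION, Op.NO_OPERATION, Op.MERGE_TO_RIGHT,
                Op.NO_OPERATION, Op.NO_OPERATION, Op.NO_OPERATION, Op.NO_OPERATION,
                Op.MERGE_TO_LEFT, Op.MERGE_TO_LEFT);
        System.out.println(apply(tokens, ops));
    }
}
```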
| <div class="section" title="Detokenizing API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.detokenizing.api"></a>Detokenizing API</h3></div></div></div> |
| |
| <p> |
| The Detokenizer can be used to detokenize tokens into a String. |
| To instantiate the DictionaryDetokenizer (a rule based detokenizer), |
| a DetokenizationDictionary (the rule dictionary) must be created first. |
| The following code sample shows how a rule dictionary can be loaded. |
| </p><pre class="programlisting"> |
| |
| DetokenizationDictionary dict; |
| <b class="hl-keyword">try</b> (InputStream dictIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"latin-detokenizer.xml"</i></b>)) { |
| dict = <b class="hl-keyword">new</b> DetokenizationDictionary(dictIn); |
| } |
| </pre><p> |
| After the rule dictionary is loaded the DictionaryDetokenizer can be instantiated. |
| </p><pre class="programlisting"> |
| |
| Detokenizer detokenizer = <b class="hl-keyword">new</b> DictionaryDetokenizer(dict); |
| </pre><p> |
| The detokenizer offers two detokenize methods; the first detokenizes the input tokens into a String. |
| </p><pre class="programlisting"> |
| |
| String[] tokens = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"A"</i></b>, <b class="hl-string"><i style="color:red">"co"</i></b>, <b class="hl-string"><i style="color:red">"-"</i></b>, <b class="hl-string"><i style="color:red">"worker"</i></b>, <b class="hl-string"><i style="color:red">"helped"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b>}; |
| String sentence = detokenizer.detokenize(tokens, null); |
| Assert.assertEquals(<b class="hl-string"><i style="color:red">"A co-worker helped."</i></b>, sentence); |
| </pre><p> |
| Tokens which are connected without a space in-between can be separated by a split marker. |
| </p><pre class="programlisting"> |
| |
| String sentence = detokenizer.detokenize(tokens, <b class="hl-string"><i style="color:red">"<SPLIT>"</i></b>); |
| Assert.assertEquals(<b class="hl-string"><i style="color:red">"A co<SPLIT>-<SPLIT>worker helped<SPLIT>."</i></b>, sentence); |
| </pre><p> |
| The API also offers a method which simply returns the detokenization operations for the tokens in the input array. |
| </p><pre class="programlisting"> |
| |
| DetokenizationOperation[] operations = detokenizer.detokenize(tokens); |
| <b class="hl-keyword">for</b> (DetokenizationOperation operation : operations) { |
| System.out.println(operation); |
| } |
| </pre><p> |
| Output: |
| </p><pre class="programlisting"> |
| |
| NO_OPERATION |
| NO_OPERATION |
| MERGE_BOTH |
| NO_OPERATION |
| NO_OPERATION |
| MERGE_TO_LEFT |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Detokenizer Dictionary"><div class="titlepage"><div><div><h3 class="title"><a name="tools.tokenizer.detokenizing.dict"></a>Detokenizer Dictionary</h3></div></div></div> |
| |
| <p> |
| The DetokenizationDictionary is the rule dictionary used by the detokenizer. |
| It is constructed from two parallel arrays: |
| tokens - an array of tokens that should be detokenized according to an operation. |
| operations - an array of operations which specifies the operation |
| to use for the corresponding token. |
| The following code sample shows how a rule dictionary can be created. |
| </p><pre class="programlisting"> |
| |
| String[] tokens = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"."</i></b>, <b class="hl-string"><i style="color:red">"!"</i></b>, <b class="hl-string"><i style="color:red">"("</i></b>, <b class="hl-string"><i style="color:red">")"</i></b>, <b class="hl-string"><i style="color:red">"\""</i></b>, <b class="hl-string"><i style="color:red">"-"</i></b>}; |
| Operation[] operations = <b class="hl-keyword">new</b> Operation[]{ |
| Operation.MERGE_TO_LEFT, |
| Operation.MERGE_TO_LEFT, |
| Operation.MERGE_TO_RIGHT, |
| Operation.MERGE_TO_LEFT, |
| Operation.RIGHT_LEFT_MATCHING, |
| Operation.MERGE_BOTH}; |
| DetokenizationDictionary dict = <b class="hl-keyword">new</b> DetokenizationDictionary(tokens, operations); |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 5. Name Finder"><div class="titlepage"><div><div><h2 class="title"><a name="tools.namefind"></a>Chapter 5. Name Finder</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.namefind.recognition">Named Entity Recognition</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.recognition.cmdline">Name Finder Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.recognition.api">Name Finder API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.namefind.training">Name Finder Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen">Custom Feature Generation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.featuregen.api">Feature Generation defined by API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen.xml">Feature Generation defined by XML Descriptor</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.namefind.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.eval.tool">Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.namefind.annotation_guides">Named Entity Annotation Guidelines</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="Named Entity Recognition"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.recognition"></a>Named Entity Recognition</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.recognition.cmdline">Name Finder Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.recognition.api">Name Finder API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></div> |
| |
| <p> |
| The Name Finder can detect named entities and numbers in text. To be able to |
| detect entities the Name Finder needs a model. The model is dependent on the |
| language and entity type it was trained for. The OpenNLP project offers a number |
| of pre-trained name finder models which are trained on various freely available corpora. |
| They can be downloaded at our model download page. To find names in raw text the text |
| must be segmented into tokens and sentences. A detailed description is given in the |
| sentence detector and tokenizer tutorial. It is important that the tokenization for |
| the training data and the input text is identical. |
| </p> |
| |
| <div class="section" title="Name Finder Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.recognition.cmdline"></a>Name Finder Tool</h3></div></div></div> |
| |
| <p> |
| The easiest way to try out the Name Finder is the command line tool. |
| The tool is only intended for demonstration and testing. Download the |
| English |
| person model and start the Name Finder Tool with this command: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinder en-ner-person.bin |
| </pre><p> |
| |
| The name finder now reads a tokenized sentence per line from stdin, an empty |
| line indicates a document boundary and resets the adaptive feature generators. |
| Just copy this text to the terminal: |
| |
| </p><pre class="screen"> |
| |
| Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr . Vinken is chairman of Elsevier N.V. , the Dutch publishing group . |
| Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named |
| a director of this British industrial conglomerate . |
| </pre><p> |
| The name finder will now output the text with markup for person names: |
| </p><pre class="screen"> |
| |
| <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . |
| <START:person> Rudolph Agnew <END> , 55 years old and former chairman of Consolidated Gold Fields PLC , |
| was named a director of this British industrial conglomerate . |
| </pre><p> |
| </p> |
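<p>The <START:person> ... <END> markup shown above is easy to post-process. A rough stand-alone sketch that pulls the covered names out of one marked-up line (an illustration only, not part of the OpenNLP API):</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameMarkupParser {

    // Matches the name finder output markup, e.g. <START:person> Pierre Vinken <END>
    private static final Pattern NAME = Pattern.compile("<START:person>(.*?)<END>");

    public static List<String> extractNames(String markedUpLine) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(markedUpLine);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Pierre Vinken <END> , 61 years old , will join the board .";
        System.out.println(extractNames(line));
    }
}
```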
| </div> |
| <div class="section" title="Name Finder API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.recognition.api"></a>Name Finder API</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></div> |
| |
| <p> |
| To use the Name Finder in a production system it is strongly recommended to embed it |
| directly into the application instead of using the command line interface. |
| First the name finder model must be loaded into memory from disk or another source. |
| In the sample below it is loaded from disk. |
| </p><pre class="programlisting"> |
| |
| TokenNameFinderModel model; |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-ner-person.bin"</i></b>)) { |
| model = <b class="hl-keyword">new</b> TokenNameFinderModel(modelIn); |
| } |
| |
| </pre><p> |
| There are a number of reasons why model loading can fail: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>Issues with the underlying I/O</p> |
| </li><li class="listitem"> |
| <p>The version of the model is not compatible with the OpenNLP version</p> |
| </li><li class="listitem"> |
| <p>The model is loaded into the wrong component, |
| for example a tokenizer model is loaded with TokenNameFinderModel class.</p> |
| </li><li class="listitem"> |
| <p>The model content is not valid for some other reason</p> |
| </li></ul></div><p> |
| After the model is loaded the NameFinderME can be instantiated. |
| </p><pre class="programlisting"> |
| |
| NameFinderME nameFinder = <b class="hl-keyword">new</b> NameFinderME(model); |
| </pre><p> |
| The initialization is now finished and the Name Finder can be used. The NameFinderME |
| class is not thread safe; it must only be called from one thread. To use multiple threads, |
| multiple NameFinderME instances sharing the same model instance can be created. |
| The input text should be segmented into documents, sentences and tokens. |
| To perform entity detection an application calls the find method for every sentence in the |
| document. After every document clearAdaptiveData must be called to clear the adaptive data in |
| the feature generators. Not calling clearAdaptiveData can lead to a sharp drop in the detection |
| rate after a few documents. |
| The following code illustrates that: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">for</b> (String document[][] : documents) { |
| |
| <b class="hl-keyword">for</b> (String[] sentence : document) { |
| Span nameSpans[] = nameFinder.find(sentence); |
| <i class="hl-comment" style="color: silver">// do something with the names</i> |
| } |
| |
| nameFinder.clearAdaptiveData(); |
| } |
| </pre><p> |
| The following snippet shows a call to the find method: |
| </p><pre class="programlisting"> |
| |
| String sentence[] = <b class="hl-keyword">new</b> String[]{ |
| <b class="hl-string"><i style="color:red">"Pierre"</i></b>, |
| <b class="hl-string"><i style="color:red">"Vinken"</i></b>, |
| <b class="hl-string"><i style="color:red">"is"</i></b>, |
| <b class="hl-string"><i style="color:red">"61"</i></b>, |
| <b class="hl-string"><i style="color:red">"years"</i></b>, |
| <b class="hl-string"><i style="color:red">"old"</i></b>, |
| <b class="hl-string"><i style="color:red">"."</i></b> |
| }; |
| |
| Span nameSpans[] = nameFinder.find(sentence); |
| </pre><p> |
| The nameSpans array now contains exactly one Span which marks the name Pierre Vinken. |
| The elements between the begin and end offsets are the name tokens. In this case the begin |
| offset is 0 and the end offset is 2. The Span object also knows the type of the entity. |
| In this case it is person (defined by the model). It can be retrieved with a call to Span.getType(). |
| In addition to the statistical Name Finder, OpenNLP also offers a dictionary-based and a regular |
| expression name finder implementation. |
| </p> |
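<p>Because the Span holds token offsets with an exclusive end, the name text can be recovered by joining the covered tokens. A minimal plain-Java illustration of the offsets described above:</p>

```java
import java.util.Arrays;

public class NameSpanDemo {

    public static void main(String[] args) {
        String[] sentence = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
        // span as reported by the name finder: begin = 0, end = 2 (exclusive)
        int begin = 0;
        int end = 2;
        String name = String.join(" ", Arrays.copyOfRange(sentence, begin, end));
        System.out.println(name);
    }
}
```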
| <div class="section" title="Using an ONNX Model"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.api.onnx"></a>Using an ONNX Model</h4></div></div></div> |
| |
| <p> |
| Using an ONNX model is similar, except we will utilize the <code class="code">NameFinderDL</code> class instead. |
| You must provide the path to the model file and the vocabulary file to the name finder. |
| (There is no need to load the model as an InputStream as in the previous example.) The name finder |
| requires a tokenized list of strings as input. The output will be an array of spans. |
| </p><pre class="programlisting"> |
| |
| File model = <b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"/path/to/model.onnx"</i></b>); |
| File vocab = <b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"/path/to/vocab.txt"</i></b>); |
| Map<Integer, String> categories = <b class="hl-keyword">new</b> HashMap<>(); <i class="hl-comment" style="color: silver">// maps model output ids to labels, e.g. 0 -> "O", 1 -> "B-PER"</i> |
| String[] tokens = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"George"</i></b>, <b class="hl-string"><i style="color:red">"Washington"</i></b>, <b class="hl-string"><i style="color:red">"was"</i></b>, <b class="hl-string"><i style="color:red">"president"</i></b>, <b class="hl-string"><i style="color:red">"of"</i></b>, <b class="hl-string"><i style="color:red">"the"</i></b>, <b class="hl-string"><i style="color:red">"United"</i></b>, <b class="hl-string"><i style="color:red">"States"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b>}; |
| NameFinderDL nameFinderDL = <b class="hl-keyword">new</b> NameFinderDL(model, vocab, false, getIds2Labels()); |
| Span[] spans = nameFinderDL.find(tokens); |
| </pre><p> |
| For additional examples, refer to the <code class="code">NameFinderDLEval</code> class. |
| </p> |
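The getIds2Labels() helper referenced above is not shown in the snippet; the sketch below suggests what it might return. The actual index-to-label mapping depends entirely on how the ONNX model was trained, so the labels here are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class Ids2Labels {

    // Illustrative sketch of the getIds2Labels() helper used above:
    // maps the ONNX model's output indices to entity labels. The real
    // mapping must match the label set the model was trained with.
    public static Map<Integer, String> getIds2Labels() {
        Map<Integer, String> ids2Labels = new HashMap<>();
        ids2Labels.put(0, "O");     // outside any entity
        ids2Labels.put(1, "B-PER"); // beginning of a person name
        ids2Labels.put(2, "I-PER"); // continuation of a person name
        return ids2Labels;
    }
}
```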
| </div> |
| </div> |
| </div> |
| <div class="section" title="Name Finder Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.training"></a>Name Finder Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen">Custom Feature Generation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.training.featuregen.api">Feature Generation defined by API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen.xml">Feature Generation defined by XML Descriptor</a></span></dt></dl></dd></dl></div> |
| |
| <p> |
| A pre-trained model might not be available for the desired language, might not detect |
| important entity types, or might not perform well enough outside the news domain. |
| These are the typical reasons to train a custom name finder model on a new corpus, |
| or on a corpus extended with private training data taken from the data which should be analyzed. |
| </p> |
| |
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models available from the model |
| download page on various corpora. |
| </p> |
| <p> |
| Note that ONNX model support is not available through the command line tool. The models that can be trained |
| using the tool are OpenNLP models. ONNX models are trained through deep learning frameworks and then |
| utilized by OpenNLP. |
| </p> |
| <p> |
| The data can be converted to the OpenNLP name finder training format, which is one |
| sentence per line. Some other formats are available as well. |
| Each sentence must be tokenized and contain spans which mark the entities. Documents are separated by |
| empty lines, which trigger a reset of the adaptive feature generators. A training file can contain |
| multiple entity types; in that case the created model will also be able to |
| detect these multiple types. |
| </p> |
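To make the format concrete, this stdlib-only sketch parses one line of the training format into plain tokens plus (begin, end, type) name spans. It only approximates what OpenNLP's NameSampleDataStream does internally and is not the library code.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingFormatDemo {

    record ParsedName(int begin, int end, String type) {}

    // Parse one training line: <START:type> and <END> markers delimit names,
    // everything else is a token. Offsets follow the convention above
    // (begin inclusive, end exclusive).
    static List<ParsedName> parse(String line, List<String> tokensOut) {
        List<ParsedName> names = new ArrayList<>();
        int begin = -1;
        String type = null;
        for (String tok : line.trim().split("\\s+")) {
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                begin = tokensOut.size();
                type = tok.substring("<START:".length(), tok.length() - 1);
            } else if (tok.equals("<END>")) {
                names.add(new ParsedName(begin, tokensOut.size(), type));
            } else {
                tokensOut.add(tok);
            }
        }
        return names;
    }
}
```

Running it on the first sample sentence yields one span with begin 0, end 2, and type person, matching the offsets discussed earlier.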
| <p> |
| Sample sentences of the data: |
| </p><pre class="screen"> |
| |
| <START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group . |
| </pre><p> |
| The training data should contain at least 15000 sentences to create a model which performs well. |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer |
| Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] \ |
| [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-factory factoryName] \ |
| [-resources resourcesDir] [-type typeOverride] [-params paramsFile] -lang language \ |
| -model modelFile -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type typeOverride |
| Overrides the type parameter in the provided samples |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| Assume the English person name finder model should be trained from a file |
| called en-ner-person.train, which is encoded as UTF-8. The following command will train |
| the name finder and write the model to en-ner-person.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data en-ner-person.train -encoding UTF-8 |
| </pre><p> |
| The example above trains a model with a pre-defined feature set. It is also possible to use the -resources parameter to generate features based on external knowledge, such as word representation (clustering) features. The external resources must all be placed in a resources directory which is then passed as a parameter. If this option is used, it is also required to pass, via the -featuregen parameter, an XML custom feature generator which includes some of the clustering features shipped with the TokenNameFinder. Currently three formats of clustering lexicons are accepted: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>Space separated two column file specifying the token and the cluster class as generated by toolkits such as <a class="ulink" href="https://code.google.com/p/word2vec/" target="_top">word2vec</a>.</p> |
| </li><li class="listitem"> |
| <p>Space separated three column file specifying the token, the cluster class and a weight, such as <a class="ulink" href="https://github.com/ninjin/clark_pos_induction" target="_top">Clark's clusters</a>.</p> |
| </li><li class="listitem"> |
| <p>Tab separated three column Brown clusters as generated by <a class="ulink" href="https://github.com/percyliang/brown-cluster" target="_top"> |
| Liang's toolkit</a>.</p> |
| </li></ul></div><p> |
| Additionally it is possible to specify the number of iterations, |
| the cutoff, and to overwrite all types in the training data with a single type. Finally, the -sequenceCodec parameter can be used to specify a BIO (Begin, Inside, Out) or BILOU (Begin, Inside, Last, Out, Unit) encoding to represent the named entities. An example of such a command is as follows: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer -featuregen brown.xml -sequenceCodec BILOU -resources clusters/ \ |
| -params PerceptronTrainerParams.txt -lang en -model ner-test.bin -data en-train.opennlp -encoding UTF-8 |
| </pre><p> |
| </p> |
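To illustrate the difference between the two sequence codecs, the following sketch encodes a single name span as per-token outcomes. The exact outcome strings used internally by OpenNLP's BioCodec/BilouCodec may differ; the ones below are illustrative.

```java
import java.util.Arrays;

public class CodecDemo {

    // Encode one name span (begin inclusive, end exclusive) over nTokens
    // tokens as per-token outcomes. "O" marks tokens outside the name.
    // The outcome names are illustrative, not OpenNLP's internal ones.
    static String[] encode(int nTokens, int begin, int end, String type, boolean bilou) {
        String[] outcomes = new String[nTokens];
        Arrays.fill(outcomes, "O");
        if (bilou && end - begin == 1) {
            outcomes[begin] = type + "-unit"; // BILOU marks single-token names specially
        } else {
            outcomes[begin] = type + "-start";
            for (int i = begin + 1; i < end - 1; i++) {
                outcomes[i] = type + "-cont";
            }
            if (end - begin > 1) {
                outcomes[end - 1] = bilou ? type + "-last" : type + "-cont";
            }
        }
        return outcomes;
    }
}
```

The extra "unit" and "last" outcomes give the BILOU scheme a finer-grained view of name boundaries than BIO.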
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| To train the name finder from within an application, it is recommended to use the training |
| API instead of the command line tool. |
| Three steps are necessary to train it: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>The application must open a sample data stream</p> |
| </li><li class="listitem"> |
| <p>Call the NameFinderME.train method</p> |
| </li><li class="listitem"> |
| <p>Save the TokenNameFinderModel to a file</p> |
| </li></ul></div><p> |
| The three steps are illustrated by the following sample code: |
| </p><pre class="programlisting"> |
| |
| ObjectStream<String> lineStream = |
| <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-ner-person.train"</i></b>)), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| TokenNameFinderModel model; |
| |
| <b class="hl-keyword">try</b> (ObjectStream<NameSample> sampleStream = <b class="hl-keyword">new</b> NameSampleDataStream(lineStream)) { |
| model = NameFinderME.train(<b class="hl-string"><i style="color:red">"eng"</i></b>, <b class="hl-string"><i style="color:red">"person"</i></b>, sampleStream, TrainingParameters.defaultParams(), <b class="hl-keyword">new</b> TokenNameFinderFactory()); |
| } |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| } |
| </pre><p> |
| </p> |
| </div> |
| |
| <div class="section" title="Custom Feature Generation"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.training.featuregen"></a>Custom Feature Generation</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.training.featuregen.api">Feature Generation defined by API</a></span></dt><dt><span class="section"><a href="#tools.namefind.training.featuregen.xml">Feature Generation defined by XML Descriptor</a></span></dt></dl></div> |
| |
| <p> |
| OpenNLP defines a default feature generation which is used when no custom feature |
| generation is specified. Users who want to experiment with the feature generation |
| can provide a custom feature generator, either via the API or via an XML descriptor file. |
| </p> |
| <div class="section" title="Feature Generation defined by API"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.training.featuregen.api"></a>Feature Generation defined by API</h4></div></div></div> |
| |
| <p> |
| The same custom generator must be used for training |
| and for detecting the names. If the feature generation differs between training time and detection |
| time, the name finder might not be able to detect names. |
| The following lines show how to construct a custom feature generator: |
| </p><pre class="programlisting"> |
| |
| AdaptiveFeatureGenerator featureGenerator = <b class="hl-keyword">new</b> CachedFeatureGenerator( |
| <b class="hl-keyword">new</b> AdaptiveFeatureGenerator[]{ |
| <b class="hl-keyword">new</b> WindowFeatureGenerator(<b class="hl-keyword">new</b> TokenFeatureGenerator(), <span class="hl-number">2</span>, <span class="hl-number">2</span>), |
| <b class="hl-keyword">new</b> WindowFeatureGenerator(<b class="hl-keyword">new</b> TokenClassFeatureGenerator(true), <span class="hl-number">2</span>, <span class="hl-number">2</span>), |
| <b class="hl-keyword">new</b> OutcomePriorFeatureGenerator(), |
| <b class="hl-keyword">new</b> PreviousMapFeatureGenerator(), |
| <b class="hl-keyword">new</b> BigramNameFeatureGenerator(), |
| <b class="hl-keyword">new</b> SentenceFeatureGenerator(true, false), |
| <b class="hl-keyword">new</b> BrownTokenFeatureGenerator(dictResource) |
| }); |
| </pre><p> |
| which is similar to the default feature generator, but with a BrownTokenFeatureGenerator added |
| (dictResource is a BrownCluster resource). |
| The javadoc of the feature generator classes explains what the individual feature generators do. |
| To write a custom feature generator, implement the AdaptiveFeatureGenerator interface or, |
| if it does not need to be adaptive, extend the FeatureGeneratorAdapter. |
| The train method which should be used is defined as |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">public</b> <b class="hl-keyword">static</b> TokenNameFinderModel train(String languageCode, String type, |
| ObjectStream<NameSample> samples, TrainingParameters trainParams, |
| TokenNameFinderFactory factory) <b class="hl-keyword">throws</b> IOException |
| </pre><p> |
| where the TokenNameFinderFactory allows specifying a custom feature generator. |
| To detect names, the model returned from the train method must be passed to the NameFinderME constructor: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">new</b> NameFinderME(model); |
| </pre><p> |
| </p> |
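As a rough illustration of what a WindowFeatureGenerator(TokenFeatureGenerator, 2, 2) contributes per token position, the stdlib sketch below emits lower-cased token features for a window around one index. The feature string format is an assumption, not OpenNLP's exact one.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowFeatureDemo {

    // Emit one feature per token in a [index - prev, index + next] window,
    // clipped to the sentence bounds. The relative offset is encoded in
    // the feature name so the model can distinguish positions.
    static List<String> features(String[] tokens, int index, int prev, int next) {
        List<String> feats = new ArrayList<>();
        for (int i = Math.max(0, index - prev); i <= Math.min(tokens.length - 1, index + next); i++) {
            feats.add("w" + (i - index) + "=" + tokens[i].toLowerCase());
        }
        return feats;
    }
}
```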
| </div> |
| <div class="section" title="Feature Generation defined by XML Descriptor"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.training.featuregen.xml"></a>Feature Generation defined by XML Descriptor</h4></div></div></div> |
| |
| <p> |
| OpenNLP can also use an XML descriptor file to configure the feature generation. The descriptor |
| file is stored inside the model after training, and the feature generators are configured |
| correctly when the name finder is instantiated. |
| |
| The following sample shows an XML descriptor which contains the default feature generators plus several types of clustering features: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-tag" style="color: #000096"><featureGenerators</b> <span class="hl-attribute" style="color: #F5844C">cache</span>=<span class="hl-value" style="color: #993300">"true"</span> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"nameFinder"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"prevLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"nextLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.TokenClassFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">/></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"prevLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"nextLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.TokenFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">/></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.DefinitionFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">/></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.PreviousMapFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">/></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.BigramNameFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">/></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.SentenceFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><bool</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"begin"</span><b class="hl-tag" style="color: #000096">></b>true<b class="hl-tag" style="color: #000096"></bool></b> |
| <b class="hl-tag" style="color: #000096"><bool</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"end"</span><b class="hl-tag" style="color: #000096">></b>false<b class="hl-tag" style="color: #000096"></bool></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.WindowFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"prevLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><int</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"nextLength"</span><b class="hl-tag" style="color: #000096">></b>2<b class="hl-tag" style="color: #000096"></int></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.BrownClusterTokenClassFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><str</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"dict"</span><b class="hl-tag" style="color: #000096">></b>brownCluster<b class="hl-tag" style="color: #000096"></str></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.BrownClusterTokenFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><str</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"dict"</span><b class="hl-tag" style="color: #000096">></b>brownCluster<b class="hl-tag" style="color: #000096"></str></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.BrownClusterBigramFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><str</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"dict"</span><b class="hl-tag" style="color: #000096">></b>brownCluster<b class="hl-tag" style="color: #000096"></str></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.WordClusterFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><str</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"dict"</span><b class="hl-tag" style="color: #000096">></b>word2vec.cluster<b class="hl-tag" style="color: #000096"></str></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"><generator</b> <span class="hl-attribute" style="color: #F5844C">class</span>=<span class="hl-value" style="color: #993300">"opennlp.tools.util.featuregen.WordClusterFeatureGeneratorFactory"</span><b class="hl-tag" style="color: #000096">></b> |
| <b class="hl-tag" style="color: #000096"><str</b> <span class="hl-attribute" style="color: #F5844C">name</span>=<span class="hl-value" style="color: #993300">"dict"</span><b class="hl-tag" style="color: #000096">></b>clark.cluster<b class="hl-tag" style="color: #000096"></str></b> |
| <b class="hl-tag" style="color: #000096"></generator></b> |
| <b class="hl-tag" style="color: #000096"></featureGenerators></b> |
| </pre><p> |
| The root element must be featureGenerators; each sub-element adds a feature generator to the configuration. |
| The sample XML contains additional feature generators with respect to the API example defined above. |
| </p> |
| <p> |
| The following table shows the supported feature generators (the factory's fully qualified class name must be specified): |
| </p><div class="table"><a name="d4e361"></a><p class="title"><b>Table 5.1. Feature Generators</b></p><div class="table-contents"> |
| |
| <table summary="Feature Generators" border="1"><colgroup><col><col></colgroup><thead><tr><th>Feature Generator</th><th>Parameters</th></tr></thead><tbody><tr><td>CharacterNgramFeatureGeneratorFactory</td><td><span class="emphasis"><em>min</em></span> and <span class="emphasis"><em>max</em></span> specify the length of the generated character ngrams</td></tr><tr><td>DefinitionFeatureGeneratorFactory</td><td>none</td></tr><tr><td>DictionaryFeatureGeneratorFactory</td><td><span class="emphasis"><em>dict</em></span> is the key of the dictionary resource to use, |
| and <span class="emphasis"><em>prefix</em></span> is a feature prefix string</td></tr><tr><td>PreviousMapFeatureGeneratorFactory</td><td>none</td></tr><tr><td>SentenceFeatureGeneratorFactory</td><td><span class="emphasis"><em>begin</em></span> and <span class="emphasis"><em>end</em></span> to generate begin or end features, both are optional and are boolean values</td></tr><tr><td>TokenClassFeatureGeneratorFactory</td><td>none</td></tr><tr><td>TokenFeatureGeneratorFactory</td><td>none</td></tr><tr><td>BigramNameFeatureGeneratorFactory</td><td>none</td></tr><tr><td>TokenPatternFeatureGeneratorFactory</td><td>none</td></tr><tr><td>POSTaggerNameFeatureGeneratorFactory</td><td><span class="emphasis"><em>model</em></span> is the file name of the POS Tagger model to use</td></tr><tr><td>WordClusterFeatureGeneratorFactory</td><td><span class="emphasis"><em>dict</em></span> is the key of the clustering resource to use</td></tr><tr><td>BrownClusterTokenFeatureGeneratorFactory</td><td><span class="emphasis"><em>dict</em></span> is the key of the clustering resource to use</td></tr><tr><td>BrownClusterTokenClassFeatureGeneratorFactory</td><td><span class="emphasis"><em>dict</em></span> is the key of the clustering resource to use</td></tr><tr><td>BrownClusterBigramFeatureGeneratorFactory</td><td><span class="emphasis"><em>dict</em></span> is the key of the clustering resource to use</td></tr><tr><td>WindowFeatureGeneratorFactory</td><td><span class="emphasis"><em>prevLength</em></span> and <span class="emphasis"><em>nextLength</em></span> must be integers and specify the window size</td></tr></tbody></table> |
| </div></div><p><br class="table-break"> |
| The window feature generator can contain other generators. |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="section" title="Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.eval"></a>Evaluation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.eval.tool">Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.namefind.eval.api">Evaluation API</a></span></dt></dl></div> |
| |
| <p> |
| The built-in evaluation can measure the named entity recognition performance of the name finder. |
| The performance is measured either on a test dataset or via cross validation. |
| </p> |
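The scores reported by the evaluator follow the standard precision/recall/F-measure definitions. The small sketch below, using illustrative counts rather than real evaluation output, shows how they relate.

```java
public class FMeasureDemo {

    // precision = correct / predicted, recall = correct / reference,
    // F-measure = harmonic mean of precision and recall (standard definitions).
    static double precision(int correct, int predicted) { return (double) correct / predicted; }
    static double recall(int correct, int reference)    { return (double) correct / reference; }
    static double f1(double p, double r)                { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double p = precision(80, 100); // 80 of 100 predicted names were correct
        double r = recall(80, 120);    // 80 of 120 reference names were found
        System.out.printf("Precision: %.4f Recall: %.4f F-Measure: %.4f%n", p, r, f1(p, r));
    }
}
```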
| <div class="section" title="Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.eval.tool"></a>Evaluation Tool</h3></div></div></div> |
| |
| <p> |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderEvaluator -model en-ner-person.bin -data en-ner-person.test -encoding UTF-8 |
| |
| Precision: 0.8005071889818507 |
| Recall: 0.7450581122145297 |
| F-Measure: 0.7717879983140168 |
| </pre><p> |
| Note: The command line interface does not support cross validation in the current version. |
| </p> |
| </div> |
| <div class="section" title="Evaluation API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.namefind.eval.api"></a>Evaluation API</h3></div></div></div> |
| |
| <p> |
| The evaluation can be performed on a pre-trained model with a test dataset, or via cross validation. |
| In the first case the model must be loaded and a NameSample ObjectStream must be created (see the code samples above). |
| Assuming these two objects exist, the following code shows how to perform the evaluation: |
| </p><pre class="programlisting"> |
| |
| TokenNameFinderEvaluator evaluator = <b class="hl-keyword">new</b> TokenNameFinderEvaluator(<b class="hl-keyword">new</b> NameFinderME(model)); |
| evaluator.evaluate(sampleStream); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString()); |
| </pre><p> |
| In the cross validation case all the training arguments must be |
| provided (see the Training API section above). |
| To perform cross validation, the ObjectStream must be resettable: |
| </p><pre class="programlisting"> |
| |
| InputStreamFactory dataIn = <b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-ner-person.train"</i></b>)); |
| ObjectStream<NameSample> sampleStream = <b class="hl-keyword">new</b> NameSampleDataStream( |
| <b class="hl-keyword">new</b> PlainTextByLineStream(dataIn, StandardCharsets.UTF_<span class="hl-number">8</span>)); |
| TokenNameFinderCrossValidator evaluator = <b class="hl-keyword">new</b> TokenNameFinderCrossValidator(<b class="hl-string"><i style="color:red">"eng"</i></b>, |
| null, TrainingParameters.defaultParams(), null, (TokenNameFinderEvaluationMonitor) null); |
| evaluator.evaluate(sampleStream, <span class="hl-number">10</span>); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString()); |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Named Entity Annotation Guidelines"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.namefind.annotation_guides"></a>Named Entity Annotation Guidelines</h2></div></div></div> |
| |
| <p> |
| Annotation guidelines define what should be labeled as an entity. To build |
| a private corpus it is important to know these guidelines, and perhaps to write |
| custom ones. |
| Here is a list of publicly available annotation guidelines: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p> |
| <a class="ulink" href="http://cs.nyu.edu/cs/faculty/grishman/NEtask20.book_1.html" target="_top"> |
| MUC6 |
| </a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="ulink" href="http://acl.ldc.upenn.edu/muc7/ne_task.html" target="_top"> |
| MUC7 |
| </a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="ulink" href="https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf" target="_top"> |
| ACE |
| </a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="ulink" href="https://www.clips.uantwerpen.be/conll2002/ner/" target="_top"> |
| CONLL 2002 |
| </a> |
| </p> |
| </li><li class="listitem"> |
| <p> |
| <a class="ulink" href="https://www.clips.uantwerpen.be/conll2003/ner/" target="_top"> |
| CONLL 2003 |
| </a> |
| </p> |
| </li></ul></div><p> |
| </p> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 6. Document Categorizer"><div class="titlepage"><div><div><h2 class="title"><a name="tools.doccat"></a>Chapter 6. Document Categorizer</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.doccat.classifying">Classifying</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.classifying.cmdline">Document Categorizer Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.classifying.api">Document Categorizer API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.doccat.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.doccat.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.training.api">Training API</a></span></dt></dl></dd></dl></div> |
| |
| <div class="section" title="Classifying"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.doccat.classifying"></a>Classifying</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.doccat.classifying.cmdline">Document Categorizer Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.classifying.api">Document Categorizer API</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></dd></dl></div> |
| |
| <p> |
| The OpenNLP Document Categorizer can classify text into pre-defined categories. |
| It is based on the maximum entropy framework. For someone interested in Gross Margin, |
| the sample text given below could be classified as GMDecrease: |
| </p><pre class="screen"> |
| |
| Major acquisitions that have a lower gross margin than the existing network |
| also had a negative impact on the overall gross margin, but it should improve |
| following the implementation of its integration strategies. |
| </pre><p> |
| and the text below could be classified as GMIncrease |
| </p><pre class="screen"> |
| |
| The upward movement of gross margin resulted from amounts pursuant to |
| adjustments to obligations towards dealers. |
| </pre><p> |
| To be able to classify a text, the document categorizer needs a model. |
| The categories are requirements-specific, |
| and hence there is no pre-built document categorizer model available in the OpenNLP project. |
| </p> |
| |
| <div class="section" title="Document Categorizer Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.doccat.classifying.cmdline"></a>Document Categorizer Tool</h3></div></div></div> |
| |
| <p> |
| Note that ONNX model support is not available through the command line tool. The models that can be trained |
| using the tool are OpenNLP models. ONNX models are trained through deep learning frameworks and then |
| utilized by OpenNLP. |
| </p> |
| <p> |
| The easiest way to try out the document categorizer is the command line tool. The tool is only |
| intended for demonstration and testing. The following command shows how to use the document categorizer tool. |
| </p><pre class="screen"> |
| |
| $ opennlp Doccat model |
| </pre><p> |
| The input is read from standard input and the output is written to standard output, unless they are redirected |
| or piped. As with most components in OpenNLP, the document categorizer expects input which is segmented into sentences. |
| </p> |
| </div> |
| <div class="section" title="Document Categorizer API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.doccat.classifying.api"></a>Document Categorizer API</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.namefind.api.onnx">Using an ONNX Model</a></span></dt></dl></div> |
| |
| <p> |
| To perform classification you will need a maxent model - |
| these are encapsulated in the DoccatModel class of OpenNLP tools - or an ONNX model trained |
| for document classification. |
| </p> |
| <p> |
| Using an OpenNLP model, first you need to grab the bytes from the serialized model on an InputStream: |
| </p><pre class="programlisting"> |
| |
| InputStream is = ... |
| DoccatModel m = <b class="hl-keyword">new</b> DoccatModel(is); |
| </pre><p> |
| With the DoccatModel in hand we are just about there: |
| </p><pre class="programlisting"> |
| |
| String[] inputText = ... |
| DocumentCategorizerME myCategorizer = <b class="hl-keyword">new</b> DocumentCategorizerME(m); |
| <b class="hl-keyword">double</b>[] outcomes = myCategorizer.categorize(inputText); |
| String category = myCategorizer.getBestCategory(outcomes); |
| </pre><p> |
| </p> |
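Conceptually, getBestCategory simply selects the category whose entry in the outcomes array is largest. A minimal stand-alone sketch in plain Java (this is not the OpenNLP API; the category names and scores are invented for illustration):

```java
public class BestCategoryDemo {

    // Mirrors what getBestCategory does with the outcomes array:
    // return the label at the index of the maximum score.
    public static String best(String[] categories, double[] outcomes) {
        int bestIdx = 0;
        for (int i = 1; i < outcomes.length; i++) {
            if (outcomes[i] > outcomes[bestIdx]) {
                bestIdx = i;
            }
        }
        return categories[bestIdx];
    }

    public static void main(String[] args) {
        String[] cats = { "GMDecrease", "GMIncrease" };
        double[] outcomes = { 0.28, 0.72 };
        System.out.println(best(cats, outcomes)); // GMIncrease
    }
}
```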
| <div class="section" title="Using an ONNX Model"><div class="titlepage"><div><div><h4 class="title"><a name="tools.namefind.api.onnx"></a>Using an ONNX Model</h4></div></div></div> |
| |
| <p> |
| Using an ONNX model is similar, except we will utilize the <code class="code">DocumentCategorizerDL</code> class instead. |
| You must provide the path to the model file and the vocabulary file to the document categorizer. |
| (There is no need to load the model as an InputStream as in the previous example.) |
| </p><pre class="programlisting"> |
| |
| File model = <b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"/path/to/model.onnx"</i></b>); |
| File vocab = <b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"/path/to/vocab.txt"</i></b>); |
| Map&lt;Integer, String&gt; categories = <b class="hl-keyword">new</b> HashMap&lt;&gt;(); |
| String[] inputText = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"My input text is great."</i></b>}; |
| <b class="hl-keyword">final</b> DocumentCategorizerDL myCategorizer = <b class="hl-keyword">new</b> DocumentCategorizerDL(model, vocab, categories); |
| <b class="hl-keyword">double</b>[] outcomes = myCategorizer.categorize(inputText); |
| String category = myCategorizer.getBestCategory(outcomes); |
| </pre><p> |
| For additional examples, refer to the <code class="code">DocumentCategorizerDLEval</code> class. |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="section" title="Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.doccat.training"></a>Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.doccat.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.doccat.training.api">Training API</a></span></dt></dl></div> |
| |
| <p> |
| The Document Categorizer can be trained on annotated training material. The data |
| can be in the OpenNLP Document Categorizer training format. This is one document per line, |
| containing the category and the text separated by whitespace. Other formats may also be |
| available. |
| The following sample shows the sample from above in the required format. Here GMDecrease and GMIncrease |
| are the categories. |
| </p><pre class="screen"> |
| |
| GMDecrease Major acquisitions that have a lower gross margin than the existing network also \ |
| had a negative impact on the overall gross margin, but it should improve following \ |
| the implementation of its integration strategies . |
| GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \ |
| to obligations towards dealers . |
| </pre><p> |
| Note: The line breaks marked with a backslash are just inserted for formatting purposes and must not be |
| included in the training data. |
| </p> |
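The one-document-per-line format above can be consumed by splitting each line at the first whitespace character. A minimal stand-alone sketch in plain Java (not part of the OpenNLP API):

```java
public class DoccatLineParser {

    // Splits one training line into { category, text }.
    public static String[] parse(String line) {
        int firstSpace = line.indexOf(' ');
        return new String[] {
            line.substring(0, firstSpace),  // category, e.g. GMIncrease
            line.substring(firstSpace + 1)  // the document text
        };
    }

    public static void main(String[] args) {
        String[] parts = parse("GMIncrease The upward movement of gross margin ...");
        System.out.println(parts[0]); // GMIncrease
    }
}
```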
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.doccat.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| The following command will train the document categorizer and write the model to en-doccat.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8 |
| </pre><p> |
| Additionally, it is possible to specify the number of iterations and the cutoff. |
| </p> |
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.doccat.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| To train your model you will naturally need access to many pre-classified documents. |
| The class opennlp.tools.doccat.DocumentSample encapsulates a text document and its classification. |
| DocumentSample has two constructors. Each takes the text's category as one argument. The other argument can either be raw |
| text, or an array of tokens. By default, raw text is split into tokens by whitespace. So, let's say |
| your training data was contained in a text file, where the format is as described above. |
| Then you might want to write something like this to create a collection of DocumentSamples: |
| </p><pre class="programlisting"> |
| |
| DoccatModel model = null; |
| <b class="hl-keyword">try</b> { |
| ObjectStream&lt;String&gt; lineStream = |
| <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-sentiment.train"</i></b>)), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| ObjectStream&lt;DocumentSample&gt; sampleStream = <b class="hl-keyword">new</b> DocumentSampleStream(lineStream); |
| |
| model = DocumentCategorizerME.train(<b class="hl-string"><i style="color:red">"eng"</i></b>, sampleStream, |
| TrainingParameters.defaultParams(), <b class="hl-keyword">new</b> DoccatFactory()); |
| } <b class="hl-keyword">catch</b> (IOException e) { |
| e.printStackTrace(); |
| } |
| |
| </pre><p> |
| Now might be a good time to cruise over to Hulu or something, because this could take a while if you've got a large training set, |
| and you may see a lot of output. Once training is done you can move straight on to classification, |
| but first we'll cover serialization. Feel free to skim. |
| </p> |
| <p> |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| } |
| |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 7. Part-of-Speech Tagger"><div class="titlepage"><div><div><h2 class="title"><a name="tools.postagger"></a>Chapter 7. Part-of-Speech Tagger</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.postagger.tagging">Tagging</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.tagging.cmdline">POS Tagger Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.tagging.api">POS Tagger API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.training">Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.tagdict">Tag Dictionary</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.postagger.eval">Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.postagger.eval.tool">Evaluation Tool</a></span></dt></dl></dd></dl></div> |
| |
| <div class="section" title="Tagging"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.postagger.tagging"></a>Tagging</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.postagger.tagging.cmdline">POS Tagger Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.tagging.api">POS Tagger API</a></span></dt></dl></div> |
| |
| <p> |
| The Part of Speech Tagger marks tokens with their corresponding word type |
| based on the token itself and the context of the token. A token might have |
| multiple POS tags depending on the token and the context. The OpenNLP POS Tagger |
| uses a probability model to predict the correct POS tag out of the tag set. |
| To limit the possible tags for a token, a tag dictionary can be used, which improves |
| both the tagging accuracy and the runtime performance of the tagger. |
| </p> |
| <div class="section" title="POS Tagger Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.tagging.cmdline"></a>POS Tagger Tool</h3></div></div></div> |
| |
| <p> |
| The easiest way to try out the POS Tagger is the command line tool. The tool is |
| only intended for demonstration and testing. |
| Download the English maxent pos model and start the POS Tagger Tool with this command: |
| </p><pre class="screen"> |
| |
| $ opennlp POSTagger en-pos-maxent.bin |
| </pre><p> |
| The POS Tagger now reads a tokenized sentence per line from stdin. |
| Copy these two sentences to the console: |
| </p><pre class="screen"> |
| |
| Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group . |
| </pre><p> |
| The POS Tagger will now echo the sentences with pos tags to the console: |
| </p><pre class="screen"> |
| |
| Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS old_JJ ,_, will_MD join_VB the_DT board_NN as_IN |
| a_DT nonexecutive_JJ director_NN Nov._NNP 29_CD ._. |
| Mr._NNP Vinken_NNP is_VBZ chairman_NN of_IN Elsevier_NNP N.V._NNP ,_, the_DT Dutch_NNP publishing_VBG group_NN ._. |
| </pre><p> |
| The tag set used by the English pos model is the <a class="ulink" href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="_top">Penn Treebank tag set</a>. |
| </p> |
| </div> |
| |
| <div class="section" title="POS Tagger API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.tagging.api"></a>POS Tagger API</h3></div></div></div> |
| |
| <p> |
| The POS Tagger can be embedded into an application via its API. |
| First the pos model must be loaded into memory from disk or another source. |
| In the sample below it is loaded from disk. |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-pos-maxent.bin"</i></b>)) { |
| POSModel model = <b class="hl-keyword">new</b> POSModel(modelIn); |
| } |
| </pre><p> |
| After the model is loaded the POSTaggerME can be instantiated. |
| </p><pre class="programlisting"> |
| |
| POSTaggerME tagger = <b class="hl-keyword">new</b> POSTaggerME(model); |
| </pre><p> |
| The POS Tagger instance is now ready to tag data. It expects a tokenized sentence |
| as input, represented as a String array in which each String object |
| is one token. |
| </p> |
| <p> |
| The following code shows how to determine the most likely pos tag sequence for a sentence. |
| </p><pre class="programlisting"> |
| |
| String sent[] = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"Most"</i></b>, <b class="hl-string"><i style="color:red">"large"</i></b>, <b class="hl-string"><i style="color:red">"cities"</i></b>, <b class="hl-string"><i style="color:red">"in"</i></b>, <b class="hl-string"><i style="color:red">"the"</i></b>, <b class="hl-string"><i style="color:red">"US"</i></b>, <b class="hl-string"><i style="color:red">"had"</i></b>, |
| <b class="hl-string"><i style="color:red">"morning"</i></b>, <b class="hl-string"><i style="color:red">"and"</i></b>, <b class="hl-string"><i style="color:red">"afternoon"</i></b>, <b class="hl-string"><i style="color:red">"newspapers"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b>}; |
| String tags[] = tagger.tag(sent); |
| </pre><p> |
| The tags array contains one part-of-speech tag for each token in the input array. The corresponding |
| tag can be found at the same index as the token in the input array. |
| The confidence scores for the returned tags can be easily retrieved from |
| a POSTaggerME with the following method call: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">double</b> probs[] = tagger.probs(); |
| </pre><p> |
| The call to probs is stateful and will always return the probabilities of the last |
| tagged sentence. The probs method should only be called after the tag method |
| has been called; otherwise, the behavior is undefined. |
| </p> |
| <p> |
| Some applications need to retrieve the n-best pos tag sequences and not |
| only the best sequence. |
| The topKSequences method is capable of returning the top sequences. |
| It can be called in a similar way as tag. |
| </p><pre class="programlisting"> |
| |
| Sequence topSequences[] = tagger.topKSequences(sent); |
| </pre><p> |
| Each Sequence object contains one sequence. The sequence can be retrieved |
| via Sequence.getOutcomes(), which returns the tags array, |
| while Sequence.getProbs() returns the probability array for this sequence. |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.postagger.training"></a>Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.postagger.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.api">Training API</a></span></dt><dt><span class="section"><a href="#tools.postagger.training.tagdict">Tag Dictionary</a></span></dt></dl></div> |
| |
| <p> |
| The POS Tagger can be trained on annotated training material. The training material |
| is a collection of tokenized sentences where each token has the assigned part-of-speech tag. |
| The native POS Tagger training material looks like this: |
| </p><pre class="screen"> |
| |
| About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._. |
| That_DT sounds_VBZ good_JJ ._. |
| </pre><p> |
| Each sentence must be on one line. The token/tag pairs are combined with "_" |
| and separated from each other by whitespace. The data format does not |
| define a document boundary. If a document boundary should be included in the |
| training material, it is suggested to use an empty line. |
| </p> |
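The token/tag pairs in this format can be pulled apart by splitting each line on whitespace and each pair at its last underscore (the last one, since a token may itself contain underscores). A small stand-alone sketch in plain Java, not part of the OpenNLP reader:

```java
public class PosTrainingLine {

    // Returns { token, tag } for one pair such as "sounds_VBZ".
    public static String[] splitPair(String pair) {
        int sep = pair.lastIndexOf('_');
        return new String[] { pair.substring(0, sep), pair.substring(sep + 1) };
    }

    public static void main(String[] args) {
        for (String pair : "That_DT sounds_VBZ good_JJ ._.".split("\\s+")) {
            String[] tokenAndTag = splitPair(pair);
            System.out.println(tokenAndTag[0] + "\t" + tokenAndTag[1]);
        }
    }
}
```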
| <p>The Part-of-Speech Tagger can either be trained with a command line tool, |
| or via a training API. |
| </p> |
| |
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models available from the model |
| download page on various corpora. |
| </p> |
| <p> |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp POSTaggerTrainer |
| Usage: opennlp POSTaggerTrainer[.conllx] [-type maxent|perceptron|perceptron_sequence] \ |
| [-dict dictionaryPath] [-ngram cutoff] [-params paramsFile] [-iterations num] \ |
| [-cutoff num] -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -type maxent|perceptron|perceptron_sequence |
| The type of the token name finder model. One of maxent|perceptron|perceptron_sequence. |
| -dict dictionaryPath |
| The XML tag dictionary file |
| -ngram cutoff |
| NGram cutoff. If not specified will not create ngram dictionary. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| </p> |
| <p> |
| The following command illustrates how an English part-of-speech model can be trained: |
| </p><pre class="screen"> |
| |
| $ opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \ |
| -lang en -data en-pos.train -encoding UTF-8 |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Part-of-Speech Tagger training API supports the training of a new pos model. |
| Basically three steps are necessary to train it: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>The application must open a sample data stream</p> |
| </li><li class="listitem"> |
| <p>Call the POSTagger.train method</p> |
| </li><li class="listitem"> |
| <p>Save the POSModel to a file</p> |
| </li></ul></div><p> |
| The following code illustrates that: |
| </p><pre class="programlisting"> |
| |
| POSModel model = null; |
| |
| <b class="hl-keyword">try</b> { |
| ObjectStream&lt;String&gt; lineStream = <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-pos.train"</i></b>)), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| ObjectStream&lt;POSSample&gt; sampleStream = <b class="hl-keyword">new</b> WordTagSampleStream(lineStream); |
| |
| model = POSTaggerME.train(<b class="hl-string"><i style="color:red">"eng"</i></b>, sampleStream, TrainingParameters.defaultParams(), <b class="hl-keyword">new</b> POSTaggerFactory()); |
| } <b class="hl-keyword">catch</b> (IOException e) { |
| e.printStackTrace(); |
| } |
| </pre><p> |
| The above code performs the first two steps, opening the data and training |
| the model. The trained model must still be saved into an OutputStream, in |
| the sample below it is written into a file. |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))){ |
| model.serialize(modelOut); |
| } |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Tag Dictionary"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.training.tagdict"></a>Tag Dictionary</h3></div></div></div> |
| |
| <p> |
| The tag dictionary is a word dictionary which specifies which tags a specific token can have. Using a tag |
| dictionary has two advantages: inappropriate tags cannot be assigned to tokens in the dictionary, and the |
| beam search algorithm has to consider fewer possibilities and can search faster. |
| </p> |
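The effect of a tag dictionary can be illustrated with a toy lookup: tokens present in the dictionary are restricted to their listed tags, while unknown tokens keep the full tag set as candidates. This is only a conceptual sketch in plain Java; the dictionary entries are invented and this is not the POSDictionary API:

```java
import java.util.Map;
import java.util.Set;

public class TagDictSketch {

    // Invented toy dictionary: word -> tags the word is allowed to take.
    static final Map<String, Set<String>> DICT =
        Map.of("show", Set.of("NN", "VB", "VBP"));

    // Unknown words are unrestricted; known words keep only their listed tags.
    public static Set<String> allowedTags(String token, Set<String> fullTagSet) {
        return DICT.getOrDefault(token, fullTagSet);
    }

    public static void main(String[] args) {
        Set<String> fullTagSet = Set.of("NN", "VB", "VBP", "JJ", "RB", "DT");
        System.out.println(allowedTags("show", fullTagSet).size());   // 3 candidates
        System.out.println(allowedTags("purple", fullTagSet).size()); // 6 candidates
    }
}
```

Fewer candidate tags per token means a smaller beam-search space, which is where the speed-up comes from.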
| <p> |
| The dictionary is defined in an XML format and can be created and stored with the POSDictionary class. |
| For now, please check out the Javadoc and source code of that class. |
| </p> |
| <p>Note: The format should be documented and sample code should show how to use the dictionary. |
| Any contributions are very welcome. If you want to contribute please contact us on the mailing list |
| or comment on the jira issue <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-287" target="_top">OPENNLP-287</a>. |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.postagger.eval"></a>Evaluation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.postagger.eval.tool">Evaluation Tool</a></span></dt></dl></div> |
| |
| <p> |
| The built-in evaluation can measure the accuracy of the pos tagger. |
| The accuracy can be measured on a test data set or via cross validation. |
| </p> |
| <div class="section" title="Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.postagger.eval.tool"></a>Evaluation Tool</h3></div></div></div> |
| |
| <p> |
| There is a command line tool to evaluate a given model on a test data set. |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp POSTaggerEvaluator -model pt.postagger.bin -data pt.postagger.test -encoding utf-8 |
| </pre><p> |
| This will display the resulting accuracy score, e.g.: |
| </p><pre class="screen"> |
| |
| Loading model ... done |
| Evaluating ... done |
| |
| Accuracy: 0.9659110277825124 |
| </pre><p> |
| </p> |
| <p> |
| There is a command line tool for cross-validation of the test data set. |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp POSTaggerCrossValidator -lang pt -data pt.postagger.test -encoding utf-8 |
| </pre><p> |
| This will display the resulting accuracy score, e.g.: |
| </p><pre class="screen"> |
| |
| Accuracy: 0.9659110277825124 |
| </pre><p> |
| </p> |
| |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 8. Lemmatizer"><div class="titlepage"><div><div><h2 class="title"><a name="tools.lemmatizer"></a>Chapter 8. Lemmatizer</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.lemmatizer.tagging.cmdline">Lemmatizer Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.tagging.api">Lemmatizer API</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training">Lemmatizer Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.lemmatizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.lemmatizer.evaluation">Lemmatizer Evaluation</a></span></dt></dl></div> |
| |
| <p> |
| The lemmatizer returns, for a given word form (token) and part-of-speech tag, |
| the dictionary form of a word, which is usually referred to as its lemma. |
| A token could ambiguously be derived from several basic forms or dictionary words, |
| which is why the POS tag of the word is required to find the lemma. For example, |
| the form "show" may refer to either the verb "to show" or to the noun "show". |
| Currently OpenNLP implements statistical and dictionary-based lemmatizers. |
| </p> |
| <div class="section" title="Lemmatizer Tool"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.lemmatizer.tagging.cmdline"></a>Lemmatizer Tool</h2></div></div></div> |
| |
| <p> |
| The easiest way to try out the Lemmatizer is the command line tool, |
| which provides access to the statistical |
| lemmatizer. Note that the tool is only intended for demonstration and testing. |
| </p> |
| <p> |
| Once you have trained a lemmatizer model (see below for instructions), |
| you can start the Lemmatizer Tool with this command: |
| </p> |
| <p> |
| </p><pre class="screen"> |
| |
| $ opennlp LemmatizerME en-lemmatizer.bin < sentences |
| </pre><p> |
| The Lemmatizer now reads one POS-tagged sentence per line from |
| standard input. For example, you can copy this sentence to the |
| console: |
| </p><pre class="screen"> |
| |
| Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP |
| signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN |
| Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS |
| 747_CD jetliners_NNS ._. |
| </pre><p> |
| The Lemmatizer will now echo the lemma for each word and POS tag pair to |
| the console: |
| </p><pre class="screen"> |
| |
| Rockwell NNP rockwell |
| International NNP international |
| Corp. NNP corp. |
| 's POS 's |
| Tulsa NNP tulsa |
| unit NN unit |
| said VBD say |
| it PRP it |
| signed VBD sign |
| ... |
| |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Lemmatizer API"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.lemmatizer.tagging.api"></a>Lemmatizer API</h2></div></div></div> |
| |
| <p> |
| The Lemmatizer can be embedded into an application via its API. |
| Currently a statistical lemmatizer and a DictionaryLemmatizer are available. |
| Note that these two methods are complementary, and the DictionaryLemmatizer |
| can also be used as a way of post-processing the output of the statistical |
| lemmatizer. |
| </p> |
| <p> |
| The statistical lemmatizer requires that a trained model is loaded |
| into memory from disk or from another source. |
| In the example below it is loaded from disk: |
| </p><pre class="programlisting"> |
| |
| LemmatizerModel model = null; |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-lemmatizer.bin"</i></b>)) { |
| model = <b class="hl-keyword">new</b> LemmatizerModel(modelIn); |
| } |
| |
| </pre><p> |
| After the model is loaded a LemmatizerME can be instantiated. |
| </p><pre class="programlisting"> |
| |
| LemmatizerME lemmatizer = <b class="hl-keyword">new</b> LemmatizerME(model); |
| </pre><p> |
| The Lemmatizer instance is now ready to lemmatize data. It expects a |
| tokenized sentence as input, represented as a String array in which each |
| String object is one token, together with the POS tags associated with |
| each token. |
| </p> |
| <p> |
| The following code shows how to determine the most likely lemma for |
| a sentence. |
| </p><pre class="programlisting"> |
| |
| String[] tokens = <b class="hl-keyword">new</b> String[] { <b class="hl-string"><i style="color:red">"Rockwell"</i></b>, <b class="hl-string"><i style="color:red">"International"</i></b>, <b class="hl-string"><i style="color:red">"Corp."</i></b>, <b class="hl-string"><i style="color:red">"'s"</i></b>, |
| <b class="hl-string"><i style="color:red">"Tulsa"</i></b>, <b class="hl-string"><i style="color:red">"unit"</i></b>, <b class="hl-string"><i style="color:red">"said"</i></b>, <b class="hl-string"><i style="color:red">"it"</i></b>, <b class="hl-string"><i style="color:red">"signed"</i></b>, <b class="hl-string"><i style="color:red">"a"</i></b>, <b class="hl-string"><i style="color:red">"tentative"</i></b>, <b class="hl-string"><i style="color:red">"agreement"</i></b>, |
| <b class="hl-string"><i style="color:red">"extending"</i></b>, <b class="hl-string"><i style="color:red">"its"</i></b>, <b class="hl-string"><i style="color:red">"contract"</i></b>, <b class="hl-string"><i style="color:red">"with"</i></b>, <b class="hl-string"><i style="color:red">"Boeing"</i></b>, <b class="hl-string"><i style="color:red">"Co."</i></b>, <b class="hl-string"><i style="color:red">"to"</i></b>, |
| <b class="hl-string"><i style="color:red">"provide"</i></b>, <b class="hl-string"><i style="color:red">"structural"</i></b>, <b class="hl-string"><i style="color:red">"parts"</i></b>, <b class="hl-string"><i style="color:red">"for"</i></b>, <b class="hl-string"><i style="color:red">"Boeing"</i></b>, <b class="hl-string"><i style="color:red">"'s"</i></b>, <b class="hl-string"><i style="color:red">"747"</i></b>, |
| <b class="hl-string"><i style="color:red">"jetliners"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b> }; |
| |
| String[] postags = <b class="hl-keyword">new</b> String[] { <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"POS"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, |
| <b class="hl-string"><i style="color:red">"VBD"</i></b>, <b class="hl-string"><i style="color:red">"PRP"</i></b>, <b class="hl-string"><i style="color:red">"VBD"</i></b>, <b class="hl-string"><i style="color:red">"DT"</i></b>, <b class="hl-string"><i style="color:red">"JJ"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, <b class="hl-string"><i style="color:red">"VBG"</i></b>, <b class="hl-string"><i style="color:red">"PRP$"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, <b class="hl-string"><i style="color:red">"IN"</i></b>, |
| <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"TO"</i></b>, <b class="hl-string"><i style="color:red">"VB"</i></b>, <b class="hl-string"><i style="color:red">"JJ"</i></b>, <b class="hl-string"><i style="color:red">"NNS"</i></b>, <b class="hl-string"><i style="color:red">"IN"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"POS"</i></b>, <b class="hl-string"><i style="color:red">"CD"</i></b>, <b class="hl-string"><i style="color:red">"NNS"</i></b>, |
| <b class="hl-string"><i style="color:red">"."</i></b> }; |
| |
| String[] lemmas = lemmatizer.lemmatize(tokens, postags); |
| </pre><p> |
| The lemmas array contains one lemma for each token in the |
| input array. The corresponding lemma can be found at the same index |
| as the token in the input array. |
| </p> |
| |
| <p> |
| The DictionaryLemmatizer is constructed |
| by passing the InputStream of a lemmatizer dictionary. Such a dictionary |
| is a text file containing, in each row, a word, its POS tag and the |
| corresponding lemma, with the columns separated by a tab character. |
| </p><pre class="screen"> |
| |
| show NN show |
| showcase NN showcase |
| showcases NNS showcase |
| showdown NN showdown |
| showdowns NNS showdown |
| shower NN shower |
| showers NNS shower |
| showman NN showman |
| showmanship NN showmanship |
| showmen NNS showman |
| showroom NN showroom |
| showrooms NNS showroom |
| shows NNS show |
| shrapnel NN shrapnel |
| |
| </pre><p> |
| Alternatively, if a (word, postag) pair can have multiple lemmas, the |
| lemmatizer dictionary consists of a text file containing, in each row, |
| a word, its POS tag and the corresponding lemmas separated by "#": |
| </p><pre class="screen"> |
| |
| muestras NN muestra |
| cantaba V cantar |
| fue V ir#ser |
| entramos V entrar |
| |
| </pre><p> |
| First the dictionary must be loaded into memory from disk or another |
| source. |
| In the sample below it is loaded from disk. |
| </p><pre class="programlisting"> |
| |
| InputStream dictLemmatizer = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"english-lemmatizer.txt"</i></b>); |
| |
| </pre><p> |
| After the dictionary is loaded the DictionaryLemmatizer can be |
| instantiated. |
| </p><pre class="programlisting"> |
| |
| DictionaryLemmatizer lemmatizer = <b class="hl-keyword">new</b> DictionaryLemmatizer(dictLemmatizer); |
| </pre><p> |
| The DictionaryLemmatizer instance is now ready. It expects two |
| String arrays as input: one containing the tokens and another one |
| containing their respective POS tags. |
| </p> |
| <p> |
| The following code shows how to find a lemma using a |
| DictionaryLemmatizer. |
| </p><pre class="programlisting"> |
| |
| String[] tokens = <b class="hl-keyword">new</b> String[]{<b class="hl-string"><i style="color:red">"Most"</i></b>, <b class="hl-string"><i style="color:red">"large"</i></b>, <b class="hl-string"><i style="color:red">"cities"</i></b>, <b class="hl-string"><i style="color:red">"in"</i></b>, <b class="hl-string"><i style="color:red">"the"</i></b>, <b class="hl-string"><i style="color:red">"US"</i></b>, <b class="hl-string"><i style="color:red">"had"</i></b>, |
| <b class="hl-string"><i style="color:red">"morning"</i></b>, <b class="hl-string"><i style="color:red">"and"</i></b>, <b class="hl-string"><i style="color:red">"afternoon"</i></b>, <b class="hl-string"><i style="color:red">"newspapers"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b>}; |
| String[] tags = tagger.tag(tokens); |
| String[] lemmas = lemmatizer.lemmatize(tokens, tags); |
| |
| </pre><p> |
| The tags array contains one part-of-speech tag for each token in the |
| input array. The corresponding |
| tag and lemma can be found at the same index as the token has in the |
| input array. |
| </p> |
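| <p> |
| The lookup performed by the dictionary lemmatizer can be pictured as a map |
| keyed by the (word, postag) pair whose value holds one or more lemmas |
| separated by "#". The following self-contained sketch illustrates the idea; |
| it is not the OpenNLP implementation, and the class and method names are |
| made up for illustration. |
| </p> |

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a (word, postag) -> lemmas lookup. NOT the OpenNLP code;
// all names here are illustrative only.
public class LemmaDictionarySketch {

    private final Map<String, String> entries = new HashMap<>();

    // One dictionary row: a word, its postag and the "#"-separated lemmas.
    public void addEntry(String word, String postag, String lemmas) {
        entries.put(word + "\u0000" + postag, lemmas);
    }

    // Returns all lemmas for the pair, or the word itself if it is unknown.
    public String[] lemmatize(String word, String postag) {
        String lemmas = entries.get(word + "\u0000" + postag);
        if (lemmas == null) {
            return new String[] { word };
        }
        return lemmas.split("#");
    }

    public static void main(String[] args) {
        LemmaDictionarySketch dict = new LemmaDictionarySketch();
        dict.addEntry("shows", "NNS", "show");
        dict.addEntry("fue", "V", "ir#ser");
        System.out.println(dict.lemmatize("shows", "NNS")[0]); // show
        System.out.println(dict.lemmatize("fue", "V").length); // 2
    }
}
```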
| </div> |
| <div class="section" title="Lemmatizer Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.lemmatizer.training"></a>Lemmatizer Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.lemmatizer.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.lemmatizer.training.api">Training API</a></span></dt></dl></div> |
| |
| <p> |
| The training data consist of three columns separated by spaces. Each |
| word has been put on a |
| separate line and there is an empty line after each sentence. The first |
| column contains |
| the current word, the second its part-of-speech tag and the third its |
| lemma. |
| Here is an example of the file format: |
| </p> |
| <p> |
| Sample sentence of the training data: |
| </p><pre class="screen"> |
| |
| He PRP he |
| reckons VBZ reckon |
| the DT the |
| current JJ current |
| accounts NNS account |
| deficit NN deficit |
| will MD will |
| narrow VB narrow |
| to TO to |
| only RB only |
| # # # |
| 1.8 CD 1.8 |
| millions CD million |
| in IN in |
| September NNP september |
| . . O |
| </pre><p> |
| The Universal Dependencies Treebank and the CoNLL 2009 datasets |
| distribute training data for many languages. |
| </p> |
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.lemmatizer.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models on |
| various corpora. |
| </p> |
| <p> |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp LemmatizerTrainerME |
| Usage: opennlp LemmatizerTrainerME [-factory factoryName] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -factory factoryName |
| A sub-class of LemmatizerFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| </pre><p> |
| It is now assumed that the English lemmatizer model should be trained |
| from a file called |
| en-lemmatizer.train which is encoded as UTF-8. The following command will train the |
| lemmatizer and write the model to en-lemmatizer.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp LemmatizerTrainerME -model en-lemmatizer.bin -params PerceptronTrainerParams.txt -lang en -data en-lemmatizer.train -encoding UTF-8 |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.lemmatizer.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Lemmatizer offers an API to train a new lemmatizer model. First |
| the training parameters need to be loaded from a file, or created with |
| default values if none is given: |
| </p><pre class="programlisting"> |
| |
| TrainingParameters mlParams = CmdLineUtil.loadTrainingParameters(params.getParams(), false); |
| <b class="hl-keyword">if</b> (mlParams == null) { |
| mlParams = ModelUtil.createDefaultTrainingParameters(); |
| } |
| </pre><p> |
| Then we read the training data: |
| </p><pre class="programlisting"> |
| |
| InputStreamFactory inputStreamFactory = null; |
| <b class="hl-keyword">try</b> { |
| inputStreamFactory = <b class="hl-keyword">new</b> MarkableFileInputStreamFactory( |
| <b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-lemmatizer.train"</i></b>)); |
| } <b class="hl-keyword">catch</b> (FileNotFoundException e) { |
| e.printStackTrace(); |
| } |
| ObjectStream<String> lineStream = null; |
| LemmaSampleStream lemmaStream = null; |
| <b class="hl-keyword">try</b> { |
| lineStream = <b class="hl-keyword">new</b> PlainTextByLineStream( |
| (inputStreamFactory), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| lemmaStream = <b class="hl-keyword">new</b> LemmaSampleStream(lineStream); |
| } <b class="hl-keyword">catch</b> (IOException e) { |
| CmdLineUtil.handleCreateObjectStreamError(e); |
| } |
| |
| </pre><p> |
| The following step proceeds to train the model: |
| </p><pre class="programlisting"> |
| LemmatizerModel model; |
| try { |
| LemmatizerFactory lemmatizerFactory = LemmatizerFactory |
| .create(params.getFactory()); |
| model = LemmatizerME.train(params.getLang(), lemmaStream, mlParams, |
| lemmatizerFactory); |
| } catch (IOException e) { |
| throw new TerminateToolException(-1, |
| "IO error while reading training data or indexing data: " |
| + e.getMessage(), |
| e); |
| } finally { |
| try { |
| lemmaStream.close(); |
| } catch (IOException e) { |
| } |
| } |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Lemmatizer Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.lemmatizer.evaluation"></a>Lemmatizer Evaluation</h2></div></div></div> |
| |
| <p> |
| The built in evaluation can measure the accuracy of the statistical |
| lemmatizer. |
| The accuracy can be measured on a test data set. |
| </p> |
| <p> |
| There is a command line tool to evaluate a given model on a test |
| data set. |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp LemmatizerEvaluator -model en-lemmatizer.bin -data en-lemmatizer.test -encoding utf-8 |
| </pre><p> |
| This will display the resulting accuracy score, e.g.: |
| </p><pre class="screen"> |
| |
| Loading model ... done |
| Evaluating ... done |
| |
| Accuracy: 0.9659110277825124 |
| </pre><p> |
| </p> |
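| <p> |
| The accuracy score reported by the evaluator is the fraction of tokens |
| whose predicted lemma matches the reference lemma. A minimal sketch of the |
| arithmetic (an illustrative helper, not the OpenNLP evaluator itself): |
| </p> |

```java
// Sketch of lemma accuracy: correct predictions divided by total tokens.
// Illustrative only; the OpenNLP LemmatizerEvaluator computes this internally.
public class LemmaAccuracySketch {

    public static double accuracy(String[] predicted, String[] reference) {
        if (predicted.length != reference.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        int correct = 0;
        for (int i = 0; i < predicted.length; i++) {
            if (predicted[i].equals(reference[i])) {
                correct++;
            }
        }
        return (double) correct / predicted.length;
    }

    public static void main(String[] args) {
        String[] pred = { "he", "reckon", "the", "current" };
        String[] gold = { "he", "reckon", "the", "currents" };
        System.out.println(accuracy(pred, gold)); // 0.75
    }
}
```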
| </div> |
| </div> |
| <div class="chapter" title="Chapter 9. Chunker"><div class="titlepage"><div><div><h2 class="title"><a name="tools.chunker"></a>Chapter 9. Chunker</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.parser.chunking">Chunking</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.chunking.cmdline">Chunker Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.chunking.api">Chunking API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.training">Chunker Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.chunker.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.chunker.evaluation">Chunker Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.chunker.evaluation.tool">Chunker Evaluation Tool</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| <div class="section" title="Chunking"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.parser.chunking"></a>Chunking</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.parser.chunking.cmdline">Chunker Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.chunking.api">Chunking API</a></span></dt></dl></div> |
| |
| <p> |
| Text chunking consists of dividing a text into syntactically correlated groups of words, |
| such as noun groups and verb groups, but does not specify their internal structure or their role in the main sentence. |
| </p> |
| |
| <div class="section" title="Chunker Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.chunking.cmdline"></a>Chunker Tool</h3></div></div></div> |
| |
| <p> |
| The easiest way to try out the Chunker is the command line tool. The tool is only intended |
| for demonstration and testing. |
| </p> |
| <p> |
| Download the English maxent chunker model from the website and start the Chunker Tool with this command: |
| </p> |
| <p> |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerME en-chunker.bin |
| </pre><p> |
| The Chunker now reads one POS-tagged sentence per line from stdin. |
| Copy these two sentences to the console: |
| </p><pre class="screen"> |
| |
| Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD |
| a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP |
| to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._. |
| Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD |
| additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._. |
| </pre><p> |
| The Chunker will now echo the sentences with their tokens grouped into chunks to the console: |
| </p><pre class="screen"> |
| |
| [NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] |
| [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] |
| [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] |
| [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._. |
| [NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ] |
| [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ] |
| [PP for_IN ] [NP the_DT planes_NNS ] ._. |
| </pre><p> |
| The tag set used by the English pos model is the <a class="ulink" href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="_top">Penn Treebank tag set</a>. |
| </p> |
| </div> |
| <div class="section" title="Chunking API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.chunking.api"></a>Chunking API</h3></div></div></div> |
| |
| <p> |
| The Chunker can be embedded into an application via its API. |
| First the chunker model must be loaded into memory from disk or another source. |
| In the sample below it is loaded from disk. |
| </p><pre class="programlisting"> |
| |
| ChunkerModel model = null; |
| |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-chunker.bin"</i></b>)) { |
|   model = <b class="hl-keyword">new</b> ChunkerModel(modelIn); |
| } |
| </pre><p> |
| After the model is loaded a Chunker can be instantiated. |
| </p><pre class="programlisting"> |
| |
| ChunkerME chunker = <b class="hl-keyword">new</b> ChunkerME(model); |
| </pre><p> |
| The Chunker instance is now ready to tag data. It expects a tokenized sentence |
| as input, represented as a String array in which each String object |
| is one token, together with an array of the POS tags associated with each token. |
| </p> |
| <p> |
| The following code shows how to determine the most likely chunk tag sequence for a sentence. |
| </p><pre class="programlisting"> |
| |
| String sent[] = <b class="hl-keyword">new</b> String[] { <b class="hl-string"><i style="color:red">"Rockwell"</i></b>, <b class="hl-string"><i style="color:red">"International"</i></b>, <b class="hl-string"><i style="color:red">"Corp."</i></b>, <b class="hl-string"><i style="color:red">"'s"</i></b>, |
| <b class="hl-string"><i style="color:red">"Tulsa"</i></b>, <b class="hl-string"><i style="color:red">"unit"</i></b>, <b class="hl-string"><i style="color:red">"said"</i></b>, <b class="hl-string"><i style="color:red">"it"</i></b>, <b class="hl-string"><i style="color:red">"signed"</i></b>, <b class="hl-string"><i style="color:red">"a"</i></b>, <b class="hl-string"><i style="color:red">"tentative"</i></b>, <b class="hl-string"><i style="color:red">"agreement"</i></b>, |
| <b class="hl-string"><i style="color:red">"extending"</i></b>, <b class="hl-string"><i style="color:red">"its"</i></b>, <b class="hl-string"><i style="color:red">"contract"</i></b>, <b class="hl-string"><i style="color:red">"with"</i></b>, <b class="hl-string"><i style="color:red">"Boeing"</i></b>, <b class="hl-string"><i style="color:red">"Co."</i></b>, <b class="hl-string"><i style="color:red">"to"</i></b>, |
| <b class="hl-string"><i style="color:red">"provide"</i></b>, <b class="hl-string"><i style="color:red">"structural"</i></b>, <b class="hl-string"><i style="color:red">"parts"</i></b>, <b class="hl-string"><i style="color:red">"for"</i></b>, <b class="hl-string"><i style="color:red">"Boeing"</i></b>, <b class="hl-string"><i style="color:red">"'s"</i></b>, <b class="hl-string"><i style="color:red">"747"</i></b>, |
| <b class="hl-string"><i style="color:red">"jetliners"</i></b>, <b class="hl-string"><i style="color:red">"."</i></b> }; |
| |
| String pos[] = <b class="hl-keyword">new</b> String[] { <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"POS"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, |
| <b class="hl-string"><i style="color:red">"VBD"</i></b>, <b class="hl-string"><i style="color:red">"PRP"</i></b>, <b class="hl-string"><i style="color:red">"VBD"</i></b>, <b class="hl-string"><i style="color:red">"DT"</i></b>, <b class="hl-string"><i style="color:red">"JJ"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, <b class="hl-string"><i style="color:red">"VBG"</i></b>, <b class="hl-string"><i style="color:red">"PRP$"</i></b>, <b class="hl-string"><i style="color:red">"NN"</i></b>, <b class="hl-string"><i style="color:red">"IN"</i></b>, |
| <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"TO"</i></b>, <b class="hl-string"><i style="color:red">"VB"</i></b>, <b class="hl-string"><i style="color:red">"JJ"</i></b>, <b class="hl-string"><i style="color:red">"NNS"</i></b>, <b class="hl-string"><i style="color:red">"IN"</i></b>, <b class="hl-string"><i style="color:red">"NNP"</i></b>, <b class="hl-string"><i style="color:red">"POS"</i></b>, <b class="hl-string"><i style="color:red">"CD"</i></b>, <b class="hl-string"><i style="color:red">"NNS"</i></b>, |
| <b class="hl-string"><i style="color:red">"."</i></b> }; |
| |
| String tag[] = chunker.chunk(sent, pos); |
| </pre><p> |
| The tags array contains one chunk tag for each token in the input array. The corresponding |
| tag can be found at the same index as the token has in the input array. |
| The confidence scores for the returned tags can be easily retrieved from |
| a ChunkerME with the following method call: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">double</b> probs[] = chunker.probs(); |
| </pre><p> |
| The call to probs is stateful and will always return the probabilities of the last |
| tagged sentence. The probs method should only be called when the tag method |
| was called before, otherwise the behavior is undefined. |
| </p> |
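| <p> |
| One common use of the confidence scores is to flag tokens whose chunk tag |
| falls below some threshold so they can be reviewed. A small sketch of that |
| filtering step follows; the helper class and the threshold value are |
| illustrative and not part of the OpenNLP API. |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: collect the indices of chunk tags below a confidence threshold.
// The probs array is assumed to be parallel to the tags returned by chunk().
public class LowConfidenceSketch {

    public static List<Integer> lowConfidence(double[] probs, double threshold) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < probs.length; i++) {
            if (probs[i] < threshold) {
                indices.add(i);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        double[] probs = { 0.98, 0.57, 0.91 };
        System.out.println(lowConfidence(probs, 0.8)); // [1]
    }
}
```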
| <p> |
| Some applications need to retrieve the n-best chunk tag sequences and not |
| only the best sequence. |
| The topKSequences method is capable of returning the top sequences. |
| It can be called in a similar way as chunk. |
| </p><pre class="programlisting"> |
| |
| Sequence topSequences[] = chunker.topKSequences(sent, pos); |
| </pre><p> |
| Each Sequence object contains one sequence. The sequence can be retrieved |
| via Sequence.getOutcomes() which returns a tags array |
| and Sequence.getProbs() returns the probability array for this sequence. |
| </p> |
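| <p> |
| The bracketed output printed by the command line tool can be reconstructed |
| from the parallel token and chunk tag arrays: a "B-" tag opens a new group, |
| an "I-" tag continues it, and "O" closes any open group. The following |
| self-contained sketch illustrates the grouping; it is simplified (the real |
| tool also prints the POS tag attached to each token) and the class name is |
| made up for illustration. |
| </p> |

```java
// Sketch: turn parallel token/chunk-tag arrays into "[NP token ...]" groups.
// Simplified illustration of the grouping, not the OpenNLP implementation.
public class ChunkGroupingSketch {

    public static String group(String[] tokens, String[] tags) {
        StringBuilder sb = new StringBuilder();
        boolean open = false;
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("B-")) {
                if (open) sb.append("] ");        // close the previous chunk
                sb.append("[").append(tags[i].substring(2)).append(" ");
                open = true;
            } else if (tags[i].equals("O")) {
                if (open) { sb.append("] "); open = false; }
            }
            // "I-" tags simply continue the currently open chunk
            sb.append(tokens[i]).append(" ");
        }
        if (open) sb.append("]");
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] tokens = { "He", "reckons", "." };
        String[] tags = { "B-NP", "B-VP", "O" };
        System.out.println(group(tokens, tags)); // [NP He ] [VP reckons ] .
    }
}
```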
| </div> |
| </div> |
| <div class="section" title="Chunker Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.chunker.training"></a>Chunker Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.chunker.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.chunker.training.api">Training API</a></span></dt></dl></div> |
| |
| <p> |
| A pre-trained model might not be available for the desired language, might |
| not detect the important phrase types, or might not perform well enough outside the news domain. |
| </p> |
| <p> |
| These are the typical reasons to train a custom chunker on a new |
| corpus, or on a corpus which is extended by private training data taken from the data which should be analyzed. |
| </p> |
| <p> |
| The training data can be converted to the OpenNLP chunker training format, |
| which is based on <a class="ulink" href="http://www.cnts.ua.ac.be/conll2000/chunking" target="_top">CoNLL2000</a>. |
| Other formats may also be available. |
| The training data consist of three columns separated by a single space. Each word has been put on a |
| separate line and there is an empty line after each sentence. The first column contains |
| the current word, the second its part-of-speech tag and the third its chunk tag. |
| The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words |
| and I-VP for verb phrase words. Most chunk types have two types of chunk tags, |
| B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. |
| Here is an example of the file format: |
| </p> |
| <p> |
| Sample sentence of the training data: |
| </p><pre class="screen"> |
| |
| He PRP B-NP |
| reckons VBZ B-VP |
| the DT B-NP |
| current JJ I-NP |
| account NN I-NP |
| deficit NN I-NP |
| will MD B-VP |
| narrow VB I-VP |
| to TO B-PP |
| only RB B-NP |
| # # I-NP |
| 1.8 CD I-NP |
| billion CD I-NP |
| in IN B-PP |
| September NNP B-NP |
| . . O |
| </pre><p> |
| Note that for improved visualization the example above uses tabs instead of a single space as column separator. |
| </p> |
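| <p> |
| Reading this format amounts to splitting each non-empty line into its three |
| columns and treating an empty line as a sentence boundary. A minimal sketch |
| of such a reader follows; it is illustrative only (OpenNLP's own sample |
| stream classes do this internally) and the class name is made up. |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split CoNLL2000-style lines into sentences of (word, pos, chunk) rows.
// Illustrative only; OpenNLP's ChunkSampleStream performs this parsing itself.
public class ChunkFormatSketch {

    public static List<List<String[]>> read(List<String> lines) {
        List<List<String[]>> sentences = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                // an empty line terminates the current sentence
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line.trim().split("\\s+"));
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("He PRP B-NP", "reckons VBZ B-VP", "", "It PRP B-NP");
        System.out.println(read(lines).size()); // 2
    }
}
```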
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.chunker.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models available from the |
| model download page on various corpora. |
| </p> |
| <p> |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerTrainerME |
| Usage: opennlp ChunkerTrainerME[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \ |
| -model modelFile -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| It is now assumed that the English chunker model should be trained from a file called |
| en-chunker.train which is encoded as UTF-8. The following command will train the |
| chunker and write the model to en-chunker.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8 |
| </pre><p> |
| Additionally it is possible to specify the number of iterations, the cutoff and to overwrite |
| all types in the training data with a single type. |
| </p> |
| </div> |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.chunker.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Chunker offers an API to train a new chunker model. The following sample code |
| illustrates how to do it: |
| </p><pre class="programlisting"> |
| |
| ObjectStream<String> lineStream = |
| <b class="hl-keyword">new</b> PlainTextByLineStream(<b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"en-chunker.train"</i></b>)), StandardCharsets.UTF_<span class="hl-number">8</span>); |
| |
| ChunkerModel model; |
| |
| <b class="hl-keyword">try</b>(ObjectStream<ChunkSample> sampleStream = <b class="hl-keyword">new</b> ChunkSampleStream(lineStream)) { |
| model = ChunkerME.train(<b class="hl-string"><i style="color:red">"eng"</i></b>, sampleStream, |
| TrainingParameters.defaultParams(), <b class="hl-keyword">new</b> ChunkerFactory()); |
| } |
| |
| <b class="hl-keyword">try</b> (OutputStream modelOut = <b class="hl-keyword">new</b> BufferedOutputStream(<b class="hl-keyword">new</b> FileOutputStream(modelFile))) { |
| model.serialize(modelOut); |
| } |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="Chunker Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.chunker.evaluation"></a>Chunker Evaluation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.chunker.evaluation.tool">Chunker Evaluation Tool</a></span></dt></dl></div> |
| |
| <p> |
| The built-in evaluation can measure the chunker performance. The performance is either |
| measured on a test dataset or via cross validation. |
| </p> |
| <div class="section" title="Chunker Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.chunker.evaluation.tool"></a>Chunker Evaluation Tool</h3></div></div></div> |
| |
| <p> |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerEvaluator |
| Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] \ |
| [-detailedF true|false] -lang language -data sampleData [-encoding charsetName] |
| </pre><p> |
| A sample of the command considering you have a data sample named en-chunker.eval |
| and you trained a model called en-chunker.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerEvaluator -model en-chunker.bin -data en-chunker.eval -encoding UTF-8 |
| </pre><p> |
| and here is a sample output: |
| </p><pre class="screen"> |
| |
| Precision: 0.9255923572240226 |
| Recall: 0.9220610430991112 |
| F-Measure: 0.9238233255623465 |
| </pre><p> |
| You can also use the tool to perform 10-fold cross validation of the Chunker. |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerCrossValidator |
| Usage: opennlp ChunkerCrossValidator[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \ |
| [-misclassified true|false] [-folds num] [-detailedF true|false] \ |
| -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true will print detailed FMeasure results. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| </pre><p> |
| It is not necessary to pass a model. The tool will automatically split the data to train and evaluate: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerCrossValidator -lang en -data en-chunker.cross -encoding UTF-8 |
| </pre><p> |
| </p> |
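| <p> |
| The precision, recall and F-measure values reported by the evaluator are |
| related by the standard formulas: precision is true positives over predicted |
| positives, recall is true positives over reference positives, and F-measure |
| is their harmonic mean. A minimal sketch of the arithmetic (illustrative |
| only, not the OpenNLP evaluator): |
| </p> |

```java
// Sketch: F-measure as the harmonic mean of precision and recall.
// Illustrative only; the OpenNLP evaluators compute these values internally.
public class FMeasureSketch {

    // tp = true positives, fp = false positives, fn = false negatives
    public static double precision(int tp, int fp) { return (double) tp / (tp + fp); }

    public static double recall(int tp, int fn) { return (double) tp / (tp + fn); }

    public static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double p = precision(90, 10); // 0.9
        double r = recall(90, 30);    // 0.75
        System.out.println(fMeasure(p, r)); // about 0.818
    }
}
```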
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 10. Parser"><div class="titlepage"><div><div><h2 class="title"><a name="tools.parser"></a>Chapter 10. Parser</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.parser.parsing">Parsing</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.parsing.cmdline">Parser Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.parsing.api">Parsing API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.training">Parser Training</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.training.api">Training API</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.parser.evaluation">Parser Evaluation</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.parser.evaluation.tool">Parser Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.evaluation.api">Evaluation API</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| <div class="section" title="Parsing"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.parser.parsing"></a>Parsing</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.parser.parsing.cmdline">Parser Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.parsing.api">Parsing API</a></span></dt></dl></div> |
| |
| <p> |
| A parser returns a parse tree from a sentence according to a phrase structure grammar. A parse tree specifies |
| the internal structure of a sentence. For example, the following image represents a parse tree for |
| the sentence 'The cellphone was broken in two days': |
| |
| <img src="images/parsetree1.png"> |
| |
| A parse tree can be used to determine the role of subtrees or constituents in the sentence. For example, it is possible to |
| know that 'The cellphone' is the subject of the sentence and the verb (action) is 'was broken.' |
| </p> |
| |
| <div class="section" title="Parser Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.parsing.cmdline"></a>Parser Tool</h3></div></div></div> |
| |
| <p> |
| The easiest way to try out the Parser is the command line tool. |
| The tool is only intended for demonstration and testing. |
| Download the English chunking parser model from our website and start the Parse |
| Tool with the following command. |
| </p><pre class="screen"> |
| |
| $ opennlp Parser en-parser-chunking.bin |
| </pre><p> |
| Loading the big parser model can take several seconds, be patient. |
| Copy this sample sentence to the console. |
| </p><pre class="screen"> |
| |
| The cellphone was broken in two days . |
| </pre><p> |
| The parser should now print the following to the console. |
| </p><pre class="screen"> |
| |
| (TOP (S (NP (DT The) (NN cellphone)) (VP (VBD was) (VP (VBN broken) (PP (IN in) (NP (CD two) (NNS days))))) (. .))) |
| </pre><p> |
| With the following command the input can be read from a file and be written to an output file. |
| </p><pre class="screen"> |
| |
| $ opennlp Parser en-parser-chunking.bin < article-tokenized.txt > article-parsed.txt |
| </pre><p> |
| The article-tokenized.txt file must contain one sentence per line which is |
| tokenized with the English tokenizer model from our website. |
| See the Tokenizer documentation for further details. |
| </p> |
| </div> |
| <div class="section" title="Parsing API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.parsing.api"></a>Parsing API</h3></div></div></div> |
| |
| <p> |
| The Parser can be easily integrated into an application via its API. |
| To instantiate a Parser the parser model must be loaded first. |
| </p><pre class="programlisting"> |
| |
| ParserModel model = null; |
| |
| <b class="hl-keyword">try</b> (InputStream modelIn = <b class="hl-keyword">new</b> FileInputStream(<b class="hl-string"><i style="color:red">"en-parser-chunking.bin"</i></b>)) { |
|   model = <b class="hl-keyword">new</b> ParserModel(modelIn); |
| } |
| <b class="hl-keyword">catch</b> (IOException e) { |
|   e.printStackTrace(); |
| } |
| </pre><p> |
| Unlike the other components, the Parser should be instantiated via a |
| factory method instead of the new operator. |
| A parser model is trained either for the chunking parser or for the tree |
| insert parser, so the matching parser implementation must be chosen. |
| The factory method will read a type parameter from the model and create |
| an instance of the corresponding parser implementation. |
| </p><pre class="programlisting"> |
| |
| Parser parser = ParserFactory.create(model); |
| </pre><p> |
| Right now the tree insert parser is still experimental and there is no pre-trained model for it. |
| The parser expects a whitespace tokenized sentence. A utility method from the command |
| line tool can parse the sentence String. The following code shows how the parser can be called. |
| </p><pre class="programlisting"> |
| |
| String sentence = <b class="hl-string"><i style="color:red">"The quick brown fox jumps over the lazy dog ."</i></b>; |
| Parse topParses[] = ParserTool.parseLine(sentence, parser, <span class="hl-number">1</span>); |
| </pre><p> |
| |
| The topParses array only contains one parse because the number of parses is set to 1. |
| The Parse object contains the parse tree. |
| To display the parse tree call the show method. It either prints the parse to |
| the console or into a provided StringBuffer, similar to Exception.printStackTrace. |
| </p> |
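| <p> |
| For a quick look at the structure, the bracketed output can also be |
| inspected as plain text. The following toy sketch extracts the label of the |
| outermost constituent from a bracketed parse string; it is an illustration |
| only, since the Parse object itself provides proper accessors for the node |
| type and its children. |
| </p> |

```java
// Sketch: read the label of the outermost constituent of a bracketed parse
// string such as "(TOP (S ...))". Toy illustration, not the Parse API.
public class BracketLabelSketch {

    public static String topLabel(String bracketedParse) {
        String s = bracketedParse.trim();
        if (!s.startsWith("(")) {
            throw new IllegalArgumentException("not a bracketed parse: " + s);
        }
        // the label runs from just after "(" up to the next space or "("
        int end = 1;
        while (end < s.length() && s.charAt(end) != ' ' && s.charAt(end) != '(') {
            end++;
        }
        return s.substring(1, end);
    }

    public static void main(String[] args) {
        System.out.println(topLabel("(TOP (S (NP (DT The) (NN cellphone))))")); // TOP
    }
}
```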
| <p> |
| TODO: Extend this section with more information about the Parse object. |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Parser Training"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.parser.training"></a>Parser Training</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.parser.training.tool">Training Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.training.api">Training API</a></span></dt></dl></div> |
| |
| <p> |
| OpenNLP offers two different parser implementations, the chunking parser and the |
| treeinsert parser. The latter is still experimental and not recommended for production use. |
| (TODO: Add a section which explains the two different approaches) |
| The training can either be done with the command line tool or the training API. |
| In the first case the training data must be available in the OpenNLP format, which is |
| the Penn Treebank format, but with the limitation of one sentence per line. |
| </p><pre class="programlisting"> |
| |
| (TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) )) |
| (TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') )) |
| </pre><p> |
| Penn Treebank annotation guidelines can be found on the |
| <a class="ulink" href="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html" target="_top">Penn Treebank home page</a>. |
| A parser model also contains a POS tagger model. Depending on the amount of available |
| training data, it is recommended to replace that tagger model with one which |
| was trained on a larger corpus. The pre-trained parser model provided on the website |
| does this to achieve a better performance. (TODO: On which data is the model on |
| the website trained, and say on which data the tagger model is trained) |
| </p> |
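<p>
The one-sentence-per-line constraint of the training format can be sanity-checked before training. The following is a plain-Java sketch, not part of the OpenNLP API; the class and method names are illustrative:
</p>

```java
public class TreebankFormatCheck {

    // Returns true if the line looks like one complete bracketed sentence:
    // it starts with "(TOP", parentheses balance, and the nesting depth
    // never drops below zero.
    static boolean isWellFormed(String line) {
        int depth = 0;
        for (char c : line.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                if (--depth < 0) {
                    return false;
                }
            }
        }
        return depth == 0 && line.trim().startsWith("(TOP");
    }

    public static void main(String[] args) {
        String sample = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))";
        System.out.println(isWellFormed(sample)); // the sample line above is well-formed
    }
}
```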
| <div class="section" title="Training Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.training.tool"></a>Training Tool</h3></div></div></div> |
| |
| <p> |
| OpenNLP has a command line tool which is used to train the models available from the |
| model download page on various corpora. The data must be converted to the OpenNLP parser |
| training format, which is briefly described above. |
| To train the parser a head rules file is also needed. (TODO: Add documentation about the head rules file) |
| Usage of the tool: |
| </p><pre class="screen"> |
| |
| $ opennlp ParserTrainer |
| Usage: opennlp ParserTrainer -headRules headRulesFile [-parserType CHUNKING|TREEINSERT] \ |
| [-params paramsFile] [-iterations num] [-cutoff num] \ |
| -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -headRules headRulesFile |
| head rules file. |
| -parserType CHUNKING|TREEINSERT |
| one of CHUNKING or TREEINSERT, default is CHUNKING. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -format formatName |
| data format, might have its own parameters. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| </pre><p> |
| The model on the website was trained with the following command: |
| </p><pre class="screen"> |
| |
| $ opennlp ParserTrainer -model en-parser-chunking.bin -parserType CHUNKING \ |
| -headRules head_rules \ |
| -lang en -data train.all -encoding ISO-8859-1 |
| |
| </pre><p> |
| It is also possible to specify the cutoff and the number of iterations; these parameters |
| are used for all trained models. The -parserType parameter is optional; |
| to use the tree insertion parser, specify TREEINSERT as the type. The TaggerModelReplacer |
| tool replaces the tagger model inside the parser model with a new one. |
| </p> |
| <p> |
| Note: The original parser model will be overwritten with the new parser model which |
| contains the replaced tagger model. |
| </p><pre class="screen"> |
| |
| $ opennlp TaggerModelReplacer en-parser-chunking.bin en-pos-maxent.bin |
| </pre><p> |
| Additionally, there are tools to retrain only the build or the check model. |
| </p> |
| </div> |
| |
| <div class="section" title="Training API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.training.api"></a>Training API</h3></div></div></div> |
| |
| <p> |
| The Parser training API supports the training of a new parser model. |
| Four steps are necessary to train it: |
| </p><div class="itemizedlist"><ul class="itemizedlist" type="disc"><li class="listitem"> |
| <p>A HeadRules class needs to be instantiated: currently EnglishHeadRules and AncoraSpanishHeadRules are available.</p> |
| </li><li class="listitem"> |
| <p>The application must open a sample data stream.</p> |
| </li><li class="listitem"> |
| <p>Call a Parser train method: This can be either the CHUNKING or the TREEINSERT parser.</p> |
| </li><li class="listitem"> |
| <p>Save the ParserModel to a file.</p> |
| </li></ul></div><p> |
| The following code snippet shows how to instantiate the HeadRules: |
| </p><pre class="programlisting"> |
| |
| <b class="hl-keyword">static</b> HeadRules createHeadRules(TrainerToolParams params) <b class="hl-keyword">throws</b> IOException { |
| |
| ArtifactSerializer headRulesSerializer = null; |
| |
| <b class="hl-keyword">if</b> (params.getHeadRulesSerializerImpl() != null) { |
| headRulesSerializer = ExtensionLoader.instantiateExtension(ArtifactSerializer.<b class="hl-keyword">class</b>, |
| params.getHeadRulesSerializerImpl()); |
| } |
| <b class="hl-keyword">else</b> { |
| <b class="hl-keyword">if</b> (<b class="hl-string"><i style="color:red">"eng"</i></b>.equals(params.getLang())) { |
| headRulesSerializer = <b class="hl-keyword">new</b> opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| <b class="hl-keyword">else</b> <b class="hl-keyword">if</b> (<b class="hl-string"><i style="color:red">"es"</i></b>.equals(params.getLang())) { |
| headRulesSerializer = <b class="hl-keyword">new</b> opennlp.tools.parser.lang.es.AncoraSpanishHeadRules.HeadRulesSerializer(); |
| } |
| <b class="hl-keyword">else</b> { |
| <i class="hl-comment" style="color: silver">// default for now, this case should probably cause an error ...</i> |
| headRulesSerializer = <b class="hl-keyword">new</b> opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| } |
| |
| Object headRulesObject = headRulesSerializer.create(<b class="hl-keyword">new</b> FileInputStream(params.getHeadRules())); |
| |
| <b class="hl-keyword">if</b> (headRulesObject <b class="hl-keyword">instanceof</b> HeadRules) { |
| <b class="hl-keyword">return</b> (HeadRules) headRulesObject; |
| } |
| <b class="hl-keyword">else</b> { |
| <b class="hl-keyword">throw</b> <b class="hl-keyword">new</b> TerminateToolException(-<span class="hl-number">1</span>, <b class="hl-string"><i style="color:red">"HeadRules Artifact Serializer must create an object of type HeadRules!"</i></b>); |
| } |
| } |
| </pre><p> |
| The following code illustrates the three other steps, namely, opening the data, training |
| the model and saving the ParserModel into an output file. |
| </p><pre class="programlisting"> |
| |
| ParserModel model = null; |
| File modelOutFile = params.getModel(); |
| CmdLineUtil.checkOutputFile(<b class="hl-string"><i style="color:red">"parser model"</i></b>, modelOutFile); |
| |
| <i class="hl-comment" style="color: silver">// declared outside the try block so it can be closed in the finally block</i> |
| ObjectStream<Parse> sampleStream = null; |
| <b class="hl-keyword">try</b> { |
| HeadRules rules = createHeadRules(params); |
| InputStreamFactory inputStreamFactory = <b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"parsing.train"</i></b>)); |
| ObjectStream<String> stringStream = <b class="hl-keyword">new</b> PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_<span class="hl-number">8</span>); |
| sampleStream = <b class="hl-keyword">new</b> ParseSampleStream(stringStream); |
| |
| <i class="hl-comment" style="color: silver">// the training parameters, e.g. the defaults or parameters loaded from a params file</i> |
| TrainingParameters mlParams = ModelUtil.createDefaultTrainingParameters(); |
| ParserType type = parseParserType(params.getParserType()); |
| <b class="hl-keyword">if</b> (ParserType.CHUNKING.equals(type)) { |
| model = opennlp.tools.parser.chunking.Parser.train( |
| params.getLang(), sampleStream, rules, |
| mlParams); |
| } <b class="hl-keyword">else</b> <b class="hl-keyword">if</b> (ParserType.TREEINSERT.equals(type)) { |
| model = opennlp.tools.parser.treeinsert.Parser.train(params.getLang(), sampleStream, rules, |
| mlParams); |
| } |
| } |
| <b class="hl-keyword">catch</b> (IOException e) { |
| <b class="hl-keyword">throw</b> <b class="hl-keyword">new</b> TerminateToolException(-<span class="hl-number">1</span>, <b class="hl-string"><i style="color:red">"IO error while reading training data or indexing data: "</i></b> |
| + e.getMessage(), e); |
| } |
| <b class="hl-keyword">finally</b> { |
| <b class="hl-keyword">if</b> (sampleStream != null) { |
| <b class="hl-keyword">try</b> { |
| sampleStream.close(); |
| } |
| <b class="hl-keyword">catch</b> (IOException e) { |
| <i class="hl-comment" style="color: silver">// closing the sample stream failed, nothing more to do</i> |
| } |
| } |
| } |
| CmdLineUtil.writeModel(<b class="hl-string"><i style="color:red">"parser"</i></b>, modelOutFile, model); |
| |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| <div class="section" title="Parser Evaluation"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.parser.evaluation"></a>Parser Evaluation</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.parser.evaluation.tool">Parser Evaluation Tool</a></span></dt><dt><span class="section"><a href="#tools.parser.evaluation.api">Evaluation API</a></span></dt></dl></div> |
| |
| <p> |
| The built-in evaluation can measure the parser performance. The |
| performance is measured on a test dataset. |
| </p> |
| <div class="section" title="Parser Evaluation Tool"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.evaluation.tool"></a>Parser Evaluation Tool</h3></div></div></div> |
| |
| <p> |
| The following command shows how the tool can be run: |
| </p><pre class="screen"> |
| |
| $ opennlp ParserEvaluator |
| Usage: opennlp ParserEvaluator[.ontonotes|frenchtreebank] [-misclassified true|false] -model model \ |
| -data sampleData [-encoding charsetName] |
| </pre><p> |
| An example of the command, assuming you have a test dataset named |
| en-parser-chunking.eval |
| and a trained model called en-parser-chunking.bin: |
| </p><pre class="screen"> |
| |
| $ opennlp ParserEvaluator -model en-parser-chunking.bin -data en-parser-chunking.eval -encoding UTF-8 |
| </pre><p> |
| and here is a sample output: |
| </p><pre class="screen"> |
| |
| Precision: 0.9009744742967609 |
| Recall: 0.8962012400910446 |
| F-Measure: 0.8985815184245214 |
| </pre><p> |
| </p> |
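<p>
The reported F-Measure is the harmonic mean of precision and recall. The following plain-Java sketch, not part of the OpenNLP API, reproduces the figure from the sample output above:
</p>

```java
public class FMeasureCheck {

    // harmonic mean of precision and recall, as used by PARSEVAL-style scoring
    static double fMeasure(double precision, double recall) {
        if (precision + recall == 0) {
            return 0;
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // precision and recall taken from the sample ParserEvaluator output above
        double f = fMeasure(0.9009744742967609, 0.8962012400910446);
        System.out.println(f); // approximately 0.8986
    }
}
```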
| <p> |
| The Parser Evaluation tool reimplements the PARSEVAL scoring method |
| as implemented by the |
| <a class="ulink" href="http://nlp.cs.nyu.edu/evalb/" target="_top">EVALB</a> |
| script, which is the most widely used evaluation |
| tool for constituent parsing. Note, however, that currently the Parser |
| Evaluation tool does not allow |
| making exceptions in the constituents to be evaluated, in the way |
| Collins or Bikel usually do. Any |
| contributions are very welcome. If you want to contribute please contact us on |
| the mailing list or comment |
| on the jira issue |
| <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-688" target="_top">OPENNLP-688</a>. |
| </p> |
| </div> |
| <div class="section" title="Evaluation API"><div class="titlepage"><div><div><h3 class="title"><a name="tools.parser.evaluation.api"></a>Evaluation API</h3></div></div></div> |
| |
| <p> |
| The evaluation can be performed on a pre-trained model and a test dataset, or via cross validation. |
| In the first case the model must be loaded and a Parse ObjectStream must be created (see the code samples above). |
| Assuming these two objects exist, the following code shows how to perform the evaluation: |
| </p><pre class="programlisting"> |
| |
| Parser parser = ParserFactory.create(model); |
| ParserEvaluator evaluator = <b class="hl-keyword">new</b> ParserEvaluator(parser); |
| evaluator.evaluate(sampleStream); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString()); |
| </pre><p> |
| In the cross validation case all the training arguments must be |
| provided (see the Training API section above). |
| To perform cross validation the ObjectStream must be resettable. |
| </p><pre class="programlisting"> |
| |
| InputStreamFactory inputStreamFactory = <b class="hl-keyword">new</b> MarkableFileInputStreamFactory(<b class="hl-keyword">new</b> File(<b class="hl-string"><i style="color:red">"parsing.train"</i></b>)); |
| ObjectStream<String> stringStream = <b class="hl-keyword">new</b> PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_<span class="hl-number">8</span>); |
| ObjectStream<Parse> sampleStream = <b class="hl-keyword">new</b> ParseSampleStream(stringStream); |
| ParserCrossValidator evaluator = <b class="hl-keyword">new</b> ParserCrossValidator(<b class="hl-string"><i style="color:red">"eng"</i></b>, trainParameters, headRules, |
| parserType, listeners.toArray(<b class="hl-keyword">new</b> ParserEvaluationMonitor[listeners.size()])); |
| evaluator.evaluate(sampleStream, <span class="hl-number">10</span>); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString()); |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 11. Coreference Resolution"><div class="titlepage"><div><div><h2 class="title"><a name="tools.coref"></a>Chapter 11. Coreference Resolution</h2></div></div></div> |
| |
| <p> |
| The OpenNLP Coreference Resolution system links multiple mentions of an |
| entity in a document together. |
| The OpenNLP implementation is currently limited to noun phrase mentions; |
| other mention types cannot be resolved. |
| </p> |
| |
| <p> |
| TODO: Write more documentation about the coref component. Any contributions |
| are very welcome. If you want to contribute please contact us on the mailing list |
| or comment on the jira issue <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-48" target="_top">OPENNLP-48</a>. |
| </p> |
| |
| </div> |
| <div class="chapter" title="Chapter 12. Extending OpenNLP"><div class="titlepage"><div><div><h2 class="title"><a name="tools.extension"></a>Chapter 12. Extending OpenNLP</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.extension.writing">Writing an extension</a></span></dt><dt><span class="section"><a href="#tools.extension.osgi">Running in an OSGi container</a></span></dt></dl></div> |
| |
| <p> |
| In OpenNLP, extensions can be used to add new functionality and to |
| heavily customize an existing component. Most components define |
| a factory class which can be implemented to customize their creation, |
| and some components allow adding new feature generators. |
| </p> |
| |
| <div class="section" title="Writing an extension"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.extension.writing"></a>Writing an extension</h2></div></div></div> |
| |
| <p> |
| In many places it is possible to pass in an extension class name to customize |
| some aspect of OpenNLP. The implementation class needs to implement the specified |
| interface and should have a public no-argument constructor. |
| </p> |
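<p>
A sketch of what this looks like in practice; the Normalizer interface and LowerCaseNormalizer class below are hypothetical, but the pattern (load the implementation by its fully qualified name and call the public no-argument constructor) is the one described above:
</p>

```java
public class ExtensionLoadingSketch {

    // a hypothetical extension point
    public interface Normalizer {
        String normalize(String s);
    }

    // a hypothetical extension: implements the interface
    // and provides a public no-argument constructor
    public static class LowerCaseNormalizer implements Normalizer {
        public LowerCaseNormalizer() {
        }
        public String normalize(String s) {
            return s.toLowerCase();
        }
    }

    public static void main(String[] args) throws Exception {
        // the extension is referenced by its fully qualified class name
        String implName = ExtensionLoadingSketch.class.getName() + "$LowerCaseNormalizer";
        Normalizer n = (Normalizer) Class.forName(implName)
            .getDeclaredConstructor().newInstance();
        System.out.println(n.normalize("OpenNLP")); // prints "opennlp"
    }
}
```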
| </div> |
| |
| <div class="section" title="Running in an OSGi container"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.extension.osgi"></a>Running in an OSGi container</h2></div></div></div> |
| |
| <p> |
| The traditional way of loading an extension via Class.forName does not work |
| in an OSGi environment because the class paths of the OpenNLP Tools and extension |
| bundle are isolated. OSGi uses services to provide functionality from one bundle |
| to another. The extension bundle must register its extensions as services so that |
| the OpenNLP tools bundle can use them. |
| The following code illustrates how that can be done: |
| </p><pre class="programlisting"> |
| |
| Dictionary<String, String> props = <b class="hl-keyword">new</b> Hashtable<String, String>(); |
| props.put(ExtensionServiceKeys.ID, <b class="hl-string"><i style="color:red">"org.test.SuperTokenizer"</i></b>); |
| context.registerService(Tokenizer.<b class="hl-keyword">class</b>.getName(), <b class="hl-keyword">new</b> org.test.SuperTokenizer(), props); |
| </pre><p> |
| The service OpenNLP is looking for might not yet be available; in this case OpenNLP |
| waits until a timeout is reached. If loading the extension fails, an ExtensionNotLoadedException |
| is thrown. This exception is also thrown when the thread is interrupted while it is waiting for the |
| extension; the interrupted flag is set again so the calling code has a chance to handle it. |
| </p> |
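<p>
The wait-with-timeout behaviour can be illustrated with plain java.util.concurrent primitives. The registry and waitForService method below are hypothetical stand-ins for the OSGi service lookup, and IllegalStateException stands in for ExtensionNotLoadedException:
</p>

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class TimeoutLookup {

    // Wait up to the given number of milliseconds for a service to appear.
    static <T> T waitForService(BlockingQueue<T> registry, long millis)
            throws InterruptedException {
        T service = registry.poll(millis, TimeUnit.MILLISECONDS);
        if (service == null) {
            // OpenNLP throws ExtensionNotLoadedException in this situation
            throw new IllegalStateException("extension not loaded within timeout");
        }
        return service;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> registry = new ArrayBlockingQueue<>(1);
        registry.offer("org.test.SuperTokenizer");
        System.out.println(waitForService(registry, 100)); // prints "org.test.SuperTokenizer"
    }
}
```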
| </div> |
| </div> |
| <div class="chapter" title="Chapter 13. Corpora"><div class="titlepage"><div><div><h2 class="title"><a name="tools.corpora"></a>Chapter 13. Corpora</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.corpora.conll">CONLL</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000">CONLL 2000</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.evaluation">Evaluating</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2002">CONLL 2002</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2002.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.training.spanish">Training with Spanish data</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2003">CONLL 2003</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2003.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.training.english">Training with English data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.evaluation.english">Evaluating with English data</a></span></dt></dl></dd></dl></dd><dt><span class="section"><a href="#tools.corpora.arvores-deitadas">Arvores Deitadas</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.corpora.arvores-deitadas.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.evaluation">Training and Evaluation</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.ontonotes">OntoNotes Release 4.0</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.ontonotes.namefinder">Name Finder Training</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.brat">Brat Format Support</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.brat.webtool">Sentences and Tokens</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.evaluation">Evaluation</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.cross-validation">Cross Validation</a></span></dt></dl></dd></dl></div> |
| |
| |
| <p> |
| OpenNLP has built-in support to convert the various corpora needed by the different |
| trainable components into the native training format, or to use them directly. |
| </p> |
| <div class="section" title="CONLL"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.corpora.conll"></a>CONLL</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.conll.2000">CONLL 2000</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2000.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.evaluation">Evaluating</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2002">CONLL 2002</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2002.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.training.spanish">Training with Spanish data</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.corpora.conll.2003">CONLL 2003</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.corpora.conll.2003.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.training.english">Training with English data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.evaluation.english">Evaluating with English data</a></span></dt></dl></dd></dl></div> |
| |
| <p> |
| CoNLL stands for the Conference on Computational Natural Language Learning. It is not |
| a single project but an annual conference series that organizes shared tasks on |
| natural language learning problems and provides annotated training and test data |
| for them. More information about the conference series can be found on the CoNLL website. |
| </p> |
| <div class="section" title="CONLL 2000"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.conll.2000"></a>CONLL 2000</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.conll.2000.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2000.evaluation">Evaluating</a></span></dt></dl></div> |
| |
| <p> |
| The shared task of CoNLL-2000 is Chunking. |
| </p> |
| <div class="section" title="Getting the data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2000.getting"></a>Getting the data</h4></div></div></div> |
| |
| <p> |
| CoNLL-2000 made available training and test data for the Chunk task in English. |
| The data consists of the same partitions of the Wall Street Journal corpus (WSJ) |
| as the widely used data for noun phrase chunking: sections 15-18 as training data |
| (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the |
| data has been derived from the WSJ corpus by a program written by Sabine Buchholz |
| from Tilburg University, The Netherlands. Both training and test data can be |
| obtained from <a class="ulink" href="https://www.clips.uantwerpen.be/conll2000/chunking/" target="_top">https://www.clips.uantwerpen.be/conll2000/chunking/</a>. |
| </p> |
| </div> |
| <div class="section" title="Converting the data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2000.converting"></a>Converting the data</h4></div></div></div> |
| |
| <p> |
| The data does not need to be transformed because the Apache OpenNLP Chunker follows |
| the CoNLL 2000 format for training. Check the <a class="link" href="#tools.chunker.training" title="Chunker Training">Chunker Training</a> section to learn more. |
| </p> |
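<p>
For reference, each non-empty line of the CoNLL 2000 data holds three whitespace-separated columns: the token, its POS tag, and its chunk tag. A minimal plain-Java sketch of reading one such line (the sample line is illustrative):
</p>

```java
public class Conll2000Line {

    public static void main(String[] args) {
        // an illustrative CoNLL 2000 style line: token, POS tag, chunk tag
        String line = "Confidence NN B-NP";
        String[] cols = line.split("\\s+");
        String token = cols[0];
        String posTag = cols[1];
        String chunkTag = cols[2];
        System.out.println(token + "/" + posTag + "/" + chunkTag); // prints "Confidence/NN/B-NP"
    }
}
```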
| </div> |
| <div class="section" title="Training"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2000.training"></a>Training</h4></div></div></div> |
| |
| <p> |
| We can train the model for the Chunker using the train.txt available at CONLL 2000: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerTrainerME -model en-chunker.bin -iterations 500 \ |
| -lang en -data train.txt -encoding UTF-8 |
| </pre><p> |
| </p><pre class="screen"> |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 211727 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 211727 events to 197252. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 197252 |
| Number of Outcomes: 22 |
| Number of Predicates: 107838 |
| ...done. |
| Computing model parameters... |
| Performing 500 iterations. |
| 1: .. loglikelihood=-654457.1455212828 0.2601510435608118 |
| 2: .. loglikelihood=-239513.5583724216 0.9260037690044255 |
| 3: .. loglikelihood=-141313.1386347238 0.9443387003074715 |
| 4: .. loglikelihood=-101083.50853437989 0.954375209585929 |
| ... cut lots of iterations ... |
| 498: .. loglikelihood=-1710.8874647317095 0.9995040783650645 |
| 499: .. loglikelihood=-1708.0908900815848 0.9995040783650645 |
| 500: .. loglikelihood=-1705.3045902366732 0.9995040783650645 |
| Writing chunker model ... done (4.019s) |
| |
| Wrote chunker model to path: .\en-chunker.bin |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Evaluating"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2000.evaluation"></a>Evaluating</h4></div></div></div> |
| |
| <p> |
| We evaluate the model using the file test.txt available at CONLL 2000: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerEvaluator -model en-chunker.bin -lang en -encoding utf8 -data test.txt |
| </pre><p> |
| </p><pre class="screen"> |
| |
| Loading Chunker model ... done (0,665s) |
| current: 85,8 sent/s avg: 85,8 sent/s total: 86 sent |
| current: 88,1 sent/s avg: 87,0 sent/s total: 174 sent |
| current: 156,2 sent/s avg: 110,0 sent/s total: 330 sent |
| current: 192,2 sent/s avg: 130,5 sent/s total: 522 sent |
| current: 167,2 sent/s avg: 137,8 sent/s total: 689 sent |
| current: 179,2 sent/s avg: 144,6 sent/s total: 868 sent |
| current: 183,2 sent/s avg: 150,3 sent/s total: 1052 sent |
| current: 183,2 sent/s avg: 154,4 sent/s total: 1235 sent |
| current: 169,2 sent/s avg: 156,0 sent/s total: 1404 sent |
| current: 178,2 sent/s avg: 158,2 sent/s total: 1582 sent |
| current: 172,2 sent/s avg: 159,4 sent/s total: 1754 sent |
| current: 177,2 sent/s avg: 160,9 sent/s total: 1931 sent |
| |
| |
| Average: 161,6 sent/s |
| Total: 2013 sent |
| Runtime: 12.457s |
| |
| Precision: 0.9244354736974896 |
| Recall: 0.9216837162502096 |
| F-Measure: 0.9230575441395671 |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| <div class="section" title="CONLL 2002"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.conll.2002"></a>CONLL 2002</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.conll.2002.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.converting">Converting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2002.training.spanish">Training with Spanish data</a></span></dt></dl></div> |
| |
| <p> |
| The shared task of CoNLL-2002 is language independent named entity recognition for Spanish and Dutch. |
| </p> |
| <div class="section" title="Getting the data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2002.getting"></a>Getting the data</h4></div></div></div> |
| |
| <p>The data consists of three files per language: one training file and two test files testa and testb. |
| The first test file will be used in the development phase for finding good parameters for the learning system. |
| The second test file will be used for the final evaluation. Currently there are data files available for two languages: |
| Spanish and Dutch. |
| </p> |
| <p> |
| The Spanish data is a collection of news wire articles made available by the Spanish EFE News Agency. The articles are |
| from May 2000. The annotation was carried out by the <a class="ulink" href="http://www.talp.cat/" target="_top">TALP Research Center</a> of the Technical University of Catalonia (UPC) |
| and the <a class="ulink" href="http://clic.ub.edu/" target="_top">Center of Language and Computation (CLiC)</a> of the University of Barcelona (UB), and funded by the European Commission |
| through the NAMIC project (IST-1999-12392). |
| </p> |
| <p> |
| The Dutch data consists of four editions of the Belgian newspaper "De Morgen" of 2000 (June 2, July 1, August 1 and September 1). |
| The data was annotated as a part of the <a class="ulink" href="http://atranos.esat.kuleuven.ac.be/" target="_top">Atranos</a> project at the University of Antwerp. |
| </p> |
| <p> |
| You can find the Spanish files here: |
| <a class="ulink" href="http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html" target="_top">http://www.lsi.upc.edu/~nlp/tools/nerc/nerc.html</a> |
| You must download esp.train.gz and unzip it to obtain the file esp.train. |
| </p> |
| <p> |
| You can find the Dutch files here: |
| <a class="ulink" href="http://www.cnts.ua.ac.be/conll2002/ner.tgz" target="_top">http://www.cnts.ua.ac.be/conll2002/ner.tgz</a> |
| You must extract it, then go to /ner/data/ned.train.gz and unzip that file as well to obtain ned.train. |
| </p> |
| </div> |
| <div class="section" title="Converting the data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2002.converting"></a>Converting the data</h4></div></div></div> |
| |
| <p> |
| The following uses the Spanish data as reference, but the same operations apply to the Dutch data. Just remember to change “-lang es” to “-lang nl” and to use |
| the correct training files. To convert the data to the OpenNLP format: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter conll02 -data esp.train -lang es -types per > es_corpus_train_persons.txt |
| </pre><p> |
| Optionally, you can convert the test samples as well. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter conll02 -data esp.testa -lang es -types per > corpus_testa.txt |
| $ opennlp TokenNameFinderConverter conll02 -data esp.testb -lang es -types per > corpus_testb.txt |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Training with Spanish data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2002.training.spanish"></a>Training with Spanish data</h4></div></div></div> |
| |
| <p> |
| To train the model for the name finder: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer -lang es -encoding utf8 -iterations 500 \ |
| -data es_corpus_train_persons.txt -model es_ner_person.bin |
| |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 264715 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 264715 events to 222660. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 222660 |
| Number of Outcomes: 3 |
| Number of Predicates: 71514 |
| ...done. |
| Computing model parameters ... |
| Performing 500 iterations. |
| 1: ... loglikelihood=-290819.1519958615 0.9689326256540053 |
| 2: ... loglikelihood=-37097.17676455632 0.9689326256540053 |
| 3: ... loglikelihood=-22910.372489660916 0.9706476776911017 |
| 4: ... loglikelihood=-17091.547325669497 0.9777874317662392 |
| 5: ... loglikelihood=-13797.620926769372 0.9833821279489262 |
| 6: ... loglikelihood=-11715.806710780415 0.9867140131839903 |
| 7: ... loglikelihood=-10289.222078246517 0.9886859452618855 |
| 8: ... loglikelihood=-9249.208318314624 0.9902310031543358 |
| 9: ... loglikelihood=-8454.169590899777 0.9913227433277298 |
| 10: ... loglikelihood=-7823.742997451327 0.9921953799369133 |
| 11: ... loglikelihood=-7309.375882641964 0.9928224694482746 |
| 12: ... loglikelihood=-6880.131972149693 0.9932946754056249 |
| 13: ... loglikelihood=-6515.3828767792365 0.993638441342576 |
| 14: ... loglikelihood=-6200.82723154046 0.9939595413935742 |
| 15: ... loglikelihood=-5926.213730444915 0.994269308501596 |
| 16: ... loglikelihood=-5683.9821840753275 0.9945299661900534 |
| 17: ... loglikelihood=-5468.4211798176075 0.9948246227074401 |
| 18: ... loglikelihood=-5275.127017232056 0.9950286156810154 |
| |
| ... cut lots of iterations ... |
| |
| 491: ... loglikelihood=-1174.8485558758211 0.998983812779782 |
| 492: ... loglikelihood=-1173.9971776942477 0.998983812779782 |
| 493: ... loglikelihood=-1173.1482915871768 0.998983812779782 |
| 494: ... loglikelihood=-1172.3018855781158 0.998983812779782 |
| 495: ... loglikelihood=-1171.457947774544 0.998983812779782 |
| 496: ... loglikelihood=-1170.6164663670502 0.998983812779782 |
| 497: ... loglikelihood=-1169.7774296286693 0.998983812779782 |
| 498: ... loglikelihood=-1168.94082591387 0.998983812779782 |
| 499: ... loglikelihood=-1168.1066436580463 0.9989875904274408 |
| 500: ... loglikelihood=-1167.2748713765225 0.9989875904274408 |
| Writing name finder model ... done (2.168s) |
| |
| Wrote name finder model to |
| path: .\es_ner_person.bin |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="CONLL 2003"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.conll.2003"></a>CONLL 2003</h3></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.conll.2003.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.training.english">Training with English data</a></span></dt><dt><span class="section"><a href="#tools.corpora.conll.2003.evaluation.english">Evaluating with English data</a></span></dt></dl></div> |
| |
| <p> |
| The shared task of CoNLL-2003 is language-independent named entity recognition, |
| covering English and German. |
| </p> |
| <div class="section" title="Getting the data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2003.getting"></a>Getting the data</h4></div></div></div> |
| |
| <p> |
| The English data is the Reuters Corpus, which is a collection of news wire articles. |
| The Reuters Corpus can be obtained free of charge from NIST for research |
| purposes: <a class="ulink" href="http://trec.nist.gov/data/reuters/reuters.html" target="_top">http://trec.nist.gov/data/reuters/reuters.html</a> |
| </p> |
| <p> |
| The German data is a collection of articles from the German newspaper Frankfurter |
| Rundschau. The articles are part of the ECI Multilingual Text Corpus which |
| can be obtained for $75 (as of 2010) from the Linguistic Data Consortium: |
| <a class="ulink" href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5" target="_top">http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5</a> </p> |
| <p>Once one of the corpora is available, the data must be |
| transformed to the CoNLL format as explained in the README file. |
| The transformed data can then be read by the OpenNLP CONLL03 converter. |
| |
| Note that for CoNLL-2003 corpora, the -lang argument must be "eng" or "deu", not "en" or "de". |
| </p> |
| </div> |
| <div class="section" title="Converting the data (optional)"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2003.converting"></a>Converting the data (optional)</h4></div></div></div> |
| |
| <p> |
| To convert the information to the OpenNLP format: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.train > corpus_train.txt |
| </pre><p> |
| Optionally, you can convert the test samples as well. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.testa > corpus_testa.txt |
| $ opennlp TokenNameFinderConverter conll03 -lang eng -types per -data eng.testb > corpus_testb.txt |
| </pre><p> |
| </p> |
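| <p> |
| The converter emits OpenNLP's name-annotated sentence format: one sentence per line, with |
| entities wrapped in &lt;START:type&gt; ... &lt;END&gt; markers. As a rough illustration of that |
| markup (this sketch is not the real parser; OpenNLP reads such lines with |
| opennlp.tools.namefind.NameSample), a line can be decoded into typed entities like this: |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative decoder for OpenNLP's name-annotated sentence format,
 * where entities are wrapped in <START:type> ... <END> markers.
 * Simplified sketch only; the real parsing is done by
 * opennlp.tools.namefind.NameSample.
 */
public class NameMarkupSketch {

    /** Returns the entities of a line as "type:entity text" strings. */
    public static List<String> entities(String line) {
        List<String> result = new ArrayList<>();
        List<String> current = null;
        String type = null;
        for (String tok : line.trim().split("\\s+")) {
            if (tok.startsWith("<START:") && tok.endsWith(">")) {
                type = tok.substring("<START:".length(), tok.length() - 1);
                current = new ArrayList<>();
            } else if (tok.equals("<END>") && current != null) {
                result.add(type + ":" + String.join(" ", current));
                current = null;
            } else if (current != null) {
                current.add(tok);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<START:per> Pierre Vinken <END> , 61 years old , will join the board .";
        System.out.println(entities(line)); // [per:Pierre Vinken]
    }
}
```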
| </div> |
| <div class="section" title="Training with English data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2003.training.english"></a>Training with English data</h4></div></div></div> |
| |
| <p> |
| You can train the model for the name finder this way: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer.conll03 -model en_ner_person.bin -iterations 500 \ |
| -lang eng -types per -data eng.train -encoding utf8 |
| </pre><p> |
| </p> |
| <p> |
| If you have converted the data, then you can train the model for the name finder this way: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer -model en_ner_person.bin -iterations 500 \ |
| -lang en -data corpus_train.txt -encoding utf8 |
| </pre><p> |
| </p> |
| <p> |
| Either way you should see the following output during the training process: |
| </p><pre class="screen"> |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 203621 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 203621 events to 179409. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 179409 |
| Number of Outcomes: 3 |
| Number of Predicates: 58814 |
| ...done. |
| Computing model parameters... |
| Performing 500 iterations. |
| 1: .. loglikelihood=-223700.5328318588 0.9453494482396216 |
| 2: .. loglikelihood=-40525.939777363084 0.9467933071736215 |
| 3: .. loglikelihood=-24893.98837874921 0.9598518816821447 |
| 4: .. loglikelihood=-18420.3379471033 0.9712996203731442 |
| ... cut lots of iterations ... |
| 498: .. loglikelihood=-952.8501399442295 0.9988950059178572 |
| 499: .. loglikelihood=-952.0600155746948 0.9988950059178572 |
| 500: .. loglikelihood=-951.2722802086295 0.9988950059178572 |
| Writing name finder model ... done (1.638s) |
| |
| Wrote name finder model to |
| path: .\en_ner_person.bin |
| </pre><p> |
| </p> |
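| <p> |
| Each iteration line printed by the trainer shows two numbers: the log-likelihood of the |
| training events under the current model (negative, and moving towards zero as the fit |
| improves) and, as the second column, the accuracy on the training events. The toy sketch |
| below (with made-up per-event probabilities, not values from the run above) shows how such |
| numbers are computed: |
| </p> |

```java
/**
 * Sketch of the two numbers printed per training iteration:
 * the log-likelihood of the training events under the current model
 * (negative, approaching 0 as the fit improves) and the fraction of
 * training events predicted correctly. Toy example with hypothetical
 * per-event probabilities, not part of the OpenNLP API.
 */
public class TrainingStatsSketch {

    /** probs[i] = model probability of the outcome observed for event i. */
    public static double logLikelihood(double[] probs) {
        double ll = 0.0;
        for (double p : probs) {
            ll += Math.log(p);
        }
        return ll;
    }

    /** Fraction of events where the observed outcome was predicted. */
    public static double accuracy(boolean[] predictedCorrectly) {
        int correct = 0;
        for (boolean ok : predictedCorrectly) {
            if (ok) correct++;
        }
        return correct / (double) predictedCorrectly.length;
    }

    public static void main(String[] args) {
        double[] probs = {0.9, 0.8, 0.6, 0.95};
        boolean[] correct = {true, true, false, true};
        System.out.println(logLikelihood(probs) + " " + accuracy(correct));
    }
}
```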
| </div> |
| <div class="section" title="Evaluating with English data"><div class="titlepage"><div><div><h4 class="title"><a name="tools.corpora.conll.2003.evaluation.english"></a>Evaluating with English data</h4></div></div></div> |
| |
| <p> |
| You can evaluate the model for the name finder this way: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderEvaluator.conll03 -model en_ner_person.bin \ |
| -lang eng -types per -data eng.testa -encoding utf8 |
| </pre><p> |
| </p> |
| <p> |
| If you converted the test A and B files above, you can use them to evaluate the |
| model. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderEvaluator -model en_ner_person.bin -lang en -data corpus_testa.txt \ |
| -encoding utf8 |
| </pre><p> |
| </p> |
| <p> |
| Either way you should see the following output: |
| </p><pre class="screen"> |
| |
| Loading Token Name Finder model ... done (0.359s) |
| current: 190.2 sent/s avg: 190.2 sent/s total: 199 sent |
| current: 648.3 sent/s avg: 415.9 sent/s total: 850 sent |
| current: 530.1 sent/s avg: 453.6 sent/s total: 1380 sent |
| current: 793.8 sent/s avg: 539.0 sent/s total: 2178 sent |
| current: 705.4 sent/s avg: 571.9 sent/s total: 2882 sent |
| |
| |
| Average: 569.4 sent/s |
| Total: 3251 sent |
| Runtime: 5.71s |
| |
| Precision: 0.9366247297154147 |
| Recall: 0.739956568946797 |
| F-Measure: 0.8267557582133971 |
| </pre><p> |
| </p> |
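| <p> |
| The reported F-Measure is the harmonic mean of the reported precision and recall, |
| F1 = 2 * P * R / (P + R), so the three numbers above can be cross-checked against each other: |
| </p> |

```java
/**
 * F-measure as reported by the evaluator: the harmonic mean of
 * precision and recall, F1 = 2 * P * R / (P + R).
 */
public class FMeasureSketch {

    public static double f1(double precision, double recall) {
        if (precision + recall == 0.0) {
            return 0.0;
        }
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Precision and recall from the evaluation run above.
        double p = 0.9366247297154147;
        double r = 0.739956568946797;
        System.out.println(f1(p, r)); // matches the reported F-Measure up to rounding
    }
}
```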
| </div> |
| </div> |
| </div> |
| <div class="section" title="Arvores Deitadas"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.corpora.arvores-deitadas"></a>Arvores Deitadas</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.getting">Getting the data</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.converting">Converting the data (optional)</a></span></dt><dt><span class="section"><a href="#tools.corpora.arvores-deitadas.evaluation">Training and Evaluation</a></span></dt></dl></div> |
| |
| <p> |
| The Portuguese corpora available from the <a class="ulink" href="http://www.linguateca.pt" target="_top">Floresta Sintá(c)tica</a> project follow the Arvores Deitadas (AD) format. Apache OpenNLP includes tools to convert from the AD format to the native OpenNLP format. |
| </p> |
| <div class="section" title="Getting the data"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.arvores-deitadas.getting"></a>Getting the data</h3></div></div></div> |
| |
| <p> |
| The Corpus can be downloaded from here: <a class="ulink" href="http://www.linguateca.pt/floresta/corpus.html" target="_top">http://www.linguateca.pt/floresta/corpus.html</a> |
| </p> |
| <p> |
| The Name Finder models were trained using the Amazonia corpus: <a class="ulink" href="http://www.linguateca.pt/floresta/ficheiros/gz/amazonia.ad.gz" target="_top">amazonia.ad</a>. |
| The Chunker models were trained using the <a class="ulink" href="http://www.linguateca.pt/floresta/ficheiros/gz/Bosque_CF_8.0.ad.txt.gz" target="_top">Bosque_CF_8.0.ad</a>. |
| </p> |
| </div> |
| |
| <div class="section" title="Converting the data (optional)"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.arvores-deitadas.converting"></a>Converting the data (optional)</h3></div></div></div> |
| |
| <p> |
| To extract NameFinder training data from Amazonia corpus: |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter ad -lang pt -encoding ISO-8859-1 -data amazonia.ad > corpus.txt |
| </pre><p> |
| </p> |
| <p> |
| To extract Chunker training data from Bosque_CF_8.0.ad corpus: |
| </p><pre class="screen"> |
| |
| $ opennlp ChunkerConverter ad -lang pt -data Bosque_CF_8.0.ad.txt -encoding ISO-8859-1 > bosque-chunk |
| </pre><p> |
| </p> |
| </div> |
| <div class="section" title="Training and Evaluation"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.arvores-deitadas.evaluation"></a>Training and Evaluation</h3></div></div></div> |
| |
| <p> |
| To perform the evaluation, the corpus was split into a training part and a test part. |
| </p><pre class="screen"> |
| |
| $ sed '1,55172d' corpus.txt > corpus_train.txt |
| $ sed '55172,100000000d' corpus.txt > corpus_test.txt |
| </pre><p> |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer -model pt-ner.bin -cutoff 20 -lang pt -data corpus_train.txt -encoding UTF-8 |
| ... |
| $ opennlp TokenNameFinderEvaluator -model pt-ner.bin -lang pt -data corpus_test.txt -encoding UTF-8 |
| |
| Precision: 0.8005071889818507 |
| Recall: 0.7450581122145297 |
| F-Measure: 0.7717879983140168 |
| </pre><p> |
| </p> |
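| <p> |
| The two sed commands above split the converted corpus by line number: lines before 55172 form |
| the test set and lines after it form the training set (line 55172 itself is deleted by both |
| commands). The same split can be sketched in plain Java, with the boundary as a parameter: |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Mirrors the sed-based split above: lines after the boundary go to
 * the training set, lines before it to the test set. Hypothetical
 * helper for illustration; like the sed commands, the boundary line
 * itself ends up in neither file.
 */
public class CorpusSplitSketch {

    /** Returns [trainingLines, testLines] for a 1-based boundary line number. */
    public static List<List<String>> split(List<String> lines, int boundary) {
        List<String> test = new ArrayList<>();
        List<String> train = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            int lineNo = i + 1; // sed numbers lines from 1
            if (lineNo < boundary) {
                test.add(lines.get(i));
            } else if (lineNo > boundary) {
                train.add(lines.get(i));
            }
        }
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            lines.add("line" + i);
        }
        List<List<String>> parts = split(lines, 5);
        System.out.println(parts.get(0).size() + " train / " + parts.get(1).size() + " test");
    }
}
```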
| </div> |
| </div> |
| |
| <div class="section" title="OntoNotes Release 4.0"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.corpora.ontonotes"></a>OntoNotes Release 4.0</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.ontonotes.namefinder">Name Finder Training</a></span></dt></dl></div> |
| |
| <p> |
| "OntoNotes Release 4.0, Linguistic Data Consortium (LDC) catalog number |
| LDC2011T03 and isbn 1-58563-574-X, was developed as part of the |
| OntoNotes project, a collaborative effort between BBN Technologies, |
| the University of Colorado, the University of Pennsylvania and the |
| University of Southern California's Information Sciences Institute. The |
| goal of the project is to annotate a large corpus comprising various |
| genres of text (news, conversational telephone speech, weblogs, usenet |
| newsgroups, broadcast, talk shows) in three languages (English, |
| Chinese, and Arabic) with structural information (syntax and predicate |
| argument structure) and shallow semantics (word sense linked to an |
| ontology and coreference). OntoNotes Release 4.0 is supported by the |
| Defense Advanced Research Projects Agency, GALE Program Contract No. |
| HR0011-06-C-0022. |
| </p> |
| <p> |
| OntoNotes Release 4.0 contains the content of earlier releases -- OntoNotes |
| Release 1.0 LDC2007T21, OntoNotes Release 2.0 LDC2008T04 and OntoNotes |
| Release 3.0 LDC2009T24 -- and adds newswire, broadcast news, broadcast |
| conversation and web data in English and Chinese and newswire data in |
| Arabic. This cumulative publication consists of 2.4 million words as |
| follows: 300k words of Arabic newswire 250k words of Chinese newswire, |
| 250k words of Chinese broadcast news, 150k words of Chinese broadcast |
| conversation and 150k words of Chinese web text and 600k words of |
| English newswire, 200k word of English broadcast news, 200k words of |
| English broadcast conversation and 300k words of English web text. |
| </p> |
| <p> |
| The OntoNotes project builds on two time-tested resources, following the |
| Penn Treebank for syntax and the Penn PropBank for predicate-argument |
| structure. Its semantic representation will include word sense |
| disambiguation for nouns and verbs, with each word sense connected to |
| an ontology, and coreference. The current goals call for annotation of |
| over a million words each of English and Chinese, and half a million |
| words of Arabic over five years." (http://catalog.ldc.upenn.edu/LDC2011T03) |
| </p> |
| <div class="section" title="Name Finder Training"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.ontonotes.namefinder"></a>Name Finder Training</h3></div></div></div> |
| |
| <p> |
| The OntoNotes corpus can be used to train the Name Finder. The corpus |
| contains many different name types; to train a model for one |
| specific type only, the built-in type filter |
| option should be used. |
| </p> |
| <p> |
| The sample shows how to train a model to detect person names. |
| </p><pre class="programlisting"> |
| |
| $ bin/opennlp TokenNameFinderTrainer.ontonotes -lang en -model en-ontonotes.bin \ |
| -nameTypes person -ontoNotesDir ontonotes-release-4.0/data/files/data/english/ |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 1953446 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 1953446 events to 1822037. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 1822037 |
| Number of Outcomes: 3 |
| Number of Predicates: 298263 |
| ...done. |
| Computing model parameters ... |
| Performing 100 iterations. |
| 1: ... loglikelihood=-2146079.7808976253 0.976677625078963 |
| 2: ... loglikelihood=-195016.59754190338 0.976677625078963 |
| ... cut lots of iterations ... |
| 99: ... loglikelihood=-10269.902459614596 0.9987299367374374 |
| 100: ... loglikelihood=-10227.160010853702 0.9987314724850341 |
| Writing name finder model ... done (2.315s) |
| |
| Wrote name finder model to |
| path: /dev/opennlp/trunk/opennlp-tools/en-ontonotes.bin |
| </pre><p> |
| </p> |
| </div> |
| </div> |
| |
| <div class="section" title="Brat Format Support"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.corpora.brat"></a>Brat Format Support</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.corpora.brat.webtool">Sentences and Tokens</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.training">Training</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.evaluation">Evaluation</a></span></dt><dt><span class="section"><a href="#tools.corpora.brat.cross-validation">Cross Validation</a></span></dt></dl></div> |
| |
| <p> |
| The brat annotation tool is an online environment for collaborative text annotation and |
| supports labeling documents with named entities. The best performance of a name finder |
| can only be achieved if it was trained on documents similar to the documents it will |
| process. For that reason it is often necessary to manually label a large number of documents and |
| build a custom corpus. This is where brat comes in handy. |
| |
| </p><table border="0" summary="manufactured viewport for HTML img" cellspacing="0" cellpadding="0" width="585"><tr style="height: 360px"><td><img src="images/brat.png" height="360"></td></tr></table><p> |
| |
| OpenNLP can directly be trained and evaluated on labeled data in the brat format. |
| Instructions on how to use, download and install brat can be found on the project website: |
| |
| <a class="ulink" href="http://brat.nlplab.org" target="_top">http://brat.nlplab.org</a> |
| |
| Configuration of brat, including setting up the different entities and relations can be found at: |
| |
| <a class="ulink" href="http://brat.nlplab.org/configuration.html" target="_top">http://brat.nlplab.org/configuration.html</a> |
| |
| </p> |
| |
| |
| <div class="section" title="Sentences and Tokens"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.brat.webtool"></a>Sentences and Tokens</h3></div></div></div> |
| |
| <p> |
| The brat annotation tool only adds named entity spans to the data and doesn't provide information |
| about tokens and sentences. To train the name finder this information is required. By default it |
| is assumed that each line is a sentence and that tokens are whitespace separated. This can be |
| adjusted by providing a custom sentence detector and optionally also a tokenizer. |
| |
| The opennlp brat command supports the following arguments for providing a custom sentence detector |
| and tokenizer: |
| |
| </p><table border="0" summary="Simple list" class="simplelist"><tr><td><p>-sentenceDetectorModel - your sentence model</p></td></tr><tr><td><p>-tokenizerModel - your tokenizer model</p></td></tr><tr><td><p>-ruleBasedTokenizer - simple | whitespace</p></td></tr></table><p> |
| |
| </p> |
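| <p> |
| The default therefore treats each line as one sentence and derives token spans by splitting on |
| whitespace. A sketch of the character offsets this default produces (illustrative only; |
| OpenNLP's WhitespaceTokenizer computes such spans for real): |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustration of the default brat assumption: one sentence per
 * line, tokens separated by whitespace. Produces character-offset
 * spans ("start-end") per token. Sketch only; OpenNLP ships
 * WhitespaceTokenizer for this purpose.
 */
public class WhitespaceSpansSketch {

    public static List<String> spans(String sentence) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < sentence.length()) {
            if (Character.isWhitespace(sentence.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < sentence.length() && !Character.isWhitespace(sentence.charAt(i))) {
                i++;
            }
            result.add(start + "-" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(spans("Pierre Vinken will join")); // [0-6, 7-13, 14-18, 19-23]
    }
}
```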
| </div> |
| |
| <div class="section" title="Training"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.brat.training"></a>Training</h3></div></div></div> |
| |
| <p> |
| To train your name finder model on brat annotated files you can either use the opennlp command |
| line tool or call the opennlp.tools.cmdline.CLI main class from your preferred IDE. |
| |
| Calling opennlp TokenNameFinderTrainer.brat without arguments gives you a list of all the arguments you can use. |
| Some combinations are not valid; e.g. you should not provide a tokenizer model and also define |
| a rule based tokenizer. |
| |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer.brat |
| Usage: opennlp TokenNameFinderTrainer.brat [-factory factoryName] [-resources resourcesDir] [-type modelType] |
| [-featuregen featuregenFile] [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language |
| -model modelFile [-tokenizerModel modelFile] [-ruleBasedTokenizer name] -annotationConfig annConfFile |
| -bratDataDir bratDataDir [-recursive value] [-sentenceDetectorModel modelFile] |
| |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type modelType |
| The type of the token name finder model |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -tokenizerModel modelFile |
| -ruleBasedTokenizer name |
| -annotationConfig annConfFile |
| -bratDataDir bratDataDir |
| location of brat data dir |
| -recursive value |
| -sentenceDetectorModel modelFile |
| |
| </pre><p> |
| |
| The following command trains a Danish organization name finder model. |
| |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderTrainer.brat -resources conf/resources \ |
| -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \ |
| -params conf/resources/TrainerParams.txt -lang da \ |
| -model models/da-org.bin -ruleBasedTokenizer simple \ |
| -annotationConfig data/annotation.conf -bratDataDir data/gold/da/train \ |
| -recursive true -sentenceDetectorModel models/da-sent.bin |
| |
| Indexing events using cutoff of 0 |
| |
| Computing event counts... |
| done. 620738 events |
| Indexing... done. |
| Collecting events... Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 620738 |
| Number of Outcomes: 3 |
| Number of Predicates: 1403655 |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: . (614536/620738) 0.9900086671027067 |
| 2: . (617590/620738) 0.9949286172265915 |
| 3: . (618615/620738) 0.9965798775006525 |
| 4: . (619263/620738) 0.9976237961909856 |
| 5: . (619509/620738) 0.9980200986567602 |
| 6: . (619830/620738) 0.9985372250450271 |
| 7: . (619968/620738) 0.9987595410624128 |
| 8: . (620110/620738) 0.9989883010223315 |
| 9: . (620200/620738) 0.9991332897293222 |
| 10: . (620266/620738) 0.9992396147811153 |
| 20: . (620538/620738) 0.999677802873354 |
| 30: . (620641/620738) 0.9998437343935767 |
| 40: . (620653/620738) 0.9998630662211755 |
| Stopping: change in training set accuracy less than 1.0E-5 |
| Stats: (620594/620738) 0.9997680180688149 |
| ...done. |
| |
| Writing name finder model ... Training data summary: |
| #Sentences: 26133 |
| #Tokens: 620738 |
| #Organization entities: 13053 |
| |
| Compressed 1403655 parameters to 116378 |
| 4 outcome patterns |
| done (11.099s) |
| |
| Wrote name finder model to |
| path: models/da-org.bin |
| |
| </pre><p> |
| </p> |
| </div> |
| |
| <div class="section" title="Evaluation"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.brat.evaluation"></a>Evaluation</h3></div></div></div> |
| |
| <p> |
| To evaluate your name finder model, OpenNLP provides an evaluator that works with your brat |
| annotated data. Normally you would partition your data into a training set and a test set, e.g. 70% |
| training and 30% test. |
| The training set is of course only used for training the model and should never be used for |
| evaluation. The test set is only used for evaluation. In order to avoid overfitting, it is preferable if the training set and |
| test set are somewhat balanced so that both sets represent a broad variety of the entities |
| the model should be able to identify. Shuffling the data before splitting is most likely sufficient in many cases. |
| |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderEvaluator.brat -model models/da-org.bin \ |
| -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \ |
| -bratDataDir data/gold/da/test -recursive true \ |
| -sentenceDetectorModel models/da-sent.bin |
| |
| Loading Token Name Finder model ... done (12.395s) |
| |
| Average: 610.7 sent/s |
| Total: 6133 sent |
| Runtime: 10.043s |
| |
| Precision: 0.7321974661424203 |
| Recall: 0.25176505933603727 |
| F-Measure: 0.3746926000447127 |
| |
| |
| </pre><p> |
| </p> |
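| <p> |
| The shuffle-then-split strategy described above can be sketched as follows; the fraction and |
| the seed are hypothetical choices, and the fixed seed only keeps the split reproducible: |
| </p> |

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/**
 * Sketch of the shuffle-then-split strategy described above:
 * shuffle the documents with a fixed seed for reproducibility, then
 * take the first 70% for training and the rest for testing.
 * Hypothetical helper, not part of the OpenNLP command line.
 */
public class TrainTestSplitSketch {

    /** Returns [trainingDocs, testDocs]. */
    public static List<List<String>> split(List<String> docs, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(docs);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return List.of(
            new ArrayList<>(shuffled.subList(0, cut)),
            new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i);
        }
        List<List<String>> parts = split(docs, 0.7, 42L);
        System.out.println(parts.get(0).size() + " train / " + parts.get(1).size() + " test");
    }
}
```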
| </div> |
| |
| <div class="section" title="Cross Validation"><div class="titlepage"><div><div><h3 class="title"><a name="tools.corpora.brat.cross-validation"></a>Cross Validation</h3></div></div></div> |
| |
| <p> |
| You can also use cross validation to evaluate your model. This can come in handy when you do |
| not have enough data to divide it into a proper training and test set. |
| Running cross validation with the misclassified attribute set to true can also be helpful because it |
| will identify missed annotations, as they will pop up as false positives in the text output. |
| </p><pre class="screen"> |
| |
| $ opennlp TokenNameFinderCrossValidator.brat -resources conf/resources \ |
| -featuregen conf/resources/fg-da-org.xml -nameTypes Organization \ |
| -params conf/resources/TrainerParams.txt -lang da -misclassified true \ |
| -folds 10 -detailedF true -ruleBasedTokenizer simple -annotationConfig data/annotation.conf \ |
| -bratDataDir data/gold/da -recursive true -sentenceDetectorModel models/da-sent.bin |
| |
| Indexing events using cutoff of 0 |
| |
| Computing event counts... |
| done. 555858 events |
| Indexing... done. |
| Collecting events... Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 555858 |
| Number of Outcomes: 3 |
| Number of Predicates: 1302740 |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: . (550095/555858) 0.9896322442062541 |
| 2: . (552971/555858) 0.9948062274897546 |
| ... |
| ... |
| ... (training and evaluating x 10) |
| ... |
| done |
| |
| Evaluated 26133 samples with 13053 entities; found: 12174 entities; correct: 10361. |
| TOTAL: precision: 85.11%; recall: 79.38%; F1: 82.14%. |
| Organization: precision: 85.11%; recall: 79.38%; F1: 82.14%. [target: 13053; tp: 10361; fp: 1813] |
| |
| |
| |
| </pre><p> |
| </p> |
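| <p> |
| Conceptually, cross validation partitions the corpus into folds and repeats training and |
| evaluation once per fold, holding that fold out as the test set. The bookkeeping can be |
| sketched like this (illustrative only; TokenNameFinderCrossValidator does this internally): |
| </p> |

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of k-fold cross validation bookkeeping: each sample index
 * is assigned to a fold round-robin; fold i is then held out for
 * evaluation while the remaining folds are used for training.
 * Illustrative only; the OpenNLP cross validator handles this
 * internally.
 */
public class CrossValidationSketch {

    public static List<List<Integer>> folds(int sampleCount, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < sampleCount; i++) {
            folds.get(i % k).add(i); // round-robin assignment
        }
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> f = folds(26133, 10);
        int total = 0;
        for (List<Integer> fold : f) {
            total += fold.size(); // every sample lands in exactly one fold
        }
        System.out.println(f.size() + " folds, " + total + " samples");
    }
}
```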
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 14. Machine Learning"><div class="titlepage"><div><div><h2 class="title"><a name="opennlp.ml"></a>Chapter 14. Machine Learning</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#opennlp.ml.maxent">Maximum Entropy</a></span></dt><dd><dl><dt><span class="section"><a href="#opennlp.ml.maxent.impl">Implementation</a></span></dt></dl></dd></dl></div> |
| |
| <div class="section" title="Maximum Entropy"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="opennlp.ml.maxent"></a>Maximum Entropy</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#opennlp.ml.maxent.impl">Implementation</a></span></dt></dl></div> |
| |
| <p> |
| To explain what maximum entropy is, it will be simplest to quote from Manning and Schütze* (p. 589): |
| <span class="quote">“<span class="quote"> |
| Maximum entropy modeling is a framework for integrating information from many heterogeneous |
| information sources for classification. The data for a classification problem is described |
| as a (potentially large) number of features. These features can be quite complex and allow |
| the experimenter to make use of prior knowledge about what types of informations are expected |
| to be important for classification. Each feature corresponds to a constraint on the model. |
| We then compute the maximum entropy model, the model with the maximum entropy of all the models |
| that satisfy the constraints. This term may seem perverse, since we have spent most of the book |
| trying to minimize the (cross) entropy of models, but the idea is that we do not want to go beyond |
| the data. If we chose a model with less entropy, we would add `information' constraints to the |
| model that are not justified by the empirical evidence available to us. Choosing the maximum |
| entropy model is motivated by the desire to preserve as much uncertainty as possible. |
| </span>”</span> |
| </p> |
| <p> |
| So that gives a rough idea of what the maximum entropy framework is. |
| Don't assume anything about your probability distribution other than what you have observed. |
| </p> |
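| <p> |
| The "preserve as much uncertainty as possible" idea can be made concrete with a toy |
| calculation: among all distributions over the same set of outcomes, the uniform distribution |
| has the maximum entropy H(p) = -sum(p_i * log p_i). A small sketch, unrelated to the |
| OpenNLP API: |
| </p> |

```java
/**
 * Toy illustration of the maximum entropy principle: of all
 * distributions over the same outcomes, the uniform one maximizes
 * entropy H(p) = -sum(p_i * log p_i). It commits to nothing beyond
 * the outcome set itself. Not part of the OpenNLP API.
 */
public class EntropySketch {

    public static double entropy(double[] p) {
        double h = 0.0;
        for (double pi : p) {
            if (pi > 0.0) {
                h -= pi * Math.log(pi);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] uniform = {0.25, 0.25, 0.25, 0.25};
        double[] skewed  = {0.70, 0.10, 0.10, 0.10};
        // The skewed distribution assumes more than the data may justify,
        // so its entropy is strictly lower.
        System.out.println(entropy(uniform) > entropy(skewed)); // true
    }
}
```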
| <p> |
| On the engineering level, using maxent is an excellent way of creating programs which perform |
| very difficult classification tasks very well. For example, precision and recall figures for |
| programs using maxent models have reached (or are at) the state of the art on tasks like part of |
| speech tagging, sentence detection, prepositional phrase attachment, and named entity recognition. |
| An added benefit is that the person creating a maxent model only needs |
| to inform the training procedure of the event space, and need not worry about independence between |
| features. |
| </p> |
| <p> |
| While the authors of this implementation of maximum entropy are generally interested in using |
| maxent models in natural language processing, the framework is certainly quite general and |
| useful for a much wider variety of fields. In fact, maximum entropy modeling was originally |
| developed for statistical physics. |
| </p> |
| <p> |
| For a very in-depth discussion of how maxent can be used in natural language processing, |
| try reading Adwait Ratnaparkhi's dissertation. Also, check out Berger, Della Pietra, |
| and Della Pietra's paper A Maximum Entropy Approach to Natural Language Processing, which |
| provides an excellent introduction and discussion of the framework. |
| </p> |
| <p> |
| *Foundations of Statistical Natural Language Processing. Christopher D. Manning, Hinrich Schütze. |
| Cambridge, Mass.: MIT Press, 1999. |
| </p> |
| <div class="section" title="Implementation"><div class="titlepage"><div><div><h3 class="title"><a name="opennlp.ml.maxent.impl"></a>Implementation</h3></div></div></div> |
| |
| <p> |
| We have tried to make the opennlp.maxent implementation easy to use. To create a model, one |
| needs (of course) the training data, and then implementations of two interfaces in the |
| opennlp.maxent package, EventStream and ContextGenerator. These have fairly simple specifications, |
| and example implementations can be found in the OpenNLP Tools preprocessing components. |
| </p> |
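| <p> |
| The division of labor between the two interfaces can be sketched as follows: the context |
| generator maps a raw sample to its predictive features, and the event stream pairs those |
| features with the observed outcome. The types below are simplified stand-ins for |
| illustration and do not reproduce the exact opennlp.maxent signatures: |
| </p> |

```java
import java.util.Arrays;
import java.util.Iterator;

/**
 * Simplified stand-ins for the two roles described above; the real
 * opennlp.maxent interfaces differ in detail. A context generator
 * maps a raw sample to predictive features, and an event stream
 * delivers (outcome, features) pairs to the trainer.
 */
public class MaxentInterfacesSketch {

    /** A training event: the observed outcome plus its features. */
    public static final class Event {
        final String outcome;
        final String[] context;
        Event(String outcome, String[] context) {
            this.outcome = outcome;
            this.context = context;
        }
    }

    /** Maps a raw sample to its predictive features (the "context"). */
    public interface ContextGenerator<T> {
        String[] getContext(T sample);
    }

    /** Streams training events to the trainer, one at a time. */
    public interface EventStream extends Iterator<Event> { }

    public static void main(String[] args) {
        // Hypothetical sample type: a single word to be classified.
        ContextGenerator<String> cg = word -> new String[] {
            "w=" + word.toLowerCase(),
            "initialCap=" + Character.isUpperCase(word.charAt(0))
        };
        System.out.println(Arrays.toString(cg.getContext("Vinken"))); // [w=vinken, initialCap=true]
    }
}
```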
| <p> |
| We have also set in place some interfaces and code to make it easier to automate the training |
| and evaluation process (the Evalable interface and the TrainEval class). It is not necessary |
| to use this functionality, but if you do you'll find it much easier to see how well your models |
| are doing. The opennlp.grok.preprocess.namefind package is an example of a maximum entropy |
| component which uses this functionality. |
| </p> |
| <p> |
| We have managed to use several techniques to reduce the size of the models when writing them to |
| disk, which also means that reading in a model for use is much quicker than with less compact |
| encodings of the model. This was especially important to us since we use many maxent models in |
| the Grok library, and we wanted the start up time and the physical size of the library to be as |
| minimal as possible. As of version 1.2.0, maxent has an io package which greatly simplifies the |
| process of loading and saving models in different formats. |
| </p> |
| </div> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 15. UIMA Integration"><div class="titlepage"><div><div><h2 class="title"><a name="org.apche.opennlp.uima"></a>Chapter 15. UIMA Integration</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#org.apche.opennlp.running-pear-sample">Running the pear sample in CVD</a></span></dt><dt><span class="section"><a href="#org.apche.opennlp.further-help">Further Help</a></span></dt></dl></div> |
| |
| <p> |
| The UIMA Integration wraps the OpenNLP components in UIMA Analysis Engines which can |
| be used to automatically annotate text and train new OpenNLP models from annotated text. |
| </p> |
| <div class="section" title="Running the pear sample in CVD"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="org.apche.opennlp.running-pear-sample"></a>Running the pear sample in CVD</h2></div></div></div> |
| |
| <p> |
| The CAS Visual Debugger (CVD) is shipped as part of the UIMA distribution and is a tool which can run |
| the OpenNLP UIMA Annotators and display their analysis results. The source distribution comes with a script |
| which can create a sample UIMA application that includes the sentence detector, tokenizer, |
| POS tagger, chunker and name finders for English. This sample application is packaged in the |
| pear format and must be installed with the pear installer before it can be run by CVD. |
| Please consult the UIMA documentation for further information about the pear installer. |
| </p> |
| <p> |
| The OpenNLP UIMA pear file must be built manually. |
| First download the source distribution, unzip it and go to the apache-opennlp/opennlp folder. |
| Type "mvn install" to build everything. To build the pear file, go to apache-opennlp/opennlp-uima |
| and build it as shown below. Note that the models will be downloaded |
| from the old SourceForge repository and are not licensed under the AL 2.0. |
| </p><pre class="screen"> |
| |
| $ ant -f createPear.xml |
| Buildfile: createPear.xml |
| |
| createPear: |
| [echo] ##### Creating OpenNlpTextAnalyzer pear ##### |
| [copy] Copying 13 files to OpenNlpTextAnalyzer/desc |
| [copy] Copying 1 file to OpenNlpTextAnalyzer/metadata |
| [copy] Copying 1 file to OpenNlpTextAnalyzer/lib |
| [copy] Copying 3 files to OpenNlpTextAnalyzer/lib |
| [mkdir] Created dir: OpenNlpTextAnalyzer/models |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-token.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-token.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-sent.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-sent.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-date.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-location.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-money.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-money.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-organization.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-organization.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-percentage.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-percentage.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-person.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-ner-time.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-ner-time.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-pos-maxent.bin |
| [get] Getting: http://opennlp.sourceforge.net/models-1.5/en-chunker.bin |
| [get] To: OpenNlpTextAnalyzer/models/en-chunker.bin |
| [zip] Building zip: OpenNlpTextAnalyzer.pear |
| |
| BUILD SUCCESSFUL |
| Total time: 3 minutes 20 seconds |
| </pre><p> |
| </p> |
| <p> |
| After the pear is installed, start the CAS Visual Debugger shipped with the UIMA framework
| and click on Tools -> Load AE. Then select the opennlp.uima.OpenNlpTextAnalyzer_pear.xml
| file in the file dialog. Now enter some text and start the analysis engine with
| "Run -> Run OpenNLPTextAnalyzer". The results will then be displayed.
| You should see sentences, tokens, chunks, POS tags, and maybe some names. Remember that the
| input text must be written in English.
| </p> |
| </div> |
| <div class="section" title="Further Help"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="org.apche.opennlp.further-help"></a>Further Help</h2></div></div></div> |
| |
| <p> |
| For more information about how to use the integration, please consult the javadoc of the individual
| Analysis Engines and check out the included XML descriptors.
| </p> |
| <p> |
| TODO: Extend this documentation with information about the individual components. |
| If you want to contribute please contact us on the mailing list |
| or comment on the jira issue <a class="ulink" href="https://issues.apache.org/jira/browse/OPENNLP-49" target="_top">OPENNLP-49</a>. |
| </p> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 16. Morfologik Addon"><div class="titlepage"><div><div><h2 class="title"><a name="tools.morfologik-addon"></a>Chapter 16. Morfologik Addon</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.morfologik-addon.api">Morfologik Integration</a></span></dt><dt><span class="section"><a href="#tools.morfologik-addon.cmdline">Morfologik CLI Tools</a></span></dt></dl></div> |
| |
| <p> |
| <a class="ulink" href="https://github.com/morfologik/morfologik-stemming" target="_top"><em class="citetitle">Morfologik</em></a> |
| provides tools for finite state automata (FSA) construction and FSA-based morphological dictionaries.
| </p> |
| <p> |
| The Morfologik Addon implements OpenNLP interfaces and extensions to allow the use of Morfologik's FSA dictionary tools.
| </p> |
| <div class="section" title="Morfologik Integration"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.morfologik-addon.api"></a>Morfologik Integration</h2></div></div></div> |
| |
| <p> |
| To allow easy integration with OpenNLP, the following implementations are provided:
| </p><div class="itemizedlist"><ul class="itemizedlist" type="opencircle"><li class="listitem" style="list-style-type: circle"> |
| <p> |
| The <code class="code">MorfologikPOSTaggerFactory</code> extends <code class="code">POSTaggerFactory</code> and helps create a POS tagger model with an embedded FSA TagDictionary.
| </p> |
| </li><li class="listitem" style="list-style-type: circle"> |
| <p> |
| The <code class="code">MorfologikTagDictionary</code> implements an FSA-based <code class="code">TagDictionary</code>, allowing for much smaller files than the default XML-based dictionary and lower memory consumption.
| </p> |
| </li><li class="listitem" style="list-style-type: circle"> |
| <p> |
| The <code class="code">MorfologikLemmatizer</code> implements an FSA-based <code class="code">Lemmatizer</code> backed by Morfologik dictionaries.
| </p> |
| </li></ul></div><p> |
| </p> |
| <p> |
| The first two implementations can be used directly from the command line, as in the example below. Given an FSA Morfologik dictionary (see the next section for how to build one), you can train a POS tagger
| model with an embedded FSA dictionary.
| </p> |
| <p> |
| The example trains a POS tagger with a CoNLL corpus named <code class="code">portuguese_bosque_train.conll</code> and an FSA dictionary named
| <code class="code">pt-morfologik.dict</code>. It will output a model named <code class="code">pos-pt_fsadic.model</code>.
| |
| </p><pre class="screen"> |
| |
| $ bin/opennlp POSTaggerTrainer -type perceptron -lang pt -model pos-pt_fsadic.model -data portuguese_bosque_train.conll \ |
| -encoding UTF-8 -factory opennlp.morfologik.tagdict.MorfologikPOSTaggerFactory -dict pt-morfologik.dict |
| </pre><p> |
| |
| </p> |
| <p> |
| The next example shows how to use the <code class="code">MorfologikLemmatizer</code>. You will need a lemma dictionary and an info file; in this example, we will use a very small Portuguese dictionary.
| Its syntax is <code class="code">lemma,lexeme,postag</code>.
| </p> |
| <p> |
| File <code class="code">lemmaDictionary.txt</code>:
| </p><pre class="screen"> |
| |
| casa,casa,NOUN |
| casar,casa,V |
| casar,casar,V-INF |
| Casa,Casa,PROP |
| casa,casinha,NOUN |
| casa,casona,NOUN |
| menino,menina,NOUN |
| menino,menino,NOUN |
| menino,meninão,NOUN |
| menino,menininho,NOUN |
| carro,carro,NOUN |
| </pre><p> |
| </p> |
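<p>
To make the <code class="code">lemma,lexeme,postag</code> syntax concrete, the following self-contained sketch (plain Java, no Morfologik classes involved; the class and method names are illustrative only) indexes such entries by inflected form and tag. This is conceptually the lookup that the compiled FSA dictionary performs for each (token, tag) pair:
</p>

```java
import java.util.HashMap;
import java.util.Map;

public class LemmaLookupDemo {

    /** Index tabular entries by "lexeme|postag" -> lemma. */
    static Map<String, String> buildDict(String[] entries) {
        Map<String, String> dict = new HashMap<>();
        for (String entry : entries) {
            // cols[0] = lemma, cols[1] = lexeme (inflected form), cols[2] = postag
            String[] cols = entry.split(",");
            dict.put(cols[1] + "|" + cols[2], cols[0]);
        }
        return dict;
    }

    public static void main(String[] args) {
        String[] entries = {
            "casa,casa,NOUN",
            "casar,casa,V",
            "casa,casinha,NOUN",
            "menino,menina,NOUN"
        };
        Map<String, String> dict = buildDict(entries);

        // The same surface form "casa" resolves to a different lemma per tag.
        System.out.println(dict.get("casa|NOUN")); // casa
        System.out.println(dict.get("casa|V"));    // casar
    }
}
```

<p>
Note that the real dictionary stores this mapping as a compact finite state automaton rather than a hash map, which is what makes the compiled files so small.
</p>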
| <p> |
| The mandatory metadata file, which must have the same name but a .info extension, <code class="code">lemmaDictionary.info</code>:
| </p><pre class="screen"> |
| |
| # |
| # REQUIRED PROPERTIES |
| # |
| |
| # Column (lemma, inflected, tag) separator. This must be a single byte in the target encoding. |
| fsa.dict.separator=, |
| |
| # The charset in which the input is encoded. UTF-8 is strongly recommended. |
| fsa.dict.encoding=UTF-8 |
| |
| # The type of lemma-inflected form encoding compression that precedes automaton |
| # construction. Allowed values: [suffix, infix, prefix, none]. |
| # Details are in Daciuk's paper and in the code. |
| # Leave at 'prefix' if not sure. |
| fsa.dict.encoder=prefix |
| |
| </pre><p> |
| </p> |
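<p>
The .info metadata file uses standard Java properties syntax (key=value pairs, # comments), so it can be sanity-checked with <code class="code">java.util.Properties</code> before compiling the dictionary. A minimal sketch (the class name is illustrative):
</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class DictionaryInfoDemo {

    /** Parse .info metadata, which uses standard Java properties syntax. */
    static Properties loadInfo(String content) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(content));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // The same required keys as the lemmaDictionary.info example above.
        String info = "fsa.dict.separator=,\n"
                + "fsa.dict.encoding=UTF-8\n"
                + "fsa.dict.encoder=prefix\n";

        Properties props = loadInfo(info);

        // The separator must be a single byte in the target encoding.
        System.out.println(props.getProperty("fsa.dict.separator")); // ,
        System.out.println(props.getProperty("fsa.dict.encoding"));  // UTF-8
        System.out.println(props.getProperty("fsa.dict.encoder"));   // prefix
    }
}
```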
| <p> |
| The following code creates a binary FSA Morfologik dictionary, loads it into a MorfologikLemmatizer, and uses it to
| find the lemmas of the word "casa" as a noun and as a verb.
| |
| </p><pre class="programlisting"> |
| |
| <i class="hl-comment" style="color: silver">// Part 1: compile a FSA lemma dictionary </i> |
| |
| <i class="hl-comment" style="color: silver">// we need the tabular dictionary. It is mandatory to have info </i> |
| <i class="hl-comment" style="color: silver">// file with same name, but .info extension</i> |
Path textLemmaDictionary = Paths.get(<b class="hl-string"><i style="color:red">"lemmaDictionary.txt"</i></b>);
| |
| <i class="hl-comment" style="color: silver">// this will build a binary dictionary located in compiledLemmaDictionary</i> |
| Path compiledLemmaDictionary = <b class="hl-keyword">new</b> MorfologikDictionayBuilder() |
| .build(textLemmaDictionary); |
| |
| <i class="hl-comment" style="color: silver">// Part 2: load a MorfologikLemmatizer and use it</i> |
| MorfologikLemmatizer lemmatizer = <b class="hl-keyword">new</b> MorfologikLemmatizer(compiledLemmaDictionary); |
| |
| String[] toks = {<b class="hl-string"><i style="color:red">"casa"</i></b>, <b class="hl-string"><i style="color:red">"casa"</i></b>}; |
| String[] tags = {<b class="hl-string"><i style="color:red">"NOUN"</i></b>, <b class="hl-string"><i style="color:red">"V"</i></b>}; |
| |
| String[] lemmas = lemmatizer.lemmatize(toks, tags); |
| System.out.println(Arrays.toString(lemmas)); <i class="hl-comment" style="color: silver">// outputs [casa, casar]</i> |
| |
| </pre><p> |
| |
| </p> |
| </div> |
| <div class="section" title="Morfologik CLI Tools"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.morfologik-addon.cmdline"></a>Morfologik CLI Tools</h2></div></div></div> |
| |
| <p> |
| The Morfologik addon provides command line tools. <code class="code">XMLDictionaryToTable</code> makes it easy to convert an OpenNLP XML-based dictionary
| to a tabular format, and <code class="code">MorfologikDictionaryBuilder</code> can take a tabular dictionary and output a binary Morfologik FSA dictionary.
| </p> |
| <pre class="screen"> |
| |
| $ sh bin/morfologik-addon |
| OpenNLP Morfologik Addon. Usage: opennlp-morfologik-addon TOOL |
| where TOOL is one of: |
| MorfologikDictionaryBuilder builds a binary POS Dictionary using Morfologik |
| XMLDictionaryToTable reads an OpenNLP XML tag dictionary and outputs it in a tabular file |
| All tools print help when invoked with help parameter |
| Example: opennlp-morfologik-addon POSDictionaryBuilder help |
| |
| </pre> |
| </div> |
| </div> |
| <div class="chapter" title="Chapter 17. The Command Line Interface"><div class="titlepage"><div><div><h2 class="title"><a name="tools.cli"></a>Chapter 17. The Command Line Interface</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="section"><a href="#tools.cli.doccat">Doccat</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.doccat.Doccat">Doccat</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatTrainer">DoccatTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatEvaluator">DoccatEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatCrossValidator">DoccatCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatConverter">DoccatConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.langdetect">Langdetect</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetector">LanguageDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorTrainer">LanguageDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorConverter">LanguageDetectorConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorCrossValidator">LanguageDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorEvaluator">LanguageDetectorEvaluator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.dictionary">Dictionary</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.dictionary.DictionaryBuilder">DictionaryBuilder</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.tokenizer">Tokenizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.tokenizer.SimpleTokenizer">SimpleTokenizer</a></span></dt><dt><span class="section"><a 
href="#tools.cli.tokenizer.TokenizerME">TokenizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerTrainer">TokenizerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerMEEvaluator">TokenizerMEEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerCrossValidator">TokenizerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerConverter">TokenizerConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.DictionaryDetokenizer">DictionaryDetokenizer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.sentdetect">Sentdetect</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetector">SentenceDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorTrainer">SentenceDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorEvaluator">SentenceDetectorEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorCrossValidator">SentenceDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorConverter">SentenceDetectorConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.namefind">Namefind</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinder">TokenNameFinder</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderTrainer">TokenNameFinderTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderEvaluator">TokenNameFinderEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderCrossValidator">TokenNameFinderCrossValidator</a></span></dt><dt><span class="section"><a 
href="#tools.cli.namefind.TokenNameFinderConverter">TokenNameFinderConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.CensusDictionaryCreator">CensusDictionaryCreator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.postag">Postag</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.postag.POSTagger">POSTagger</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerTrainer">POSTaggerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerEvaluator">POSTaggerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerCrossValidator">POSTaggerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerConverter">POSTaggerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.lemmatizer">Lemmatizer</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerME">LemmatizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerTrainerME">LemmatizerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerEvaluator">LemmatizerEvaluator</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.chunker">Chunker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.chunker.ChunkerME">ChunkerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerTrainerME">ChunkerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerEvaluator">ChunkerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerCrossValidator">ChunkerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerConverter">ChunkerConverter</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.parser">Parser</a></span></dt><dd><dl><dt><span class="section"><a 
href="#tools.cli.parser.Parser">Parser</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserTrainer">ParserTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserEvaluator">ParserEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserConverter">ParserConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.BuildModelUpdater">BuildModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.CheckModelUpdater">CheckModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.TaggerModelReplacer">TaggerModelReplacer</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.entitylinker">Entitylinker</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.entitylinker.EntityLinker">EntityLinker</a></span></dt></dl></dd><dt><span class="section"><a href="#tools.cli.languagemodel">Languagemodel</a></span></dt><dd><dl><dt><span class="section"><a href="#tools.cli.languagemodel.NGramLanguageModel">NGramLanguageModel</a></span></dt></dl></dd></dl></div> |
| |
| |
| |
| <p>This section details the available tools and parameters of the Command Line Interface. For an introduction to its usage, please refer to <a class="xref" href="#intro.cli" title="Command line interface (CLI)">the section called “Command line interface (CLI)”</a>. </p>
| |
| <div class="section" title="Doccat"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.doccat"></a>Doccat</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.doccat.Doccat">Doccat</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatTrainer">DoccatTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatEvaluator">DoccatEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatCrossValidator">DoccatCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.doccat.DoccatConverter">DoccatConverter</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="Doccat"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.doccat.Doccat"></a>Doccat</h3></div></div></div> |
| |
| |
| |
| <p>Learned document categorizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp Doccat model < documents |
| |
| |
| </pre> |
| </div> |
| |
| <div class="section" title="DoccatTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.doccat.DoccatTrainer"></a>DoccatTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable document categorizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-featureGenerators fg] [-tokenizer tokenizer] |
| [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of DoccatFactory where to get implementation and resources. |
| -featureGenerators fg |
| Comma separated feature generator classes. Bag of words is used if not specified. |
| -tokenizer tokenizer |
| Tokenizer implementation. WhitespaceTokenizer is used if not specified. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="DoccatEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.doccat.DoccatEvaluator"></a>DoccatEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the Doccat model with the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DoccatEvaluator[.leipzig] -model model [-misclassified true|false] [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="DoccatCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.doccat.DoccatCrossValidator"></a>DoccatCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the learnable Document Categorizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DoccatCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory factoryName] |
| [-featureGenerators fg] [-tokenizer tokenizer] [-params paramsFile] -lang language [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of DoccatFactory where to get implementation and resources. |
| -featureGenerators fg |
| Comma separated feature generator classes. Bag of words is used if not specified. |
| -tokenizer tokenizer |
| Tokenizer implementation. WhitespaceTokenizer is used if not specified. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="DoccatConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.doccat.DoccatConverter"></a>DoccatConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts leipzig data format to native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DoccatConverter help|leipzig [help|options...] |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Langdetect"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.langdetect"></a>Langdetect</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetector">LanguageDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorTrainer">LanguageDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorConverter">LanguageDetectorConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorCrossValidator">LanguageDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.langdetect.LanguageDetectorEvaluator">LanguageDetectorEvaluator</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="LanguageDetector"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.langdetect.LanguageDetector"></a>LanguageDetector</h3></div></div></div> |
| |
| |
| |
| <p>Learned language detector</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LanguageDetector model < documents |
| |
| |
| </pre> |
| </div> |
| |
| <div class="section" title="LanguageDetectorTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.langdetect.LanguageDetectorTrainer"></a>LanguageDetectorTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable language detector</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params paramsFile] [-factory factoryName] |
| -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -factory factoryName |
| A sub-class of LanguageDetectorFactory where to get implementation and resources. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">samplesPerLanguage</td><td align="left">samplesPerLanguage</td><td align="left">No</td><td align="left">Number of samples per language</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="LanguageDetectorConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.langdetect.LanguageDetectorConverter"></a>LanguageDetectorConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts leipzig data format to native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LanguageDetectorConverter help|leipzig [help|options...] |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">samplesPerLanguage</td><td align="left">samplesPerLanguage</td><td align="left">No</td><td align="left">Number of samples per language</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="LanguageDetectorCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.langdetect.LanguageDetectorCrossValidator"></a>LanguageDetectorCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the learnable Language Detector</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LanguageDetectorCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory |
| factoryName] [-params paramsFile] [-reportOutputFile outputFile] -data sampleData [-encoding |
| charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of LanguageDetectorFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">samplesPerLanguage</td><td align="left">samplesPerLanguage</td><td align="left">No</td><td align="left">Number of samples per language</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div>
| |
| </div> |
| |
| <div class="section" title="LanguageDetectorEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.langdetect.LanguageDetectorEvaluator"></a>LanguageDetectorEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the Language Detector model with the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LanguageDetectorEvaluator[.leipzig] -model model [-misclassified true|false] |
| [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">leipzig</td><td align="left">sentencesDir</td><td align="left">sentencesDir</td><td align="left">No</td><td align="left">Dir with Leipzig sentences to be used</td></tr><tr><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">samplesPerLanguage</td><td align="left">samplesPerLanguage</td><td align="left">No</td><td align="left">Number of samples per language</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
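| <p>For example, a trained model can be evaluated against held-out reference data as follows |
| (the model and data file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp LanguageDetectorEvaluator -model langdetect.bin -data langdetect.eval -encoding UTF-8 |
| |
| </pre> |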
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Dictionary"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.dictionary"></a>Dictionary</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.dictionary.DictionaryBuilder">DictionaryBuilder</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="DictionaryBuilder"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.dictionary.DictionaryBuilder"></a>DictionaryBuilder</h3></div></div></div> |
| |
| |
| |
| <p>Builds a new dictionary</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DictionaryBuilder -outputFile out -inputFile in [-encoding charsetName] |
| |
| Arguments description: |
| -outputFile out |
| The dictionary file. |
| -inputFile in |
| Plain file with one entry per line |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
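| <p>For example, a dictionary can be built from a plain text file that contains one entry per line |
| (the file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp DictionaryBuilder -inputFile entries.txt -outputFile my-dictionary.dict -encoding UTF-8 |
| |
| </pre> |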
| </div> |
| |
| </div> |
| |
| <div class="section" title="Tokenizer"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.tokenizer"></a>Tokenizer</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.tokenizer.SimpleTokenizer">SimpleTokenizer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerME">TokenizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerTrainer">TokenizerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerMEEvaluator">TokenizerMEEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerCrossValidator">TokenizerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.TokenizerConverter">TokenizerConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.tokenizer.DictionaryDetokenizer">DictionaryDetokenizer</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="SimpleTokenizer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.SimpleTokenizer"></a>SimpleTokenizer</h3></div></div></div> |
| |
| |
| |
| <p>Character class tokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp SimpleTokenizer < sentences |
| |
| |
| </pre> |
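| <p>The tokenizer reads plain text from standard input, so a file can be tokenized via redirection |
| (the file name is illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp SimpleTokenizer &lt; article.txt |
| |
| </pre> |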
| </div> |
| |
| <div class="section" title="TokenizerME"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.TokenizerME"></a>TokenizerME</h3></div></div></div> |
| |
| |
| |
| <p>Learnable tokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenizerME model < sentences |
| |
| |
| </pre> |
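| <p>For example, text can be tokenized with a trained tokenizer model via redirection (the model |
| and input file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp TokenizerME en-token.bin &lt; article.txt |
| |
| </pre> |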
| </div> |
| |
| <div class="section" title="TokenizerTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.TokenizerTrainer"></a>TokenizerTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable tokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenizerTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] [-factory |
| factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenizerFactory where to get implementation and resources. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllu</td><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
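| <p>For example, an English tokenizer model can be trained from data in the default format as |
| follows (the model and data file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp TokenizerTrainer -lang en -model en-token.bin -data en-token.train -encoding UTF-8 |
| |
| </pre> |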
| |
| </div> |
| |
| <div class="section" title="TokenizerMEEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.TokenizerMEEvaluator"></a>TokenizerMEEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Evaluator for the learnable tokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenizerMEEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] -model |
| model [-misclassified true|false] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllu</td><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
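| <p>For example, a tokenizer model can be evaluated against held-out data in the default format as |
| follows (the model and data file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp TokenizerMEEvaluator -model en-token.bin -data en-token.eval -encoding UTF-8 |
| |
| </pre> |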
| |
| </div> |
| |
| <div class="section" title="TokenizerCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.TokenizerCrossValidator"></a>TokenizerCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the learnable tokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenizerCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] |
| [-misclassified true|false] [-folds num] [-factory factoryName] [-abbDict path] [-alphaNumOpt |
| isAlphaNumOpt] [-params paramsFile] -lang language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of TokenizerFactory where to get implementation and resources. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllu</td><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
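| <p>For example, a 10-fold cross validation over English tokenizer training data in the default |
| format can be run as follows (the data file name is illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp TokenizerCrossValidator -folds 10 -lang en -data en-token.train -encoding UTF-8 |
| |
| </pre> |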
| |
| </div> |
| |
| <div class="section" title="TokenizerConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.TokenizerConverter"></a>TokenizerConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,conllu) to native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenizerConverter help|irishsentencebank|ad|pos|conllx|namefinder|parse|conllu |
| [help|options...] |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllu</td><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
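| <p>For example, a CoNLL-X corpus can be converted by naming the format and passing its arguments; |
| the converted samples are written to standard output, which can be redirected into a training file |
| (the file names below are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp TokenizerConverter conllx -detokenizer latin-detokenizer.xml -data corpus.conllx &gt; en-token.train |
| |
| </pre> |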
| |
| </div> |
| |
| <div class="section" title="DictionaryDetokenizer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.tokenizer.DictionaryDetokenizer"></a>DictionaryDetokenizer</h3></div></div></div> |
| |
| |
| |
| <p>Dictionary-based detokenizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp DictionaryDetokenizer detokenizerDictionary |
| |
| |
| </pre> |
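| <p>For example, assuming tokenized sentences arrive on standard input, the detokenizer can be |
| invoked with a detokenizer dictionary as follows (the file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp DictionaryDetokenizer latin-detokenizer.xml &lt; tokenized.txt |
| |
| </pre> |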
| </div> |
| |
| </div> |
| |
| <div class="section" title="Sentdetect"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.sentdetect"></a>Sentdetect</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetector">SentenceDetector</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorTrainer">SentenceDetectorTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorEvaluator">SentenceDetectorEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorCrossValidator">SentenceDetectorCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.sentdetect.SentenceDetectorConverter">SentenceDetectorConverter</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="SentenceDetector"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.sentdetect.SentenceDetector"></a>SentenceDetector</h3></div></div></div> |
| |
| |
| |
| <p>Learnable sentence detector</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp SentenceDetector model < sentences |
| |
| |
| </pre> |
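| <p>For example, text can be split into sentences with a trained model via redirection (the model |
| and input file names are illustrative):</p> |
| <pre class="screen"> |
| |
| $ opennlp SentenceDetector en-sent.bin &lt; article.txt |
| |
| </pre> |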
| </div> |
| |
| <div class="section" title="SentenceDetectorTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.sentdetect.SentenceDetectorTrainer"></a>SentenceDetectorTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable sentence detector</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp |
| SentenceDetectorTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt] |
| [-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of SentenceDetectorFactory where to get implementation and resources. |
| -eosChars string |
| EOS characters. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">includeTitles</td><td align="left">includeTitles</td><td align="left">Yes</td><td align="left">If true will include sentences marked as headlines.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td 
align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">moses</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if 
absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">letsmt</td><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">Yes</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |

</div>

<div class="section" title="SentenceDetectorEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.sentdetect.SentenceDetectorEvaluator"></a>SentenceDetectorEvaluator</h3></div></div></div>

<p>Evaluator for the learnable sentence detector</p>

<pre class="screen">
Usage: opennlp SentenceDetectorEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
        -model model [-misclassified true|false] -data sampleData [-encoding charsetName]

Arguments description:
        -model model
                the model file to be evaluated.
        -misclassified true|false
                if true will print false negatives and false positives.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
</pre>
<p>The supported formats and arguments are:</p>

| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">includeTitles</td><td align="left">includeTitles</td><td align="left">Yes</td><td align="left">If true will include sentences marked as headlines.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td 
align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">moses</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if 
absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">letsmt</td><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">Yes</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
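<p>For example, an English sentence detector model could be evaluated against held-out data as follows (the model and data file names are illustrative):</p>
<pre class="screen">
$ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8
</pre>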

</div>

<div class="section" title="SentenceDetectorCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.sentdetect.SentenceDetectorCrossValidator"></a>SentenceDetectorCrossValidator</h3></div></div></div>

<p>K-fold cross validator for the learnable sentence detector</p>

<pre class="screen">
Usage: opennlp SentenceDetectorCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
        [-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language
        [-misclassified true|false] [-folds num] -data sampleData [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of SentenceDetectorFactory where to get implementation and resources.
        -eosChars string
                EOS characters.
        -abbDict path
                abbreviation dictionary in XML format.
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -misclassified true|false
                if true will print false negatives and false positives.
        -folds num
                number of folds, default is 10.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
</pre>
<p>The supported formats and arguments are:</p>

| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">includeTitles</td><td align="left">includeTitles</td><td align="left">Yes</td><td align="left">If true will include sentences marked as headlines.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td 
align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">moses</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if 
absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">letsmt</td><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">Yes</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
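<p>For example, a 10-fold cross validation over English training data might be run as follows (the language code and file name are illustrative):</p>
<pre class="screen">
$ opennlp SentenceDetectorCrossValidator -lang eng -folds 10 -data en-sent.train -encoding UTF-8
</pre>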

</div>

<div class="section" title="SentenceDetectorConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.sentdetect.SentenceDetectorConverter"></a>SentenceDetectorConverter</h3></div></div></div>

<p>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,moses,conllu,letsmt) to native OpenNLP format</p>

<pre class="screen">
Usage: opennlp SentenceDetectorConverter help|irishsentencebank|ad|pos|conllx|namefinder|parse|moses|conllu|letsmt [help|options...]
</pre>
<p>The supported formats and arguments are:</p>

| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="2" align="left" valign="middle">irishsentencebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">includeTitles</td><td align="left">includeTitles</td><td align="left">Yes</td><td align="left">If true will include sentences marked as headlines.</td></tr><tr><td rowspan="3" align="left" valign="middle">pos</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllx</td><td 
align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">namefinder</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="3" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">No</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td rowspan="2" align="left" valign="middle">moses</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if 
absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">sentencesPerSample</td><td align="left">sentencesPerSample</td><td align="left">No</td><td align="left">Number of sentences per sample</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">letsmt</td><td align="left">detokenizer</td><td align="left">dictionary</td><td align="left">Yes</td><td align="left">Specifies the file with detokenizer dictionary.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
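<p>For example, CoNLL-U data could be converted to the native format, writing the result to standard output (the file names are illustrative):</p>
<pre class="screen">
$ opennlp SentenceDetectorConverter conllu -sentencesPerSample 1 -data corpus.conllu -encoding UTF-8 &gt; corpus.sent
</pre>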

</div>

</div>

<div class="section" title="Namefind"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.namefind"></a>Namefind</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinder">TokenNameFinder</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderTrainer">TokenNameFinderTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderEvaluator">TokenNameFinderEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderCrossValidator">TokenNameFinderCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.TokenNameFinderConverter">TokenNameFinderConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.namefind.CensusDictionaryCreator">CensusDictionaryCreator</a></span></dt></dl></div>

<div class="section" title="TokenNameFinder"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.TokenNameFinder"></a>TokenNameFinder</h3></div></div></div>

<p>Learnable name finder</p>

<pre class="screen">
Usage: opennlp TokenNameFinder model1 model2 ... modelN &lt; sentences
</pre>
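<p>For example, sentences can be piped from a file to one or more name finder models (the model and input file names are illustrative):</p>
<pre class="screen">
$ opennlp TokenNameFinder en-ner-person.bin &lt; sentences.txt
</pre>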
</div>

<div class="section" title="TokenNameFinderTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.TokenNameFinderTrainer"></a>TokenNameFinderTrainer</h3></div></div></div>

<p>Trainer for the learnable name finder</p>

<pre class="screen">
Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
        [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile]
        [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language -model modelFile
        -data sampleData [-encoding charsetName]

Arguments description:
        -factory factoryName
                A sub-class of TokenNameFinderFactory
        -resources resourcesDir
                The resources directory
        -type modelType
                The type of the token name finder model
        -featuregen featuregenFile
                The feature generator descriptor file
        -nameTypes types
                name types to use for training
        -sequenceCodec codec
                sequence codec used to code name spans
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
</pre>
<p>The supported formats and arguments are:</p>

| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">evalita</td><td align="left">lang</td><td align="left">it</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,gpe</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td rowspan="4" align="left" valign="middle">conll03</td><td align="left">lang</td><td align="left">eng|deu</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">bionlp2004</td><td align="left">types</td><td align="left">DNA,protein,cell_type,cell_line,RNA</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">conll02</td><td align="left">lang</td><td align="left">spa|nld</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">muc6</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="6" align="left" valign="middle">brat</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">ruleBasedTokenizer</td><td align="left">name</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">annotationConfig</td><td align="left">annConfFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">bratDataDir</td><td align="left">bratDataDir</td><td align="left">No</td><td align="left">Location of brat data dir</td></tr><tr><td align="left">recursive</td><td align="left">value</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">sentenceDetectorModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr></tbody></table></div> |
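<p>For example, a person name finder model could be trained from data in the native format as follows (the language code and file names are illustrative):</p>
<pre class="screen">
$ opennlp TokenNameFinderTrainer -lang eng -model en-ner-person.bin -data en-ner-person.train -encoding UTF-8
</pre>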

</div>

<div class="section" title="TokenNameFinderEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.TokenNameFinderEvaluator"></a>TokenNameFinderEvaluator</h3></div></div></div>

<p>Measures the performance of the NameFinder model with the reference data</p>

<pre class="screen">
Usage: opennlp TokenNameFinderEvaluator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
        [-nameTypes types] -model model [-misclassified true|false] [-detailedF true|false]
        [-reportOutputFile outputFile] -data sampleData [-encoding charsetName]

Arguments description:
        -nameTypes types
                name types to use for evaluation
        -model model
                the model file to be evaluated.
        -misclassified true|false
                if true will print false negatives and false positives.
        -detailedF true|false
                if true (default) will print detailed FMeasure results.
        -reportOutputFile outputFile
                the path of the fine-grained report file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
</pre>
<p>The supported formats and arguments are:</p>

| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">evalita</td><td align="left">lang</td><td align="left">it</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,gpe</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td rowspan="4" align="left" valign="middle">conll03</td><td align="left">lang</td><td align="left">eng|deu</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">bionlp2004</td><td align="left">types</td><td align="left">DNA,protein,cell_type,cell_line,RNA</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">conll02</td><td align="left">lang</td><td align="left">spa|nld</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">muc6</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="6" align="left" valign="middle">brat</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">ruleBasedTokenizer</td><td align="left">name</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">annotationConfig</td><td align="left">annConfFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">bratDataDir</td><td align="left">bratDataDir</td><td align="left">No</td><td align="left">Location of brat data dir</td></tr><tr><td align="left">recursive</td><td align="left">value</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">sentenceDetectorModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="TokenNameFinderCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.TokenNameFinderCrossValidator"></a>TokenNameFinderCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the learnable Name Finder</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp |
| TokenNameFinderCrossValidator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] |
| [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] |
| [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language [-misclassified |
| true|false] [-folds num] [-detailedF true|false] [-reportOutputFile outputFile] -data sampleData |
| [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type modelType |
| The type of the token name finder model |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
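| <p>As an illustration, a cross validation run over a corpus in the native OpenNLP format might look like the following (the training file name is a placeholder, not a file shipped with OpenNLP):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp TokenNameFinderCrossValidator -lang en -folds 10 -data en-ner-train.txt -encoding UTF-8 |
| |
| </pre> |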
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">evalita</td><td align="left">lang</td><td align="left">it</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,gpe</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td rowspan="4" align="left" valign="middle">conll03</td><td align="left">lang</td><td align="left">eng|deu</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">bionlp2004</td><td align="left">types</td><td align="left">DNA,protein,cell_type,cell_line,RNA</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">conll02</td><td align="left">lang</td><td align="left">spa|nld</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">muc6</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="6" align="left" valign="middle">brat</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">ruleBasedTokenizer</td><td align="left">name</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">annotationConfig</td><td align="left">annConfFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">bratDataDir</td><td align="left">bratDataDir</td><td align="left">No</td><td align="left">Location of brat data dir</td></tr><tr><td align="left">recursive</td><td align="left">value</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">sentenceDetectorModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="TokenNameFinderConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.TokenNameFinderConverter"></a>TokenNameFinderConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts foreign data formats (evalita, ad, conll03, bionlp2004, conll02, muc6, ontonotes, brat) to the native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll02|muc6|ontonotes|brat |
| [help|options...] |
| |
| </pre> |
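| <p>For example, CoNLL 2003 data could be converted to the native format as sketched below; the input file name is a placeholder, and the converted samples are expected on standard output, so they are redirected to a file:</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp TokenNameFinderConverter conll03 -lang eng -types per,loc,org,misc \ |
|     -data eng.train -encoding UTF-8 &gt; en-ner-train.txt |
| |
| </pre> |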
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="4" align="left" valign="middle">evalita</td><td align="left">lang</td><td align="left">it</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,gpe</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">splitHyphenatedTokens</td><td align="left">split</td><td align="left">Yes</td><td align="left">If true all hyphenated tokens will be separated (default true)</td></tr><tr><td rowspan="4" align="left" valign="middle">conll03</td><td align="left">lang</td><td align="left">eng|deu</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td 
align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">bionlp2004</td><td align="left">types</td><td align="left">DNA,protein,cell_type,cell_line,RNA</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="4" align="left" valign="middle">conll02</td><td align="left">lang</td><td align="left">spa|nld</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">types</td><td align="left">per,loc,org,misc</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="3" align="left" valign="middle">muc6</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td 
align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="6" align="left" valign="middle">brat</td><td align="left">tokenizerModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">ruleBasedTokenizer</td><td align="left">name</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">annotationConfig</td><td align="left">annConfFile</td><td align="left">No</td><td align="left"> </td></tr><tr><td align="left">bratDataDir</td><td align="left">bratDataDir</td><td align="left">No</td><td align="left">Location of brat data dir</td></tr><tr><td align="left">recursive</td><td align="left">value</td><td align="left">Yes</td><td align="left"> </td></tr><tr><td align="left">sentenceDetectorModel</td><td align="left">modelFile</td><td align="left">Yes</td><td align="left"> </td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="CensusDictionaryCreator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.namefind.CensusDictionaryCreator"></a>CensusDictionaryCreator</h3></div></div></div> |
| |
| |
| |
| <p>Converts 1990 US Census names into a dictionary</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp CensusDictionaryCreator [-encoding charsetName] [-lang code] -censusData censusDict -dict dict |
| |
| Arguments description: |
| -encoding charsetName |
| -lang code |
| -censusData censusDict |
| -dict dict |
| |
| |
| </pre> |
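| <p>A sketch of an invocation follows; the census data file and output dictionary names are placeholders chosen for illustration:</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp CensusDictionaryCreator -lang en -censusData dist.all.last -dict census-names.xml |
| |
| </pre> |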
| </div> |
| |
| </div> |
| |
| <div class="section" title="Postag"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.postag"></a>Postag</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.postag.POSTagger">POSTagger</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerTrainer">POSTaggerTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerEvaluator">POSTaggerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerCrossValidator">POSTaggerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.postag.POSTaggerConverter">POSTaggerConverter</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="POSTagger"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.postag.POSTagger"></a>POSTagger</h3></div></div></div> |
| |
| |
| |
| <p>Learnable part-of-speech tagger</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp POSTagger model < sentences |
| |
| |
| </pre> |
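| <p>For example, tagging a file of tokenized sentences with a trained model could look like this (model and input file names are placeholders):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp POSTagger en-pos-maxent.bin &lt; sentences.txt |
| |
| </pre> |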
| </div> |
| |
| <div class="section" title="POSTaggerTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.postag.POSTaggerTrainer"></a>POSTaggerTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trains a model for the part-of-speech tagger</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp POSTaggerTrainer[.ad|.conllx|.parse|.ontonotes|.conllu] [-factory factoryName] [-resources |
| resourcesDir] [-tagDictCutoff tagDictCutoff] [-featuregen featuregenFile] [-dict dictionaryPath] |
| [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of POSTaggerFactory where to get implementation and resources. |
| -resources resourcesDir |
| The resources directory |
| -tagDictCutoff tagDictCutoff |
| TagDictionary cutoff. If specified will create/expand a mutable TagDictionary |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -dict dictionaryPath |
| The XML tag dictionary file |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
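| <p>A minimal training run over native-format data might look like the following sketch (file names are placeholders):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp POSTaggerTrainer -lang en -model en-pos.bin -data en-pos-train.txt -encoding UTF-8 |
| |
| </pre> |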
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">expandME</td><td align="left">expandME</td><td align="left">Yes</td><td align="left">Expand multiword expressions.</td></tr><tr><td align="left">includeFeatures</td><td align="left">includeFeatures</td><td align="left">Yes</td><td align="left">Combine POS Tags with word features, like number and gender.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="2" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left" valign="middle">ontonotes</td><td 
align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">U|x u for unified tags and x for language-specific part-of-speech tags</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="POSTaggerEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.postag.POSTaggerEvaluator"></a>POSTaggerEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the POS tagger model against the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp POSTaggerEvaluator[.ad|.conllx|.parse|.ontonotes|.conllu] -model model [-misclassified |
| true|false] [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
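| <p>For example, evaluating a model against held-out native-format data could be sketched as follows (file names are placeholders):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp POSTaggerEvaluator -model en-pos.bin -data en-pos-test.txt -encoding UTF-8 |
| |
| </pre> |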
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">expandME</td><td align="left">expandME</td><td align="left">Yes</td><td align="left">Expand multiword expressions.</td></tr><tr><td align="left">includeFeatures</td><td align="left">includeFeatures</td><td align="left">Yes</td><td align="left">Combine POS Tags with word features, like number and gender.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="2" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left" valign="middle">ontonotes</td><td 
align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">U|x u for unified tags and x for language-specific part-of-speech tags</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="POSTaggerCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.postag.POSTaggerCrossValidator"></a>POSTaggerCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the learnable POS tagger</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp POSTaggerCrossValidator[.ad|.conllx|.parse|.ontonotes|.conllu] [-misclassified true|false] |
| [-folds num] [-factory factoryName] [-resources resourcesDir] [-tagDictCutoff tagDictCutoff] |
| [-featuregen featuregenFile] [-dict dictionaryPath] [-params paramsFile] -lang language |
| [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of POSTaggerFactory where to get implementation and resources. |
| -resources resourcesDir |
| The resources directory |
| -tagDictCutoff tagDictCutoff |
| TagDictionary cutoff. If specified will create/expand a mutable TagDictionary |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -dict dictionaryPath |
| The XML tag dictionary file |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
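| <p>As an illustration, a 10-fold cross validation over native-format training data might be invoked like this (the data file name is a placeholder):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp POSTaggerCrossValidator -lang en -folds 10 -data en-pos-train.txt -encoding UTF-8 |
| |
| </pre> |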
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">expandME</td><td align="left">expandME</td><td align="left">Yes</td><td align="left">Expand multiword expressions.</td></tr><tr><td align="left">includeFeatures</td><td align="left">includeFeatures</td><td align="left">Yes</td><td align="left">Combine POS Tags with word features, like number and gender.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="2" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left" valign="middle">ontonotes</td><td 
align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">U|x u for unified tags and x for language-specific part-of-speech tags</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="POSTaggerConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.postag.POSTaggerConverter"></a>POSTaggerConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts foreign data formats (ad, conllx, parse, ontonotes, conllu) to the native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes|conllu [help|options...] |
| |
| |
| </pre> |
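| <p>For example, CoNLL-X data could be converted to the native format as sketched below; the input file name is a placeholder, and the converted samples are expected on standard output, so they are redirected to a file:</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp POSTaggerConverter conllx -data train.conllx -encoding UTF-8 &gt; en-pos-train.txt |
| |
| </pre> |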
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">expandME</td><td align="left">expandME</td><td align="left">Yes</td><td align="left">Expand multiword expressions.</td></tr><tr><td align="left">includeFeatures</td><td align="left">includeFeatures</td><td align="left">Yes</td><td align="left">Combine POS Tags with word features, like number and gender.</td></tr><tr><td rowspan="2" align="left" valign="middle">conllx</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td rowspan="2" align="left" valign="middle">parse</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left" valign="middle">ontonotes</td><td 
align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">U|x u for unified tags and x for language-specific part-of-speech tags</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Lemmatizer"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.lemmatizer"></a>Lemmatizer</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerME">LemmatizerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerTrainerME">LemmatizerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.lemmatizer.LemmatizerEvaluator">LemmatizerEvaluator</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="LemmatizerME"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.lemmatizer.LemmatizerME"></a>LemmatizerME</h3></div></div></div> |
| |
| |
| |
| <p>Learnable lemmatizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LemmatizerME model < sentences |
| |
| |
| </pre> |
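| <p>For example, lemmatizing input with a trained model could look like this (model and input file names are placeholders):</p> |
| |
| <pre class="screen"> |
| |
| $ opennlp LemmatizerME en-lemmatizer.bin &lt; sentences.txt |
| |
| </pre> |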
| </div> |
| |
| <div class="section" title="LemmatizerTrainerME"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.lemmatizer.LemmatizerTrainerME"></a>LemmatizerTrainerME</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable lemmatizer</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LemmatizerTrainerME[.conllu] [-factory factoryName] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of LemmatizerFactory from which to get the implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">u|x: u for unified (universal) tags, x for language-specific part-of-speech tags.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
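<p>For example, training from a CoNLL-U corpus might look like this (all file names are placeholders):</p>

```shell
# Train a lemmatizer model from CoNLL-U training data; the .conllu
# suffix selects the CoNLL-U sample format.
opennlp LemmatizerTrainerME.conllu -lang en -model en-lemmatizer.bin \
    -data en-ud-train.conllu -encoding UTF-8
```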
| |
| </div> |
| |
| <div class="section" title="LemmatizerEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.lemmatizer.LemmatizerEvaluator"></a>LemmatizerEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the Lemmatizer model against the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp LemmatizerEvaluator[.conllu] -model model [-misclassified true|false] [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="3" align="left" valign="middle">conllu</td><td align="left">tagset</td><td align="left">tagset</td><td align="left">Yes</td><td align="left">u|x: u for unified (universal) tags, x for language-specific part-of-speech tags.</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Chunker"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.chunker"></a>Chunker</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.chunker.ChunkerME">ChunkerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerTrainerME">ChunkerTrainerME</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerEvaluator">ChunkerEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerCrossValidator">ChunkerCrossValidator</a></span></dt><dt><span class="section"><a href="#tools.cli.chunker.ChunkerConverter">ChunkerConverter</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="ChunkerME"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.chunker.ChunkerME"></a>ChunkerME</h3></div></div></div> |
| |
| |
| |
| <p>Learnable chunker</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ChunkerME model < sentences |
| |
| |
| </pre> |
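<p>ChunkerME expects POS-tagged input, so it is typically run at the end of a pipeline of OpenNLP tools. A sketch with placeholder model and file names:</p>

```shell
# Tokenize, POS-tag, then chunk; the .bin model files are hypothetical
# names for models trained separately.
opennlp SimpleTokenizer < article.txt |
    opennlp POSTagger en-pos-maxent.bin |
    opennlp ChunkerME en-chunker.bin
```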
| </div> |
| |
| <div class="section" title="ChunkerTrainerME"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.chunker.ChunkerTrainerME"></a>ChunkerTrainerME</h3></div></div></div> |
| |
| |
| |
| <p>Trainer for the learnable chunker</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ChunkerTrainerME[.ad] [-factory factoryName] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of ChunkerFactory from which to get the implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">end</td><td align="left">end</td><td align="left">Yes</td><td align="left">Index of last sentence</td></tr><tr><td align="left">start</td><td align="left">start</td><td align="left">Yes</td><td align="left">Index of first sentence</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="ChunkerEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.chunker.ChunkerEvaluator"></a>ChunkerEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the Chunker model against the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] [-detailedF true|false] -data |
| sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">end</td><td align="left">end</td><td align="left">Yes</td><td align="left">Index of last sentence</td></tr><tr><td align="left">start</td><td align="left">start</td><td align="left">Yes</td><td align="left">Index of first sentence</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="ChunkerCrossValidator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.chunker.ChunkerCrossValidator"></a>ChunkerCrossValidator</h3></div></div></div> |
| |
| |
| |
| <p>K-fold cross validator for the chunker</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ChunkerCrossValidator[.ad] [-factory factoryName] [-params paramsFile] -lang language |
| [-misclassified true|false] [-folds num] [-detailedF true|false] -data sampleData [-encoding |
| charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of ChunkerFactory from which to get the implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">end</td><td align="left">end</td><td align="left">Yes</td><td align="left">Index of last sentence</td></tr><tr><td align="left">start</td><td align="left">start</td><td align="left">Yes</td><td align="left">Index of first sentence</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr></tbody></table></div> |
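<p>A cross-validation run over ad-format data might look like this (the corpus file name is a placeholder):</p>

```shell
# 5-fold cross validation on an Arvores Deitadas (ad) format corpus;
# -detailedF additionally prints per-chunk-type F-measure results.
opennlp ChunkerCrossValidator.ad -lang pt -encoding ISO-8859-1 \
    -data corpus.ad -folds 5 -detailedF true
```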
| |
| </div> |
| |
| <div class="section" title="ChunkerConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.chunker.ChunkerConverter"></a>ChunkerConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts the ad data format to the native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ChunkerConverter help|ad [help|options...] |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td rowspan="5" align="left" valign="middle">ad</td><td align="left">encoding</td><td align="left">charsetName</td><td align="left">No</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr><tr><td align="left">lang</td><td align="left">language</td><td align="left">No</td><td align="left">Language which is being processed.</td></tr><tr><td align="left">end</td><td align="left">end</td><td align="left">Yes</td><td align="left">Index of last sentence</td></tr><tr><td align="left">start</td><td align="left">start</td><td align="left">Yes</td><td align="left">Index of first sentence</td></tr><tr><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr></tbody></table></div> |
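<p>Converters write the converted samples to standard output, so a conversion is usually redirected to a file (the names below are placeholders):</p>

```shell
# Convert an ad-format corpus to the native chunker training format.
opennlp ChunkerConverter ad -lang pt -encoding ISO-8859-1 \
    -data corpus.ad > chunker-train.txt
```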
| |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Parser"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.parser"></a>Parser</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.parser.Parser">Parser</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserTrainer">ParserTrainer</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserEvaluator">ParserEvaluator</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.ParserConverter">ParserConverter</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.BuildModelUpdater">BuildModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.CheckModelUpdater">CheckModelUpdater</a></span></dt><dt><span class="section"><a href="#tools.cli.parser.TaggerModelReplacer">TaggerModelReplacer</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="Parser"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.Parser"></a>Parser</h3></div></div></div> |
| |
| |
| |
| <p>Performs full syntactic parsing</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp Parser [-bs n -ap f -k n -tk tok_model] model < sentences |
| -bs n: Use a beam size of n. |
| -ap f: Advance outcomes with at least f% of the probability mass. |
| -k n: Show the top n parses. This will also display their log-probabilities. |
| -tk tok_model: Use the specified tokenizer model to tokenize the sentences. Defaults to a WhitespaceTokenizer. |
| |
| |
| </pre> |
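<p>For example, to show the three best parses per sentence with a wider beam (model and input file names are placeholders):</p>

```shell
# -bs widens the beam search; -k prints the top parses together with
# their log-probabilities.
opennlp Parser -bs 20 -k 3 en-parser-chunking.bin < sentences.txt
```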
| </div> |
| |
| <div class="section" title="ParserTrainer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.ParserTrainer"></a>ParserTrainer</h3></div></div></div> |
| |
| |
| |
| <p>Trains the learnable parser</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ParserTrainer[.ontonotes|.frenchtreebank] [-headRulesSerializerImpl className] -headRules |
| headRulesFile [-parserType CHUNKING|TREEINSERT] [-fun true|false] [-params paramsFile] -lang language |
| -model modelFile [-encoding charsetName] -data sampleData |
| Arguments description: |
| -headRulesSerializerImpl className |
| head rules artifact serializer class name |
| -headRules headRulesFile |
| head rules file. |
| -parserType CHUNKING|TREEINSERT |
| one of CHUNKING or TREEINSERT, default is CHUNKING. |
| -fun true|false |
| Learn to generate function tags. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| -data sampleData |
| data to be used, usually a file name. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="2" align="left" valign="middle">frenchtreebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
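<p>A training invocation sketch for French Treebank data (all paths are placeholders; a head-rules file appropriate for the treebank must be supplied):</p>

```shell
# Train a CHUNKING-type parser; the head rules file determines how
# syntactic heads are selected during training.
opennlp ParserTrainer.frenchtreebank -headRules fr-head-rules.txt \
    -parserType CHUNKING -lang fr -model fr-parser.bin \
    -encoding UTF-8 -data ftb-train.xml
```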
| |
| </div> |
| |
| <div class="section" title="ParserEvaluator"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.ParserEvaluator"></a>ParserEvaluator</h3></div></div></div> |
| |
| |
| |
| <p>Measures the performance of the Parser model against the reference data</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ParserEvaluator[.ontonotes|.frenchtreebank] -model model [-misclassified true|false] -data |
| sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="2" align="left" valign="middle">frenchtreebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="ParserConverter"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.ParserConverter"></a>ParserConverter</h3></div></div></div> |
| |
| |
| |
| <p>Converts foreign data formats (ontonotes, frenchtreebank) to the native OpenNLP format</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp ParserConverter help|ontonotes|frenchtreebank [help|options...] |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="2" align="left" valign="middle">frenchtreebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="BuildModelUpdater"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.BuildModelUpdater"></a>BuildModelUpdater</h3></div></div></div> |
| |
| |
| |
| <p>Trains and updates the build model in a parser model</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp BuildModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang |
| language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="2" align="left" valign="middle">frenchtreebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="CheckModelUpdater"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.CheckModelUpdater"></a>CheckModelUpdater</h3></div></div></div> |
| |
| |
| |
| <p>Trains and updates the check model in a parser model</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp CheckModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang |
| language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| |
| </pre> |
| <p>The supported formats and arguments are:</p> |
| |
| <div class="informaltable"><table border="1"><colgroup><col><col><col><col></colgroup><thead><tr><th align="left">Format</th><th align="left">Argument</th><th align="left">Value</th><th align="left">Optional</th><th align="left">Description</th></tr></thead><tbody><tr><td align="left" valign="middle">ontonotes</td><td align="left">ontoNotesDir</td><td align="left">OntoNotes 4.0 corpus directory</td><td align="left">No</td><td align="left"> </td></tr><tr><td rowspan="2" align="left" valign="middle">frenchtreebank</td><td align="left">data</td><td align="left">sampleData</td><td align="left">No</td><td align="left">Data to be used, usually a file name.</td></tr><tr><td align="left">encoding</td><td align="left">charsetName</td><td align="left">Yes</td><td align="left">Encoding for reading and writing text, if absent the system default is used.</td></tr></tbody></table></div> |
| |
| </div> |
| |
| <div class="section" title="TaggerModelReplacer"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.parser.TaggerModelReplacer"></a>TaggerModelReplacer</h3></div></div></div> |
| |
| |
| |
| <p>Replaces the tagger model in a parser model</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp TaggerModelReplacer parser.model tagger.model |
| |
| |
| </pre> |
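<p>This is useful for swapping a newly trained POS model into an existing parser model without retraining the parser (the file names below are placeholders):</p>

```shell
# Replace the POS tagger model embedded in the parser model with a
# separately trained one.
opennlp TaggerModelReplacer en-parser.bin en-pos-maxent.bin
```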
| </div> |
| |
| </div> |
| |
| <div class="section" title="Entitylinker"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.entitylinker"></a>Entitylinker</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.entitylinker.EntityLinker">EntityLinker</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="EntityLinker"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.entitylinker.EntityLinker"></a>EntityLinker</h3></div></div></div> |
| |
| |
| |
| <p>Links an entity to an external data set</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp EntityLinker model < sentences |
| |
| |
| </pre> |
| </div> |
| |
| </div> |
| |
| <div class="section" title="Languagemodel"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="tools.cli.languagemodel"></a>Languagemodel</h2></div></div></div><div class="toc"><dl><dt><span class="section"><a href="#tools.cli.languagemodel.NGramLanguageModel">NGramLanguageModel</a></span></dt></dl></div> |
| |
| |
| |
| <div class="section" title="NGramLanguageModel"><div class="titlepage"><div><div><h3 class="title"><a name="tools.cli.languagemodel.NGramLanguageModel"></a>NGramLanguageModel</h3></div></div></div> |
| |
| |
| |
| <p>Gives the probability and most probable next token(s) of a sequence of tokens in a language model</p> |
| |
| <pre class="screen"> |
| |
| Usage: opennlp NGramLanguageModel model |
| |
| |
| </pre> |
| </div> |
| |
| </div> |
| |
| |
| |
| </div> |
| </div></body></html> |