<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!-- ## Warning ## This content is autogenerated! Please fix issues in opennlp-tools/src/main/java/opennlp/tools/cmdline/GenerateManualTool.java
and execute the following command in the opennlp-tools folder to update this file:
mvn -e -q exec:java "-Dexec.mainClass=opennlp.tools.cmdline.GenerateManualTool" "-Dexec.args=../opennlp-docs/src/docbkx/cli.xml"
-->
<chapter id='tools.cli'>
<title>The Command Line Interface</title>
<para>This section details the available tools and parameters of the Command Line Interface. For an introduction to its usage, please refer to <xref linkend='intro.cli'/>.</para>
<section id='tools.cli.doccat'>
<title>Doccat</title>
<section id='tools.cli.doccat.Doccat'>
<title>Doccat</title>
<para>Learned document categorizer</para>
<screen>
<![CDATA[
Usage: opennlp Doccat model < documents
]]>
</screen>
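<para>The categorizer reads documents from standard input, one document per line, and writes the predicted category for each. A minimal invocation sketch (the model and file names below are placeholders):</para>
<screen>
<![CDATA[
$ opennlp Doccat en-doccat.bin < documents.txt
]]>
</screen>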
</section>
<section id='tools.cli.doccat.DoccatTrainer'>
<title>DoccatTrainer</title>
<para>Trainer for the learnable document categorizer</para>
<screen>
<![CDATA[
Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-featureGenerators fg] [-tokenizer tokenizer]
[-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of DoccatFactory where to get implementation and resources.
-featureGenerators fg
Comma separated feature generator classes. Bag of words is used if not specified.
-tokenizer tokenizer
Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
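<para>A minimal training invocation using the native format might look as follows; the language code and file names are placeholders:</para>
<screen>
<![CDATA[
$ opennlp DoccatTrainer -lang en -model en-doccat.bin -data en-doccat.train -encoding UTF-8
]]>
</screen>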
</section>
<section id='tools.cli.doccat.DoccatEvaluator'>
<title>DoccatEvaluator</title>
<para>Measures the performance of the Doccat model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp DoccatEvaluator[.leipzig] -model model [-misclassified true|false] [-reportOutputFile
outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
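<para>A typical evaluation run against held-out reference data could look like this (the model and file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp DoccatEvaluator -model en-doccat.bin -data en-doccat.eval -encoding UTF-8
]]>
</screen>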
</section>
<section id='tools.cli.doccat.DoccatCrossValidator'>
<title>DoccatCrossValidator</title>
<para>K-fold cross validator for the learnable Document Categorizer</para>
<screen>
<![CDATA[
Usage: opennlp DoccatCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory factoryName]
[-featureGenerators fg] [-tokenizer tokenizer] [-params paramsFile] -lang language [-reportOutputFile
outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-factory factoryName
A sub-class of DoccatFactory where to get implementation and resources.
-featureGenerators fg
Comma separated feature generator classes. Bag of words is used if not specified.
-tokenizer tokenizer
Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
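<para>For example, a 10-fold cross validation over the training data might be started as shown below (the file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp DoccatCrossValidator -lang en -folds 10 -data en-doccat.train -encoding UTF-8
]]>
</screen>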
</section>
<section id='tools.cli.doccat.DoccatConverter'>
<title>DoccatConverter</title>
<para>Converts leipzig data format to native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp DoccatConverter help|leipzig [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
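<para>As a sketch, a Leipzig corpus directory can be converted and the result redirected to a file; the directory and file names are placeholders, and the converted samples are assumed to be written to standard output:</para>
<screen>
<![CDATA[
$ opennlp DoccatConverter leipzig -sentencesDir leipzig-sentences -encoding UTF-8 > en-doccat.train
]]>
</screen>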
</section>
</section>
<section id='tools.cli.langdetect'>
<title>Langdetect</title>
<section id='tools.cli.langdetect.LanguageDetector'>
<title>LanguageDetector</title>
<para>Learned language detector</para>
<screen>
<![CDATA[
Usage: opennlp LanguageDetector model < documents
]]>
</screen>
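<para>For example, to detect the language of documents read from standard input, one document per line (the model and file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp LanguageDetector langdetect-183.bin < documents.txt
]]>
</screen>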
</section>
<section id='tools.cli.langdetect.LanguageDetectorTrainer'>
<title>LanguageDetectorTrainer</title>
<para>Trainer for the learnable language detector</para>
<screen>
<![CDATA[
Usage: opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params paramsFile] [-factory factoryName]
-data sampleData [-encoding charsetName]
Arguments description:
-model modelFile
output model file.
-params paramsFile
training parameters file.
-factory factoryName
A sub-class of LanguageDetectorFactory where to get implementation and resources.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>samplesPerLanguage</entry>
<entry>samplesPerLanguage</entry>
<entry>No</entry>
<entry>Number of samples per language</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
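<para>A minimal training invocation using the native format might look as follows (the file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp LanguageDetectorTrainer -model langdetect.bin -data langdetect.train -encoding UTF-8
]]>
</screen>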
</section>
<section id='tools.cli.langdetect.LanguageDetectorConverter'>
<title>LanguageDetectorConverter</title>
<para>Converts leipzig data format to native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp LanguageDetectorConverter help|leipzig [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>samplesPerLanguage</entry>
<entry>samplesPerLanguage</entry>
<entry>No</entry>
<entry>Number of samples per language</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.langdetect.LanguageDetectorCrossValidator'>
<title>LanguageDetectorCrossValidator</title>
<para>K-fold cross validator for the learnable Language Detector</para>
<screen>
<![CDATA[
Usage: opennlp LanguageDetectorCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory
factoryName] [-params paramsFile] [-reportOutputFile outputFile] -data sampleData [-encoding
charsetName]
Arguments description:
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-factory factoryName
A sub-class of LanguageDetectorFactory where to get implementation and resources.
-params paramsFile
training parameters file.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>samplesPerLanguage</entry>
<entry>samplesPerLanguage</entry>
<entry>No</entry>
<entry>Number of samples per language</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.langdetect.LanguageDetectorEvaluator'>
<title>LanguageDetectorEvaluator</title>
<para>Measures the performance of the Language Detector model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp LanguageDetectorEvaluator[.leipzig] -model model [-misclassified true|false]
[-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>leipzig</entry>
<entry>sentencesDir</entry>
<entry>sentencesDir</entry>
<entry>No</entry>
<entry>Directory with Leipzig sentences to be used</entry>
</row>
<row>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>samplesPerLanguage</entry>
<entry>samplesPerLanguage</entry>
<entry>No</entry>
<entry>Number of samples per language</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
</section>
<section id='tools.cli.dictionary'>
<title>Dictionary</title>
<section id='tools.cli.dictionary.DictionaryBuilder'>
<title>DictionaryBuilder</title>
<para>Builds a new dictionary</para>
<screen>
<![CDATA[
Usage: opennlp DictionaryBuilder -outputFile out -inputFile in [-encoding charsetName]
Arguments description:
-outputFile out
The dictionary file.
-inputFile in
Plain file with one entry per line
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
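<para>For example, a dictionary can be built from a plain text file containing one entry per line (the file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp DictionaryBuilder -inputFile entries.txt -outputFile entries-dict.xml -encoding UTF-8
]]>
</screen>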
</section>
</section>
<section id='tools.cli.tokenizer'>
<title>Tokenizer</title>
<section id='tools.cli.tokenizer.SimpleTokenizer'>
<title>SimpleTokenizer</title>
<para>Character class tokenizer</para>
<screen>
<![CDATA[
Usage: opennlp SimpleTokenizer < sentences
]]>
</screen>
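<para>The tokenizer reads plain text from standard input and writes the tokenized text to standard output, for example:</para>
<screen>
<![CDATA[
$ echo "Hello, friends!" | opennlp SimpleTokenizer
]]>
</screen>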
</section>
<section id='tools.cli.tokenizer.TokenizerME'>
<title>TokenizerME</title>
<para>Learnable tokenizer</para>
<screen>
<![CDATA[
Usage: opennlp TokenizerME model < sentences
]]>
</screen>
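<para>For example, the learnable tokenizer can be run with a trained token model on sentences read from standard input (the model and file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp TokenizerME en-token.bin < sentences.txt > tokens.txt
]]>
</screen>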
</section>
<section id='tools.cli.tokenizer.TokenizerTrainer'>
<title>TokenizerTrainer</title>
<para>Trainer for the learnable tokenizer</para>
<screen>
<![CDATA[
Usage: opennlp TokenizerTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] [-factory
factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] -lang language -model
modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllu</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
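<para>A minimal training invocation using the native format might look as follows (the file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp TokenizerTrainer -lang en -model en-token.bin -data en-token.train -encoding UTF-8
]]>
</screen>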
</section>
<section id='tools.cli.tokenizer.TokenizerMEEvaluator'>
<title>TokenizerMEEvaluator</title>
<para>Evaluator for the learnable tokenizer</para>
<screen>
<![CDATA[
Usage: opennlp TokenizerMEEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] -model
model [-misclassified true|false] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllu</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.tokenizer.TokenizerCrossValidator'>
<title>TokenizerCrossValidator</title>
<para>K-fold cross validator for the learnable tokenizer</para>
<screen>
<![CDATA[
Usage: opennlp TokenizerCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu]
[-misclassified true|false] [-folds num] [-factory factoryName] [-abbDict path] [-alphaNumOpt
isAlphaNumOpt] [-params paramsFile] -lang language -data sampleData [-encoding charsetName]
Arguments description:
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllu</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.tokenizer.TokenizerConverter'>
<title>TokenizerConverter</title>
<para>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,conllu) to native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp TokenizerConverter help|irishsentencebank|ad|pos|conllx|namefinder|parse|conllu
[help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllu</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.tokenizer.DictionaryDetokenizer'>
<title>DictionaryDetokenizer</title>
<para>Detokenizer which uses a detokenizer dictionary</para>
<screen>
<![CDATA[
Usage: opennlp DictionaryDetokenizer detokenizerDictionary
]]>
</screen>
</section>
</section>
<section id='tools.cli.sentdetect'>
<title>Sentdetect</title>
<section id='tools.cli.sentdetect.SentenceDetector'>
<title>SentenceDetector</title>
<para>Learnable sentence detector</para>
<screen>
<![CDATA[
Usage: opennlp SentenceDetector model < sentences
]]>
</screen>
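<para>For example, to split raw text read from standard input into one sentence per line (the model and file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp SentenceDetector en-sent.bin < article.txt > sentences.txt
]]>
</screen>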
</section>
<section id='tools.cli.sentdetect.SentenceDetectorTrainer'>
<title>SentenceDetectorTrainer</title>
<para>Trainer for the learnable sentence detector</para>
<screen>
<![CDATA[
Usage: opennlp
SentenceDetectorTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
[-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language -model
modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of SentenceDetectorFactory where to get implementation and resources.
-eosChars string
EOS characters.
-abbDict path
abbreviation dictionary in XML format.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>moses</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>letsmt</entry>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>Yes</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
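<para>A minimal training invocation using the native format might look as follows (the file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp SentenceDetectorTrainer -lang en -model en-sent.bin -data en-sent.train -encoding UTF-8
]]>
</screen>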
</section>
<section id='tools.cli.sentdetect.SentenceDetectorEvaluator'>
<title>SentenceDetectorEvaluator</title>
<para>Evaluator for the learnable sentence detector</para>
<screen>
<![CDATA[
Usage: opennlp
SentenceDetectorEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
-model model [-misclassified true|false] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>moses</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>letsmt</entry>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>Yes</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.sentdetect.SentenceDetectorCrossValidator'>
<title>SentenceDetectorCrossValidator</title>
<para>K-fold cross validator for the learnable sentence detector</para>
<screen>
<![CDATA[
Usage: opennlp
SentenceDetectorCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt]
[-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language
[-misclassified true|false] [-folds num] -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of SentenceDetectorFactory where to get implementation and resources.
-eosChars string
EOS characters.
-abbDict path
abbreviation dictionary in XML format.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>moses</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>letsmt</entry>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>Yes</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.sentdetect.SentenceDetectorConverter'>
<title>SentenceDetectorConverter</title>
<para>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,moses,conllu,letsmt) to native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp SentenceDetectorConverter
help|irishsentencebank|ad|pos|conllx|namefinder|parse|moses|conllu|letsmt [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='1' valign='middle'>irishsentencebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>pos</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>namefinder</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>No</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>moses</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>sentencesPerSample</entry>
<entry>sentencesPerSample</entry>
<entry>No</entry>
<entry>Number of sentences per sample</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>letsmt</entry>
<entry>detokenizer</entry>
<entry>dictionary</entry>
<entry>Yes</entry>
<entry>Specifies the file with detokenizer dictionary.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
</section>
<section id='tools.cli.namefind'>
<title>Namefind</title>
<section id='tools.cli.namefind.TokenNameFinder'>
<title>TokenNameFinder</title>
<para>Learnable name finder</para>
<screen>
<![CDATA[
Usage: opennlp TokenNameFinder model1 model2 ... modelN < sentences
]]>
</screen>
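<para>For example, one or more name finder models can be applied to tokenized sentences read from standard input (the model and file names are placeholders):</para>
<screen>
<![CDATA[
$ opennlp TokenNameFinder en-ner-person.bin en-ner-location.bin < sentences.txt
]]>
</screen>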
</section>
<section id='tools.cli.namefind.TokenNameFinderTrainer'>
<title>TokenNameFinderTrainer</title>
<para>Trainer for the learnable name finder</para>
<screen>
<![CDATA[
Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
[-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile]
[-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language -model modelFile -data
sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of TokenNameFinderFactory
-resources resourcesDir
The resources directory
-type modelType
The type of the token name finder model
-featuregen featuregenFile
The feature generator descriptor file
-nameTypes types
name types to use for training
-sequenceCodec codec
sequence codec used to code name spans
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
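<para>For example, a training run on data in the native OpenNLP name finder format might look as follows; the language code and file names are placeholders:</para>
<screen>
<![CDATA[
# placeholder language code and file names
opennlp TokenNameFinderTrainer -lang en -model en-ner-person.bin -data en-ner-person.train -encoding UTF-8
]]>
</screen>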
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
<entry>lang</entry>
<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true, all hyphenated tokens will be separated (default: true).</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
<entry>lang</entry>
<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>bionlp2004</entry>
<entry>types</entry>
<entry>DNA,protein,cell_type,cell_line,RNA</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
<entry>lang</entry>
<entry>spa|nld</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>muc6</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='5' valign='middle'>brat</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>ruleBasedTokenizer</entry>
<entry>name</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>annotationConfig</entry>
<entry>annConfFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>bratDataDir</entry>
<entry>bratDataDir</entry>
<entry>No</entry>
<entry>Location of brat data dir</entry>
</row>
<row>
<entry>recursive</entry>
<entry>value</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>sentenceDetectorModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.namefind.TokenNameFinderEvaluator'>
<title>TokenNameFinderEvaluator</title>
<para>Measures the performance of the NameFinder model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp TokenNameFinderEvaluator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
[-nameTypes types] -model model [-misclassified true|false] [-detailedF true|false]
[-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-nameTypes types
name types to use for evaluation
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-detailedF true|false
if true (default) will print detailed FMeasure results.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
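<para>For example, a trained model could be evaluated against held-out data in the native format as follows (file names are placeholders):</para>
<screen>
<![CDATA[
# placeholder file names
opennlp TokenNameFinderEvaluator -model en-ner-person.bin -data en-ner-person.eval -encoding UTF-8
]]>
</screen>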
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
<entry>lang</entry>
<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true, all hyphenated tokens will be separated (default: true).</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
<entry>lang</entry>
<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>bionlp2004</entry>
<entry>types</entry>
<entry>DNA,protein,cell_type,cell_line,RNA</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
<entry>lang</entry>
<entry>spa|nld</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>muc6</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='5' valign='middle'>brat</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>ruleBasedTokenizer</entry>
<entry>name</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>annotationConfig</entry>
<entry>annConfFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>bratDataDir</entry>
<entry>bratDataDir</entry>
<entry>No</entry>
<entry>Location of brat data dir</entry>
</row>
<row>
<entry>recursive</entry>
<entry>value</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>sentenceDetectorModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.namefind.TokenNameFinderCrossValidator'>
<title>TokenNameFinderCrossValidator</title>
<para>K-fold cross validator for the learnable Name Finder</para>
<screen>
<![CDATA[
Usage: opennlp
TokenNameFinderCrossValidator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
[-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile]
[-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language [-misclassified
true|false] [-folds num] [-detailedF true|false] [-reportOutputFile outputFile] -data sampleData
[-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of TokenNameFinderFactory
-resources resourcesDir
The resources directory
-type modelType
The type of the token name finder model
-featuregen featuregenFile
The feature generator descriptor file
-nameTypes types
name types to use for training
-sequenceCodec codec
sequence codec used to code name spans
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-detailedF true|false
if true (default) will print detailed FMeasure results.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
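<para>For example, a 10-fold cross validation on native-format training data could be started as follows; note that, per the usage above, no model file is written (file names are placeholders):</para>
<screen>
<![CDATA[
# placeholder file names
opennlp TokenNameFinderCrossValidator -lang en -folds 10 -data en-ner-person.train -encoding UTF-8
]]>
</screen>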
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
<entry>lang</entry>
<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true, all hyphenated tokens will be separated (default: true).</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
<entry>lang</entry>
<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>bionlp2004</entry>
<entry>types</entry>
<entry>DNA,protein,cell_type,cell_line,RNA</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
<entry>lang</entry>
<entry>spa|nld</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>muc6</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='5' valign='middle'>brat</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>ruleBasedTokenizer</entry>
<entry>name</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>annotationConfig</entry>
<entry>annConfFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>bratDataDir</entry>
<entry>bratDataDir</entry>
<entry>No</entry>
<entry>Location of brat data dir</entry>
</row>
<row>
<entry>recursive</entry>
<entry>value</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>sentenceDetectorModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.namefind.TokenNameFinderConverter'>
<title>TokenNameFinderConverter</title>
<para>Converts foreign data formats (evalita, ad, conll03, bionlp2004, conll02, muc6, ontonotes, brat) to the native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll02|muc6|ontonotes|brat
[help|options...]
]]>
</screen>
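<para>For example, CoNLL03 data could be converted to the native format as follows; the input file name is a placeholder and the converted samples are written to standard output:</para>
<screen>
<![CDATA[
# placeholder input file; converted data goes to standard output
opennlp TokenNameFinderConverter conll03 -lang eng -types per,loc,org,misc -data eng.train -encoding UTF-8 > corpus.train
]]>
</screen>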
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
<entry>lang</entry>
<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true, all hyphenated tokens will be separated (default: true).</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
<entry>lang</entry>
<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>bionlp2004</entry>
<entry>types</entry>
<entry>DNA,protein,cell_type,cell_line,RNA</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
<entry>lang</entry>
<entry>spa|nld</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>types</entry>
<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='2' valign='middle'>muc6</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='5' valign='middle'>brat</entry>
<entry>tokenizerModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>ruleBasedTokenizer</entry>
<entry>name</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>annotationConfig</entry>
<entry>annConfFile</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry>bratDataDir</entry>
<entry>bratDataDir</entry>
<entry>No</entry>
<entry>Location of brat data dir</entry>
</row>
<row>
<entry>recursive</entry>
<entry>value</entry>
<entry>Yes</entry>
<entry></entry>
</row>
<row>
<entry>sentenceDetectorModel</entry>
<entry>modelFile</entry>
<entry>Yes</entry>
<entry></entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.namefind.CensusDictionaryCreator'>
<title>CensusDictionaryCreator</title>
<para>Converts 1990 US Census names into a dictionary</para>
<screen>
<![CDATA[
Usage: opennlp CensusDictionaryCreator [-encoding charsetName] [-lang code] -censusData censusDict -dict dict
Arguments description:
-encoding charsetName
-lang code
-censusData censusDict
-dict dict
]]>
</screen>
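<para>For example, assuming a local copy of the census surname frequency file (here called dist.all.last, a placeholder), a dictionary could be created as follows:</para>
<screen>
<![CDATA[
# placeholder file names
opennlp CensusDictionaryCreator -censusData dist.all.last -dict census-names.xml
]]>
</screen>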
</section>
</section>
<section id='tools.cli.postag'>
<title>Postag</title>
<section id='tools.cli.postag.POSTagger'>
<title>POSTagger</title>
<para>Learnable part of speech tagger</para>
<screen>
<![CDATA[
Usage: opennlp POSTagger model < sentences
]]>
</screen>
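<para>For example, assuming a hypothetical tagger model en-pos-maxent.bin and a file sentences.txt with one tokenized sentence per line, the tagger could be run as:</para>
<screen>
<![CDATA[
# placeholder file names
opennlp POSTagger en-pos-maxent.bin < sentences.txt
]]>
</screen>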
</section>
<section id='tools.cli.postag.POSTaggerTrainer'>
<title>POSTaggerTrainer</title>
<para>Trains a model for the part-of-speech tagger</para>
<screen>
<![CDATA[
Usage: opennlp POSTaggerTrainer[.ad|.conllx|.parse|.ontonotes|.conllu] [-factory factoryName] [-resources
resourcesDir] [-tagDictCutoff tagDictCutoff] [-featuregen featuregenFile] [-dict dictionaryPath]
[-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of POSTaggerFactory where to get implementation and resources.
-resources resourcesDir
The resources directory
-tagDictCutoff tagDictCutoff
TagDictionary cutoff. If specified will create/expand a mutable TagDictionary
-featuregen featuregenFile
The feature generator descriptor file
-dict dictionaryPath
The XML tag dictionary file
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
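<para>For example, a training run on data in the native OpenNLP format might be started as follows (file names are placeholders):</para>
<screen>
<![CDATA[
# placeholder file names
opennlp POSTaggerTrainer -lang en -model en-pos.bin -data en-pos.train -encoding UTF-8
]]>
</screen>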
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
<entry>Expand multiword expressions.</entry>
</row>
<row>
<entry>includeFeatures</entry>
<entry>includeFeatures</entry>
<entry>Yes</entry>
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.postag.POSTaggerEvaluator'>
<title>POSTaggerEvaluator</title>
<para>Measures the performance of the POS tagger model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp POSTaggerEvaluator[.ad|.conllx|.parse|.ontonotes|.conllu] -model model [-misclassified
true|false] [-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
<entry>Expand multiword expressions.</entry>
</row>
<row>
<entry>includeFeatures</entry>
<entry>includeFeatures</entry>
<entry>Yes</entry>
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.postag.POSTaggerCrossValidator'>
<title>POSTaggerCrossValidator</title>
<para>K-fold cross validator for the learnable POS tagger</para>
<screen>
<![CDATA[
Usage: opennlp POSTaggerCrossValidator[.ad|.conllx|.parse|.ontonotes|.conllu] [-misclassified true|false]
[-folds num] [-factory factoryName] [-resources resourcesDir] [-tagDictCutoff tagDictCutoff]
[-featuregen featuregenFile] [-dict dictionaryPath] [-params paramsFile] -lang language
[-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-factory factoryName
A sub-class of POSTaggerFactory where to get implementation and resources.
-resources resourcesDir
The resources directory
-tagDictCutoff tagDictCutoff
TagDictionary cutoff. If specified will create/expand a mutable TagDictionary
-featuregen featuregenFile
The feature generator descriptor file
-dict dictionaryPath
The XML tag dictionary file
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
<entry>Expand multiword expressions.</entry>
</row>
<row>
<entry>includeFeatures</entry>
<entry>includeFeatures</entry>
<entry>Yes</entry>
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.postag.POSTaggerConverter'>
<title>POSTaggerConverter</title>
<para>Converts foreign data formats (ad, conllx, parse, ontonotes, conllu) to the native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes|conllu [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
<entry>Expand multiword expressions.</entry>
</row>
<row>
<entry>includeFeatures</entry>
<entry>includeFeatures</entry>
<entry>Yes</entry>
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>conllx</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='1' valign='middle'>parse</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
</section>
<section id='tools.cli.lemmatizer'>
<title>Lemmatizer</title>
<section id='tools.cli.lemmatizer.LemmatizerME'>
<title>LemmatizerME</title>
<para>Learnable lemmatizer</para>
<screen>
<![CDATA[
Usage: opennlp LemmatizerME model < sentences
]]>
</screen>
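<para>For example, assuming a hypothetical model file en-lemmatizer.bin and an input file input.txt, the lemmatizer could be run as follows; input is read from standard input as shown in the usage above:</para>
<screen>
<![CDATA[
# placeholder file names
opennlp LemmatizerME en-lemmatizer.bin < input.txt
]]>
</screen>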
</section>
<section id='tools.cli.lemmatizer.LemmatizerTrainerME'>
<title>LemmatizerTrainerME</title>
<para>Trainer for the learnable lemmatizer</para>
<screen>
<![CDATA[
Usage: opennlp LemmatizerTrainerME[.conllu] [-factory factoryName] [-params paramsFile] -lang language -model
modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of LemmatizerFactory where to get implementation and resources.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
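<para>For example, a training run might be started as follows (file names are placeholders):</para>
<screen>
<![CDATA[
# placeholder file names
opennlp LemmatizerTrainerME -lang en -model en-lemmatizer.bin -data en-lemmatizer.train -encoding UTF-8
]]>
</screen>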
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.lemmatizer.LemmatizerEvaluator'>
<title>LemmatizerEvaluator</title>
<para>Measures the performance of the Lemmatizer model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp LemmatizerEvaluator[.conllu] -model model [-misclassified true|false] [-reportOutputFile
outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='2' valign='middle'>conllu</entry>
<entry>tagset</entry>
<entry>tagset</entry>
<entry>Yes</entry>
<entry>U|x: 'u' for unified tags and 'x' for language-specific part-of-speech tags.</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
</section>
<section id='tools.cli.chunker'>
<title>Chunker</title>
<section id='tools.cli.chunker.ChunkerME'>
<title>ChunkerME</title>
<para>Learnable chunker</para>
<screen>
<![CDATA[
Usage: opennlp ChunkerME model < sentences
]]>
</screen>
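<para>For example, the chunker can be fed tokenized, POS-tagged sentences, such as the output of the POSTagger tool; the model file names below are placeholders:</para>
<screen>
<![CDATA[
# placeholder model file names; the chunker reads tagged sentences from standard input
opennlp POSTagger en-pos-maxent.bin < sentences.txt | opennlp ChunkerME en-chunker.bin
]]>
</screen>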
</section>
<section id='tools.cli.chunker.ChunkerTrainerME'>
<title>ChunkerTrainerME</title>
<para>Trainer for the learnable chunker</para>
<screen>
<![CDATA[
Usage: opennlp ChunkerTrainerME[.ad] [-factory factoryName] [-params paramsFile] -lang language -model
modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of ChunkerFactory where to get implementation and resources.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
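<para>For example, a training run might be started as follows (file names are placeholders):</para>
<screen>
<![CDATA[
# placeholder file names
opennlp ChunkerTrainerME -lang en -model en-chunker.bin -data en-chunker.train -encoding UTF-8
]]>
</screen>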
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>end</entry>
<entry>end</entry>
<entry>Yes</entry>
<entry>Index of last sentence</entry>
</row>
<row>
<entry>start</entry>
<entry>start</entry>
<entry>Yes</entry>
<entry>Index of first sentence</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.chunker.ChunkerEvaluator'>
<title>ChunkerEvaluator</title>
<para>Measures the performance of the Chunker model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] [-detailedF true|false] -data
sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-detailedF true|false
if true (default) will print detailed FMeasure results.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>end</entry>
<entry>end</entry>
<entry>Yes</entry>
<entry>Index of last sentence</entry>
</row>
<row>
<entry>start</entry>
<entry>start</entry>
<entry>Yes</entry>
<entry>Index of first sentence</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.chunker.ChunkerCrossValidator'>
<title>ChunkerCrossValidator</title>
<para>K-fold cross validator for the chunker</para>
<screen>
<![CDATA[
Usage: opennlp ChunkerCrossValidator[.ad] [-factory factoryName] [-params paramsFile] -lang language
[-misclassified true|false] [-folds num] [-detailedF true|false] -data sampleData [-encoding
charsetName]
Arguments description:
-factory factoryName
A sub-class of ChunkerFactory where to get implementation and resources.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-detailedF true|false
if true (default) will print detailed FMeasure results.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>end</entry>
<entry>end</entry>
<entry>Yes</entry>
<entry>Index of last sentence</entry>
</row>
<row>
<entry>start</entry>
<entry>start</entry>
<entry>Yes</entry>
<entry>Index of first sentence</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.chunker.ChunkerConverter'>
<title>ChunkerConverter</title>
<para>Converts the ad data format to the native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp ChunkerConverter help|ad [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='4' valign='middle'>ad</entry>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>No</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
<entry>lang</entry>
<entry>language</entry>
<entry>No</entry>
<entry>Language which is being processed.</entry>
</row>
<row>
<entry>end</entry>
<entry>end</entry>
<entry>Yes</entry>
<entry>Index of last sentence</entry>
</row>
<row>
<entry>start</entry>
<entry>start</entry>
<entry>Yes</entry>
<entry>Index of first sentence</entry>
</row>
<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
</section>
<section id='tools.cli.parser'>
<title>Parser</title>
<section id='tools.cli.parser.Parser'>
<title>Parser</title>
<para>Performs full syntactic parsing</para>
<screen>
<![CDATA[
Usage: opennlp Parser [-bs n -ap n -k n -tk tok_model] model < sentences
-bs n: Use a beam size of n.
-ap f: Advance outcomes with at least f% of the probability mass.
-k n: Show the top n parses. This will also display their log-probabilities.
-tk tok_model: Use the specified tokenizer model to tokenize the sentences. Defaults to a WhitespaceTokenizer.
]]>
</screen>
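<para>For example, assuming a hypothetical parser model en-parser-chunking.bin and a file sentences.txt with one sentence per line, the top three parses per sentence could be printed as:</para>
<screen>
<![CDATA[
# placeholder file names; -k 3 shows the 3 best parses
opennlp Parser -k 3 en-parser-chunking.bin < sentences.txt
]]>
</screen>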
</section>
<section id='tools.cli.parser.ParserTrainer'>
<title>ParserTrainer</title>
<para>Trains the learnable parser</para>
<screen>
<![CDATA[
Usage: opennlp ParserTrainer[.ontonotes|.frenchtreebank] [-headRulesSerializerImpl className] -headRules
headRulesFile [-parserType CHUNKING|TREEINSERT] [-fun true|false] [-params paramsFile] -lang language
-model modelFile [-encoding charsetName] -data sampleData
Arguments description:
-headRulesSerializerImpl className
head rules artifact serializer class name
-headRules headRulesFile
head rules file.
-parserType CHUNKING|TREEINSERT
one of CHUNKING or TREEINSERT, default is CHUNKING.
-fun true|false
Learn to generate function tags.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
-data sampleData
data to be used, usually a file name.
]]>
</screen>
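<para>For example, training a chunking parser might look as follows; the head rules file and the other file names are placeholders:</para>
<screen>
<![CDATA[
# placeholder file names; -headRules and -data are required
opennlp ParserTrainer -lang en -parserType CHUNKING -headRules en_head_rules -model en-parser.bin -data train.parse -encoding UTF-8
]]>
</screen>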
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='1' valign='middle'>frenchtreebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.parser.ParserEvaluator'>
<title>ParserEvaluator</title>
<para>Measures the performance of the Parser model with the reference data</para>
<screen>
<![CDATA[
Usage: opennlp ParserEvaluator[.ontonotes|.frenchtreebank] -model model [-misclassified true|false] -data
sampleData [-encoding charsetName]
Arguments description:
-model model
the model file to be evaluated.
-misclassified true|false
if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='1' valign='middle'>frenchtreebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.parser.ParserConverter'>
<title>ParserConverter</title>
<para>Converts foreign data formats (ontonotes, frenchtreebank) to the native OpenNLP format</para>
<screen>
<![CDATA[
Usage: opennlp ParserConverter help|ontonotes|frenchtreebank [help|options...]
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='1' valign='middle'>frenchtreebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.parser.BuildModelUpdater'>
<title>BuildModelUpdater</title>
<para>Trains and updates the build model in a parser model</para>
<screen>
<![CDATA[
Usage: opennlp BuildModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang
language -data sampleData [-encoding charsetName]
Arguments description:
-model modelFile
output model file.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='1' valign='middle'>frenchtreebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.parser.CheckModelUpdater'>
<title>CheckModelUpdater</title>
<para>Trains and updates the check model in a parser model</para>
<screen>
<![CDATA[
Usage: opennlp CheckModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang
language -data sampleData [-encoding charsetName]
Arguments description:
-model modelFile
output model file.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
]]>
</screen>
<para>The supported formats and arguments are:</para>
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
<row>
<entry morerows='0' valign='middle'>ontonotes</entry>
<entry>ontoNotesDir</entry>
<entry>OntoNotes 4.0 corpus directory</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
<entry morerows='1' valign='middle'>frenchtreebank</entry>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
<entry>Data to be used, usually a file name.</entry>
</row>
<row>
<entry>encoding</entry>
<entry>charsetName</entry>
<entry>Yes</entry>
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
</tbody>
</tgroup></informaltable>
</section>
<section id='tools.cli.parser.TaggerModelReplacer'>
<title>TaggerModelReplacer</title>
<para>Replaces the tagger model in a parser model</para>
<screen>
<![CDATA[
Usage: opennlp TaggerModelReplacer parser.model tagger.model
]]>
</screen>
</section>
</section>
<section id='tools.cli.entitylinker'>
<title>Entitylinker</title>
<section id='tools.cli.entitylinker.EntityLinker'>
<title>EntityLinker</title>
<para>Links an entity to an external data set</para>
<screen>
<![CDATA[
Usage: opennlp EntityLinker model < sentences
]]>
</screen>
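<para>For example, with a placeholder model file linker.model and input redirected from a file, the tool follows the same pattern as the other stdin-based commands:</para>
<screen>
<![CDATA[
# placeholder file names
opennlp EntityLinker linker.model < input.txt
]]>
</screen>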
</section>
</section>
<section id='tools.cli.languagemodel'>
<title>Languagemodel</title>
<section id='tools.cli.languagemodel.NGramLanguageModel'>
<title>NGramLanguageModel</title>
<para>Gives the probability and most probable next token(s) of a sequence of tokens in a language model</para>
<screen>
<![CDATA[
Usage: opennlp NGramLanguageModel model
]]>
</screen>
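<para>For example, with a placeholder language model file ngram-lm.bin (this sketch assumes the token sequences to score are then supplied on standard input):</para>
<screen>
<![CDATA[
# placeholder model file name
opennlp NGramLanguageModel ngram-lm.bin
]]>
</screen>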
</section>
</section>
</chapter>