| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| |
<!-- ## Warning ## this content is autogenerated! Please fix issues in opennlp-tools/src/main/java/opennlp/tools/cmdline/GenerateManualTool.java
and execute the following command in the opennlp-tools folder to update this file:
| |
| mvn -e -q exec:java "-Dexec.mainClass=opennlp.tools.cmdline.GenerateManualTool" "-Dexec.args=../opennlp-docs/src/docbkx/cli.xml" |
| --> |
| |
| <chapter id='tools.cli'> |
| |
| <title>The Command Line Interface</title> |
| |
<para>This section details the available tools and parameters of the Command Line Interface. For an introduction to its usage, please refer to <xref linkend='intro.cli'/>. </para>
| |
| <section id='tools.cli.doccat'> |
| |
| <title>Doccat</title> |
| |
| <section id='tools.cli.doccat.Doccat'> |
| |
| <title>Doccat</title> |
| |
| <para>Learned document categorizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp Doccat model < documents |
| |
| ]]> |
| </screen> |
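<para>For example, assuming a previously trained categorizer model (the model and input file names below are illustrative), documents can be categorized by piping them to standard input:</para>

<screen>
<![CDATA[
$ opennlp Doccat en-doccat.bin < documents.txt

]]>
</screen>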
| </section> |
| |
| <section id='tools.cli.doccat.DoccatTrainer'> |
| |
| <title>DoccatTrainer</title> |
| |
| <para>Trainer for the learnable document categorizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-featureGenerators fg] [-tokenizer tokenizer] |
| [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of DoccatFactory where to get implementation and resources. |
| -featureGenerators fg |
| Comma separated feature generator classes. Bag of words is used if not specified. |
| -tokenizer tokenizer |
| Tokenizer implementation. WhitespaceTokenizer is used if not specified. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
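<para>For example, a categorizer model could be trained from labeled samples in the default format as follows (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp DoccatTrainer -lang en -model en-doccat.bin -data en-doccat.train -encoding UTF-8

]]>
</screen>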
| |
| </section> |
| |
| <section id='tools.cli.doccat.DoccatEvaluator'> |
| |
| <title>DoccatEvaluator</title> |
| |
<para>Measures the performance of the Doccat model against the reference data</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DoccatEvaluator[.leipzig] -model model [-misclassified true|false] [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
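<para>For example, a trained model could be evaluated against held-out reference data like this (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp DoccatEvaluator -model en-doccat.bin -data en-doccat.eval -encoding UTF-8

]]>
</screen>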
| |
| </section> |
| |
| <section id='tools.cli.doccat.DoccatCrossValidator'> |
| |
| <title>DoccatCrossValidator</title> |
| |
<para>K-fold cross validator for the learnable document categorizer</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DoccatCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory factoryName] |
| [-featureGenerators fg] [-tokenizer tokenizer] [-params paramsFile] -lang language [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of DoccatFactory where to get implementation and resources. |
| -featureGenerators fg |
| Comma separated feature generator classes. Bag of words is used if not specified. |
| -tokenizer tokenizer |
| Tokenizer implementation. WhitespaceTokenizer is used if not specified. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.doccat.DoccatConverter'> |
| |
| <title>DoccatConverter</title> |
| |
<para>Converts the Leipzig data format to the native OpenNLP format</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DoccatConverter help|leipzig [help|options...] |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.langdetect'> |
| |
| <title>Langdetect</title> |
| |
| <section id='tools.cli.langdetect.LanguageDetector'> |
| |
| <title>LanguageDetector</title> |
| |
| <para>Learned language detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LanguageDetector model < documents |
| |
| ]]> |
| </screen> |
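<para>For example, assuming a trained language detection model (the file names below are illustrative), the language of each input document can be predicted by piping the documents to standard input:</para>

<screen>
<![CDATA[
$ opennlp LanguageDetector langdetect.bin < documents.txt

]]>
</screen>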
| </section> |
| |
| <section id='tools.cli.langdetect.LanguageDetectorTrainer'> |
| |
| <title>LanguageDetectorTrainer</title> |
| |
| <para>Trainer for the learnable language detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LanguageDetectorTrainer[.leipzig] -model modelFile [-params paramsFile] [-factory factoryName] |
| -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -factory factoryName |
| A sub-class of LanguageDetectorFactory where to get implementation and resources. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
| <entry>Number of sentences per sample</entry> |
| </row> |
| <row> |
| <entry>samplesPerLanguage</entry> |
| <entry>samplesPerLanguage</entry> |
| <entry>No</entry> |
| <entry>Number of samples per language</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
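<para>For example, a detector could be trained from a directory of Leipzig corpora using the leipzig format; the file and directory names and parameter values below are illustrative:</para>

<screen>
<![CDATA[
$ opennlp LanguageDetectorTrainer.leipzig -model langdetect.bin -params langdetect.params \
    -sentencesDir leipzig-train -sentencesPerSample 5 -samplesPerLanguage 10000

]]>
</screen>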
| |
| </section> |
| |
| <section id='tools.cli.langdetect.LanguageDetectorConverter'> |
| |
| <title>LanguageDetectorConverter</title> |
| |
<para>Converts the Leipzig data format to the native OpenNLP format</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LanguageDetectorConverter help|leipzig [help|options...] |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
| <entry>Number of sentences per sample</entry> |
| </row> |
| <row> |
| <entry>samplesPerLanguage</entry> |
| <entry>samplesPerLanguage</entry> |
| <entry>No</entry> |
| <entry>Number of samples per language</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.langdetect.LanguageDetectorCrossValidator'> |
| |
| <title>LanguageDetectorCrossValidator</title> |
| |
<para>K-fold cross validator for the learnable language detector</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LanguageDetectorCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory |
| factoryName] [-params paramsFile] [-reportOutputFile outputFile] -data sampleData [-encoding |
| charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of LanguageDetectorFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
| <entry>Number of sentences per sample</entry> |
| </row> |
| <row> |
| <entry>samplesPerLanguage</entry> |
| <entry>samplesPerLanguage</entry> |
| <entry>No</entry> |
| <entry>Number of samples per language</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.langdetect.LanguageDetectorEvaluator'> |
| |
| <title>LanguageDetectorEvaluator</title> |
| |
<para>Measures the performance of the Language Detector model against the reference data</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LanguageDetectorEvaluator[.leipzig] -model model [-misclassified true|false] |
| [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>leipzig</entry> |
| <entry>sentencesDir</entry> |
| <entry>sentencesDir</entry> |
| <entry>No</entry> |
<entry>Directory with Leipzig sentences to be used</entry>
| </row> |
| <row> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
| <entry>Number of sentences per sample</entry> |
| </row> |
| <row> |
| <entry>samplesPerLanguage</entry> |
| <entry>samplesPerLanguage</entry> |
| <entry>No</entry> |
| <entry>Number of samples per language</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.dictionary'> |
| |
| <title>Dictionary</title> |
| |
| <section id='tools.cli.dictionary.DictionaryBuilder'> |
| |
| <title>DictionaryBuilder</title> |
| |
| <para>Builds a new dictionary</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DictionaryBuilder -outputFile out -inputFile in [-encoding charsetName] |
| |
| Arguments description: |
| -outputFile out |
| The dictionary file. |
| -inputFile in |
| Plain file with one entry per line |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
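<para>For example, a dictionary could be built from a plain text word list as follows (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp DictionaryBuilder -inputFile entries.txt -outputFile my-dict.xml -encoding UTF-8

]]>
</screen>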
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.tokenizer'> |
| |
| <title>Tokenizer</title> |
| |
| <section id='tools.cli.tokenizer.SimpleTokenizer'> |
| |
| <title>SimpleTokenizer</title> |
| |
| <para>Character class tokenizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp SimpleTokenizer < sentences |
| |
| ]]> |
| </screen> |
| </section> |
| |
| <section id='tools.cli.tokenizer.TokenizerME'> |
| |
| <title>TokenizerME</title> |
| |
| <para>Learnable tokenizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenizerME model < sentences |
| |
| ]]> |
| </screen> |
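<para>For example, assuming a trained tokenizer model (the file names below are illustrative), sentences can be tokenized by piping them to standard input:</para>

<screen>
<![CDATA[
$ opennlp TokenizerME en-token.bin < sentences.txt

]]>
</screen>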
| </section> |
| |
| <section id='tools.cli.tokenizer.TokenizerTrainer'> |
| |
| <title>TokenizerTrainer</title> |
| |
| <para>Trainer for the learnable tokenizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenizerTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] [-factory |
| factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenizerFactory where to get implementation and resources. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
<entry>If true, all hyphenated tokens will be separated (default: true)</entry>
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllu</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
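<para>For example, a tokenizer model could be trained from samples in the default OpenNLP format as follows (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp TokenizerTrainer -lang en -model en-token.bin -data en-token.train -encoding UTF-8

]]>
</screen>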
| |
| </section> |
| |
| <section id='tools.cli.tokenizer.TokenizerMEEvaluator'> |
| |
| <title>TokenizerMEEvaluator</title> |
| |
| <para>Evaluator for the learnable tokenizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenizerMEEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] -model |
| model [-misclassified true|false] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
<entry>If true, all hyphenated tokens will be separated (default: true)</entry>
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllu</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.tokenizer.TokenizerCrossValidator'> |
| |
| <title>TokenizerCrossValidator</title> |
| |
| <para>K-fold cross validator for the learnable tokenizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenizerCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.conllu] |
| [-misclassified true|false] [-folds num] [-factory factoryName] [-abbDict path] [-alphaNumOpt |
| isAlphaNumOpt] [-params paramsFile] -lang language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of TokenizerFactory where to get implementation and resources. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
<entry>If true, all hyphenated tokens will be separated (default: true)</entry>
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
<entry>Specifies the file with the detokenizer dictionary.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllu</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.tokenizer.TokenizerConverter'> |
| |
| <title>TokenizerConverter</title> |
| |
| <para>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,conllu) to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenizerConverter help|irishsentencebank|ad|pos|conllx|namefinder|parse|conllu |
| [help|options...] |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
<entry>Encoding for reading and writing text.</entry>
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
<entry>If true, all hyphenated tokens will be separated (default: true).</entry>
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllu</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
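<para>For example, a CoNLL-X corpus can be converted to the native tokenizer training format as follows (the file names are illustrative; <filename>latin-detokenizer.xml</filename> stands for any detokenizer dictionary):</para>

<screen>
<![CDATA[
$ opennlp TokenizerConverter conllx -detokenizer latin-detokenizer.xml \
    -data corpus.conllx -encoding UTF-8 > corpus.train
]]>
</screen>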
| |
| </section> |
| |
| <section id='tools.cli.tokenizer.DictionaryDetokenizer'> |
| |
| <title>DictionaryDetokenizer</title> |
| |
<para>Dictionary-based detokenizer which merges tokens back into their original, untokenized form</para>
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp DictionaryDetokenizer detokenizerDictionary |
| |
| ]]> |
| </screen> |
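<para>The tool reads whitespace-tokenized text from standard input and writes the detokenized text to standard output, for example (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp DictionaryDetokenizer latin-detokenizer.xml < tokenized.txt > detokenized.txt
]]>
</screen>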
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.sentdetect'> |
| |
| <title>Sentdetect</title> |
| |
| <section id='tools.cli.sentdetect.SentenceDetector'> |
| |
| <title>SentenceDetector</title> |
| |
| <para>Learnable sentence detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp SentenceDetector model < sentences |
| |
| ]]> |
| </screen> |
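<para>The detector reads raw text from standard input and writes one sentence per line to standard output, for example (the model and file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp SentenceDetector en-sent.bin < article.txt > sentences.txt
]]>
</screen>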
| </section> |
| |
| <section id='tools.cli.sentdetect.SentenceDetectorTrainer'> |
| |
| <title>SentenceDetectorTrainer</title> |
| |
| <para>Trainer for the learnable sentence detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp |
| SentenceDetectorTrainer[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt] |
| [-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of SentenceDetectorFactory where to get implementation and resources. |
| -eosChars string |
| EOS characters. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>includeTitles</entry> |
| <entry>includeTitles</entry> |
| <entry>Yes</entry> |
<entry>If true, sentences marked as headlines will be included.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>moses</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
<entry>Number of sentences per sample.</entry>
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>letsmt</entry> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>Yes</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
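<para>For example, a sentence detector model can be trained from a file in the native format, which contains one sentence per line (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en \
    -data en-sent.train -encoding UTF-8
]]>
</screen>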
| |
| </section> |
| |
| <section id='tools.cli.sentdetect.SentenceDetectorEvaluator'> |
| |
| <title>SentenceDetectorEvaluator</title> |
| |
| <para>Evaluator for the learnable sentence detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp |
| SentenceDetectorEvaluator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt] |
| -model model [-misclassified true|false] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>includeTitles</entry> |
| <entry>includeTitles</entry> |
| <entry>Yes</entry> |
<entry>If true, sentences marked as headlines will be included.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>moses</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
<entry>Number of sentences per sample.</entry>
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>letsmt</entry> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>Yes</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
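<para>For example, a trained model can be evaluated against held-out data in the native format (the file names are illustrative):</para>

<screen>
<![CDATA[
$ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8
]]>
</screen>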
| |
| </section> |
| |
| <section id='tools.cli.sentdetect.SentenceDetectorCrossValidator'> |
| |
| <title>SentenceDetectorCrossValidator</title> |
| |
| <para>K-fold cross validator for the learnable sentence detector</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp |
| SentenceDetectorCrossValidator[.irishsentencebank|.ad|.pos|.conllx|.namefinder|.parse|.moses|.conllu|.letsmt] |
| [-factory factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language |
| [-misclassified true|false] [-folds num] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of SentenceDetectorFactory where to get implementation and resources. |
| -eosChars string |
| EOS characters. |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>includeTitles</entry> |
| <entry>includeTitles</entry> |
| <entry>Yes</entry> |
<entry>If true, sentences marked as headlines will be included.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>moses</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
<entry>Number of sentences per sample.</entry>
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>letsmt</entry> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>Yes</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
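<para>For example, 10-fold cross validation can be run on training data in the native format, without producing a model file (the file name is illustrative):</para>

<screen>
<![CDATA[
$ opennlp SentenceDetectorCrossValidator -lang en -data en-sent.train \
    -encoding UTF-8 -folds 10
]]>
</screen>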
| |
| </section> |
| |
| <section id='tools.cli.sentdetect.SentenceDetectorConverter'> |
| |
| <title>SentenceDetectorConverter</title> |
| |
| <para>Converts foreign data formats (irishsentencebank,ad,pos,conllx,namefinder,parse,moses,conllu,letsmt) to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp SentenceDetectorConverter |
| help|irishsentencebank|ad|pos|conllx|namefinder|parse|moses|conllu|letsmt [help|options...] |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
<informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'>
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='1' valign='middle'>irishsentencebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>includeTitles</entry> |
| <entry>includeTitles</entry> |
| <entry>Yes</entry> |
<entry>If true, sentences marked as headlines will be included.</entry>
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>pos</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>namefinder</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>No</entry> |
| <entry>Specifies the file with detokenizer dictionary.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>moses</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>sentencesPerSample</entry> |
| <entry>No</entry> |
<entry>Number of sentences per sample.</entry>
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>letsmt</entry> |
| <entry>detokenizer</entry> |
| <entry>dictionary</entry> |
| <entry>Yes</entry> |
| <entry>Specifies the detokenizer dictionary file.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.namefind'> |
| |
| <title>Namefind</title> |
| |
| <section id='tools.cli.namefind.TokenNameFinder'> |
| |
| <title>TokenNameFinder</title> |
| |
| <para>Learnable name finder</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenNameFinder model1 model2 ... modelN < sentences |
| |
| ]]> |
| </screen> |
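| <para>For example, one or more name finder models can be applied to tokenized sentences read from standard input. The file names below are placeholders for your own model and input files:</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinder en-ner-person.bin < sentences.txt |
| ]]> |
| </screen> |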
| </section> |
| |
| <section id='tools.cli.namefind.TokenNameFinderTrainer'> |
| |
| <title>TokenNameFinderTrainer</title> |
| |
| <para>Trainer for the learnable name finder</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenNameFinderTrainer[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] |
| [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] |
| [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language -model modelFile -data |
| sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type modelType |
| The type of the token name finder model |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
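| <para>A typical invocation trains a model from training data in the native OpenNLP format. The language code and file names below are placeholders:</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderTrainer -lang en -model en-ner-person.bin \ |
|     -data en-ner-person.train -encoding UTF-8 |
| ]]> |
| </screen> |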
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>evalita</entry> |
| <entry>lang</entry> |
| <entry>it</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,gpe</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
| <entry>If true, all hyphenated tokens will be separated (default: true).</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll03</entry> |
| <entry>lang</entry> |
| <entry>eng|deu</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>bionlp2004</entry> |
| <entry>types</entry> |
| <entry>DNA,protein,cell_type,cell_line,RNA</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll02</entry> |
| <entry>lang</entry> |
| <entry>spa|nld</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>muc6</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='5' valign='middle'>brat</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>ruleBasedTokenizer</entry> |
| <entry>name</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>annotationConfig</entry> |
| <entry>annConfFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>bratDataDir</entry> |
| <entry>bratDataDir</entry> |
| <entry>No</entry> |
| <entry>Location of the brat data directory.</entry> |
| </row> |
| <row> |
| <entry>recursive</entry> |
| <entry>value</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>sentenceDetectorModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.namefind.TokenNameFinderEvaluator'> |
| |
| <title>TokenNameFinderEvaluator</title> |
| |
| <para>Measures the performance of the NameFinder model with the reference data</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenNameFinderEvaluator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] |
| [-nameTypes types] -model model [-misclassified true|false] [-detailedF true|false] |
| [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -nameTypes types |
| name types to use for evaluation |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
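| <para>For example, a trained model can be evaluated against held-out reference data as follows (file names are placeholders):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderEvaluator -model en-ner-person.bin \ |
|     -data en-ner-person.test -encoding UTF-8 |
| ]]> |
| </screen> |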
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>evalita</entry> |
| <entry>lang</entry> |
| <entry>it</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,gpe</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
| <entry>If true, all hyphenated tokens will be separated (default: true).</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll03</entry> |
| <entry>lang</entry> |
| <entry>eng|deu</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>bionlp2004</entry> |
| <entry>types</entry> |
| <entry>DNA,protein,cell_type,cell_line,RNA</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll02</entry> |
| <entry>lang</entry> |
| <entry>spa|nld</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>muc6</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='5' valign='middle'>brat</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>ruleBasedTokenizer</entry> |
| <entry>name</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>annotationConfig</entry> |
| <entry>annConfFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>bratDataDir</entry> |
| <entry>bratDataDir</entry> |
| <entry>No</entry> |
| <entry>Location of the brat data directory.</entry> |
| </row> |
| <row> |
| <entry>recursive</entry> |
| <entry>value</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>sentenceDetectorModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.namefind.TokenNameFinderCrossValidator'> |
| |
| <title>TokenNameFinderCrossValidator</title> |
| |
| <para>K-fold cross validator for the learnable Name Finder</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp |
| TokenNameFinderCrossValidator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat] |
| [-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile] |
| [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language [-misclassified |
| true|false] [-folds num] [-detailedF true|false] [-reportOutputFile outputFile] -data sampleData |
| [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of TokenNameFinderFactory |
| -resources resourcesDir |
| The resources directory |
| -type modelType |
| The type of the token name finder model |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -nameTypes types |
| name types to use for training |
| -sequenceCodec codec |
| sequence codec used to code name spans |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
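| <para>For example, a 10-fold cross validation over native-format training data could look as follows (file names are placeholders):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderCrossValidator -lang en -folds 10 \ |
|     -data en-ner-person.train -encoding UTF-8 |
| ]]> |
| </screen> |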
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>evalita</entry> |
| <entry>lang</entry> |
| <entry>it</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,gpe</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
| <entry>If true, all hyphenated tokens will be separated (default: true).</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll03</entry> |
| <entry>lang</entry> |
| <entry>eng|deu</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>bionlp2004</entry> |
| <entry>types</entry> |
| <entry>DNA,protein,cell_type,cell_line,RNA</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll02</entry> |
| <entry>lang</entry> |
| <entry>spa|nld</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>muc6</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='5' valign='middle'>brat</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>ruleBasedTokenizer</entry> |
| <entry>name</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>annotationConfig</entry> |
| <entry>annConfFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>bratDataDir</entry> |
| <entry>bratDataDir</entry> |
| <entry>No</entry> |
| <entry>Location of the brat data directory.</entry> |
| </row> |
| <row> |
| <entry>recursive</entry> |
| <entry>value</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>sentenceDetectorModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.namefind.TokenNameFinderConverter'> |
| |
| <title>TokenNameFinderConverter</title> |
| |
| <para>Converts foreign data formats (evalita,ad,conll03,bionlp2004,conll02,muc6,ontonotes,brat) to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll02|muc6|ontonotes|brat |
| [help|options...] |
| ]]> |
| </screen> |
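| <para>The converter writes the converted samples to standard output, so it is usually combined with a redirect. For example, converting CoNLL-02 Spanish data might look as follows (input and output file names are placeholders):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenNameFinderConverter conll02 -lang spa -types per,loc,org,misc \ |
|     -data esp.train -encoding ISO-8859-1 > es-ner.train |
| ]]> |
| </screen> |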
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='3' valign='middle'>evalita</entry> |
| <entry>lang</entry> |
| <entry>it</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,gpe</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>splitHyphenatedTokens</entry> |
| <entry>split</entry> |
| <entry>Yes</entry> |
| <entry>If true, all hyphenated tokens will be separated (default: true).</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll03</entry> |
| <entry>lang</entry> |
| <entry>eng|deu</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>bionlp2004</entry> |
| <entry>types</entry> |
| <entry>DNA,protein,cell_type,cell_line,RNA</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='3' valign='middle'>conll02</entry> |
| <entry>lang</entry> |
| <entry>spa|nld</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>types</entry> |
| <entry>per,loc,org,misc</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>muc6</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='5' valign='middle'>brat</entry> |
| <entry>tokenizerModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>ruleBasedTokenizer</entry> |
| <entry>name</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>annotationConfig</entry> |
| <entry>annConfFile</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>bratDataDir</entry> |
| <entry>bratDataDir</entry> |
| <entry>No</entry> |
| <entry>Location of the brat data directory.</entry> |
| </row> |
| <row> |
| <entry>recursive</entry> |
| <entry>value</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry>sentenceDetectorModel</entry> |
| <entry>modelFile</entry> |
| <entry>Yes</entry> |
| <entry></entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.namefind.CensusDictionaryCreator'> |
| |
| <title>CensusDictionaryCreator</title> |
| |
| <para>Converts 1990 US Census names into a dictionary</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp CensusDictionaryCreator [-encoding charsetName] [-lang code] -censusData censusDict -dict dict |
| |
| Arguments description: |
| -encoding charsetName |
| -lang code |
| -censusData censusDict |
| -dict dict |
| |
| ]]> |
| </screen> |
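| <para>For example, assuming a local copy of a 1990 US Census names file, a dictionary could be created as follows (both file names are placeholders):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp CensusDictionaryCreator -censusData dist.all.last -dict census-names.dict |
| ]]> |
| </screen> |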
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.postag'> |
| |
| <title>Postag</title> |
| |
| <section id='tools.cli.postag.POSTagger'> |
| |
| <title>POSTagger</title> |
| |
| <para>Learnable part of speech tagger</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp POSTagger model < sentences |
| |
| ]]> |
| </screen> |
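| <para>For example, the tagger reads tokenized sentences from standard input and writes the tagged sentences to standard output (the model file name is a placeholder):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp POSTagger en-pos-maxent.bin < sentences.txt |
| ]]> |
| </screen> |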
| </section> |
| |
| <section id='tools.cli.postag.POSTaggerTrainer'> |
| |
| <title>POSTaggerTrainer</title> |
| |
| <para>Trains a model for the part-of-speech tagger</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp POSTaggerTrainer[.ad|.conllx|.parse|.ontonotes|.conllu] [-factory factoryName] [-resources |
| resourcesDir] [-tagDictCutoff tagDictCutoff] [-featuregen featuregenFile] [-dict dictionaryPath] |
| [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of POSTaggerFactory where to get implementation and resources. |
| -resources resourcesDir |
| The resources directory |
| -tagDictCutoff tagDictCutoff |
| TagDictionary cutoff. If specified will create/expand a mutable TagDictionary |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -dict dictionaryPath |
| The XML tag dictionary file |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
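| <para>A typical invocation trains a tagger model from native-format training data (file names are placeholders):</para> |
| <screen> |
| <![CDATA[ |
| $ opennlp POSTaggerTrainer -lang en -model en-pos.bin \ |
|     -data en-pos.train -encoding UTF-8 |
| ]]> |
| </screen> |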
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>expandME</entry> |
| <entry>expandME</entry> |
| <entry>Yes</entry> |
| <entry>Expand multiword expressions.</entry> |
| </row> |
| <row> |
| <entry>includeFeatures</entry> |
| <entry>includeFeatures</entry> |
| <entry>Yes</entry> |
| <entry>Combine POS Tags with word features, like number and gender.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>u|x: u for unified tags, x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.postag.POSTaggerEvaluator'> |
| |
| <title>POSTaggerEvaluator</title> |
| |
| <para>Measures the performance of the POS tagger model with the reference data</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp POSTaggerEvaluator[.ad|.conllx|.parse|.ontonotes|.conllu] -model model [-misclassified |
| true|false] [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>expandME</entry> |
| <entry>expandME</entry> |
| <entry>Yes</entry> |
| <entry>Expand multiword expressions.</entry> |
| </row> |
| <row> |
| <entry>includeFeatures</entry> |
| <entry>includeFeatures</entry> |
| <entry>Yes</entry> |
| <entry>Combine POS Tags with word features, like number and gender.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>U|x: u for universal tags and x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
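| |
| <para>For example, to evaluate a POS tagger model against held-out data in the native format (the model and data file names below are illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp POSTaggerEvaluator -model en-pos-maxent.bin -data en-pos.test -encoding UTF-8 |
| |
| ]]> |
| </screen> |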
| |
| </section> |
| |
| <section id='tools.cli.postag.POSTaggerCrossValidator'> |
| |
| <title>POSTaggerCrossValidator</title> |
| |
| <para>K-fold cross validator for the learnable POS tagger</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp POSTaggerCrossValidator[.ad|.conllx|.parse|.ontonotes|.conllu] [-misclassified true|false] |
| [-folds num] [-factory factoryName] [-resources resourcesDir] [-tagDictCutoff tagDictCutoff] |
| [-featuregen featuregenFile] [-dict dictionaryPath] [-params paramsFile] -lang language |
| [-reportOutputFile outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -factory factoryName |
| A sub-class of POSTaggerFactory where to get implementation and resources. |
| -resources resourcesDir |
| The resources directory |
| -tagDictCutoff tagDictCutoff |
| TagDictionary cutoff. If specified will create/expand a mutable TagDictionary |
| -featuregen featuregenFile |
| The feature generator descriptor file |
| -dict dictionaryPath |
| The XML tag dictionary file |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>expandME</entry> |
| <entry>expandME</entry> |
| <entry>Yes</entry> |
| <entry>Expand multiword expressions.</entry> |
| </row> |
| <row> |
| <entry>includeFeatures</entry> |
| <entry>includeFeatures</entry> |
| <entry>Yes</entry> |
| <entry>Combine POS Tags with word features, like number and gender.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>U|x: u for universal tags and x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
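| |
| <para>For example, a 5-fold cross validation run over native-format training data (the file name below is illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp POSTaggerCrossValidator -lang en -folds 5 -data en-pos.train -encoding UTF-8 |
| |
| ]]> |
| </screen> |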
| |
| </section> |
| |
| <section id='tools.cli.postag.POSTaggerConverter'> |
| |
| <title>POSTaggerConverter</title> |
| |
| <para>Converts foreign data formats (ad,conllx,parse,ontonotes,conllu) to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes|conllu [help|options...] |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>expandME</entry> |
| <entry>expandME</entry> |
| <entry>Yes</entry> |
| <entry>Expand multiword expressions.</entry> |
| </row> |
| <row> |
| <entry>includeFeatures</entry> |
| <entry>includeFeatures</entry> |
| <entry>Yes</entry> |
| <entry>Combine POS Tags with word features, like number and gender.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>conllx</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>parse</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>U|x: u for universal tags and x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
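| |
| <para>The converter writes the converted data to standard output, so it is usually redirected to a file. For example, converting a CoNLL-U corpus to the native format (file names illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp POSTaggerConverter conllu -data en-ud-train.conllu -encoding UTF-8 > en-pos.train |
| |
| ]]> |
| </screen> |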
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.lemmatizer'> |
| |
| <title>Lemmatizer</title> |
| |
| <section id='tools.cli.lemmatizer.LemmatizerME'> |
| |
| <title>LemmatizerME</title> |
| |
| <para>Learnable lemmatizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LemmatizerME model < sentences |
| |
| ]]> |
| </screen> |
| </section> |
| |
| <section id='tools.cli.lemmatizer.LemmatizerTrainerME'> |
| |
| <title>LemmatizerTrainerME</title> |
| |
| <para>Trainer for the learnable lemmatizer</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LemmatizerTrainerME[.conllu] [-factory factoryName] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of LemmatizerFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>U|x: u for universal tags and x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
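| |
| <para>For example, training a lemmatizer model from native-format training data (file names illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerTrainerME -lang en -model en-lemmatizer.bin -data en-lemmatizer.train -encoding UTF-8 |
| |
| ]]> |
| </screen> |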
| |
| </section> |
| |
| <section id='tools.cli.lemmatizer.LemmatizerEvaluator'> |
| |
| <title>LemmatizerEvaluator</title> |
| |
| <para>Measures the performance of the Lemmatizer model against the reference data</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp LemmatizerEvaluator[.conllu] -model model [-misclassified true|false] [-reportOutputFile |
| outputFile] -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -reportOutputFile outputFile |
| the path of the fine-grained report file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='2' valign='middle'>conllu</entry> |
| <entry>tagset</entry> |
| <entry>tagset</entry> |
| <entry>Yes</entry> |
| <entry>U|x: u for universal tags and x for language-specific part-of-speech tags.</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
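| |
| <para>For example, evaluating a lemmatizer model against held-out data (file names illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp LemmatizerEvaluator -model en-lemmatizer.bin -data en-lemmatizer.test -encoding UTF-8 |
| |
| ]]> |
| </screen> |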
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.chunker'> |
| |
| <title>Chunker</title> |
| |
| <section id='tools.cli.chunker.ChunkerME'> |
| |
| <title>ChunkerME</title> |
| |
| <para>Learnable chunker</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ChunkerME model < sentences |
| |
| ]]> |
| </screen> |
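| <para>The chunker expects POS-tagged sentences on standard input, one per line, in the word_tag format produced by the POSTagger tool. For example (the model name is illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ echo "The_DT dog_NN barks_VBZ ._." | opennlp ChunkerME en-chunker.bin |
| |
| ]]> |
| </screen> |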
| </section> |
| |
| <section id='tools.cli.chunker.ChunkerTrainerME'> |
| |
| <title>ChunkerTrainerME</title> |
| |
| <para>Trainer for the learnable chunker</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ChunkerTrainerME[.ad] [-factory factoryName] [-params paramsFile] -lang language -model |
| modelFile -data sampleData [-encoding charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of ChunkerFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>end</entry> |
| <entry>end</entry> |
| <entry>Yes</entry> |
| <entry>Index of last sentence</entry> |
| </row> |
| <row> |
| <entry>start</entry> |
| <entry>start</entry> |
| <entry>Yes</entry> |
| <entry>Index of first sentence</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
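| |
| <para>For example, training a chunker model from native-format training data (file names illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerTrainerME -lang en -model en-chunker.bin -data en-chunker.train -encoding UTF-8 |
| |
| ]]> |
| </screen> |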
| |
| </section> |
| |
| <section id='tools.cli.chunker.ChunkerEvaluator'> |
| |
| <title>ChunkerEvaluator</title> |
| |
| <para>Measures the performance of the Chunker model against the reference data</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] [-detailedF true|false] -data |
| sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>end</entry> |
| <entry>end</entry> |
| <entry>Yes</entry> |
| <entry>Index of last sentence</entry> |
| </row> |
| <row> |
| <entry>start</entry> |
| <entry>start</entry> |
| <entry>Yes</entry> |
| <entry>Index of first sentence</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.chunker.ChunkerCrossValidator'> |
| |
| <title>ChunkerCrossValidator</title> |
| |
| <para>K-fold cross validator for the chunker</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ChunkerCrossValidator[.ad] [-factory factoryName] [-params paramsFile] -lang language |
| [-misclassified true|false] [-folds num] [-detailedF true|false] -data sampleData [-encoding |
| charsetName] |
| Arguments description: |
| -factory factoryName |
| A sub-class of ChunkerFactory where to get implementation and resources. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -folds num |
| number of folds, default is 10. |
| -detailedF true|false |
| if true (default) will print detailed FMeasure results. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>end</entry> |
| <entry>end</entry> |
| <entry>Yes</entry> |
| <entry>Index of last sentence</entry> |
| </row> |
| <row> |
| <entry>start</entry> |
| <entry>start</entry> |
| <entry>Yes</entry> |
| <entry>Index of first sentence</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
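| |
| <para>For example, a 10-fold cross validation run over native-format chunker training data (file name illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerCrossValidator -lang en -folds 10 -data en-chunker.train -encoding UTF-8 |
| |
| ]]> |
| </screen> |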
| |
| </section> |
| |
| <section id='tools.cli.chunker.ChunkerConverter'> |
| |
| <title>ChunkerConverter</title> |
| |
| <para>Converts ad data format to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ChunkerConverter help|ad [help|options...] |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='4' valign='middle'>ad</entry> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>No</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| <row> |
| <entry>lang</entry> |
| <entry>language</entry> |
| <entry>No</entry> |
| <entry>Language which is being processed.</entry> |
| </row> |
| <row> |
| <entry>end</entry> |
| <entry>end</entry> |
| <entry>Yes</entry> |
| <entry>Index of last sentence</entry> |
| </row> |
| <row> |
| <entry>start</entry> |
| <entry>start</entry> |
| <entry>Yes</entry> |
| <entry>Index of first sentence</entry> |
| </row> |
| <row> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
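| |
| <para>Like the other converters, this tool writes the converted data to standard output. For example, converting an Arvores Deitadas (ad) corpus (the file name and encoding below are illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp ChunkerConverter ad -lang pt -encoding ISO-8859-1 -data corpus.ad > pt-chunker.train |
| |
| ]]> |
| </screen> |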
| |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.parser'> |
| |
| <title>Parser</title> |
| |
| <section id='tools.cli.parser.Parser'> |
| |
| <title>Parser</title> |
| |
| <para>Performs full syntactic parsing</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp Parser [-bs n -ap n -k n -tk tok_model] model < sentences |
| -bs n: Use a beam size of n. |
| -ap f: Advance outcomes with at least f% of the probability mass. |
| -k n: Show the top n parses. This will also display their log-probabilities. |
| -tk tok_model: Use the specified tokenizer model to tokenize the sentences. Defaults to a WhitespaceTokenizer. |
| |
| ]]> |
| </screen> |
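| <para>For example, parsing a tokenized sentence with a pre-trained model (the model name refers to the publicly distributed English chunking parser model):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ echo "The quick brown fox jumps over the lazy dog ." | opennlp Parser en-parser-chunking.bin |
| |
| ]]> |
| </screen> |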
| </section> |
| |
| <section id='tools.cli.parser.ParserTrainer'> |
| |
| <title>ParserTrainer</title> |
| |
| <para>Trains the learnable parser</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ParserTrainer[.ontonotes|.frenchtreebank] [-headRulesSerializerImpl className] -headRules |
| headRulesFile [-parserType CHUNKING|TREEINSERT] [-fun true|false] [-params paramsFile] -lang language |
| -model modelFile [-encoding charsetName] -data sampleData |
| Arguments description: |
| -headRulesSerializerImpl className |
| head rules artifact serializer class name |
| -headRules headRulesFile |
| head rules file. |
| -parserType CHUNKING|TREEINSERT |
| one of CHUNKING or TREEINSERT, default is CHUNKING. |
| -fun true|false |
| Learn to generate function tags. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -model modelFile |
| output model file. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| -data sampleData |
| data to be used, usually a file name. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>frenchtreebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
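| |
| <para>For example, training a chunking parser from Penn Treebank style data (the head rules and data file names below are illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserTrainer -parserType CHUNKING -headRules en_head_rules -lang en -model en-parser.bin -data train.parse -encoding UTF-8 |
| |
| ]]> |
| </screen> |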
| |
| </section> |
| |
| <section id='tools.cli.parser.ParserEvaluator'> |
| |
| <title>ParserEvaluator</title> |
| |
| <para>Measures the performance of the Parser model against the reference data</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ParserEvaluator[.ontonotes|.frenchtreebank] -model model [-misclassified true|false] -data |
| sampleData [-encoding charsetName] |
| Arguments description: |
| -model model |
| the model file to be evaluated. |
| -misclassified true|false |
| if true will print false negatives and false positives. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>frenchtreebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.parser.ParserConverter'> |
| |
| <title>ParserConverter</title> |
| |
| <para>Converts foreign data formats (ontonotes,frenchtreebank) to native OpenNLP format</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp ParserConverter help|ontonotes|frenchtreebank [help|options...] |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>frenchtreebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.parser.BuildModelUpdater'> |
| |
| <title>BuildModelUpdater</title> |
| |
| <para>Trains and updates the build model in a parser model</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp BuildModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang |
| language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>frenchtreebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.parser.CheckModelUpdater'> |
| |
| <title>CheckModelUpdater</title> |
| |
| <para>Trains and updates the check model in a parser model</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp CheckModelUpdater[.ontonotes|.frenchtreebank] -model modelFile [-params paramsFile] -lang |
| language -data sampleData [-encoding charsetName] |
| Arguments description: |
| -model modelFile |
| output model file. |
| -params paramsFile |
| training parameters file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| |
| ]]> |
| </screen> |
| <para>The supported formats and arguments are:</para> |
| |
| <informaltable frame='all'><tgroup cols='5' align='left' colsep='1' rowsep='1'> |
| <thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead> |
| <tbody> |
| <row> |
| <entry morerows='0' valign='middle'>ontonotes</entry> |
| <entry>ontoNotesDir</entry> |
| <entry>OntoNotes 4.0 corpus directory</entry> |
| <entry>No</entry> |
| <entry></entry> |
| </row> |
| <row> |
| <entry morerows='1' valign='middle'>frenchtreebank</entry> |
| <entry>data</entry> |
| <entry>sampleData</entry> |
| <entry>No</entry> |
| <entry>Data to be used, usually a file name.</entry> |
| </row> |
| <row> |
| <entry>encoding</entry> |
| <entry>charsetName</entry> |
| <entry>Yes</entry> |
| <entry>Encoding for reading and writing text, if absent the system default is used.</entry> |
| </row> |
| </tbody> |
| </tgroup></informaltable> |
| |
| </section> |
| |
| <section id='tools.cli.parser.TaggerModelReplacer'> |
| |
| <title>TaggerModelReplacer</title> |
| |
| <para>Replaces the tagger model in a parser model</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp TaggerModelReplacer parser.model tagger.model |
| |
| ]]> |
| </screen> |
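| <para>For example, to swap a newly trained POS tagger model into an existing parser model (model names illustrative):</para> |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp TaggerModelReplacer en-parser-chunking.bin en-pos-maxent.bin |
| |
| ]]> |
| </screen> |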
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.entitylinker'> |
| |
| <title>Entitylinker</title> |
| |
| <section id='tools.cli.entitylinker.EntityLinker'> |
| |
| <title>EntityLinker</title> |
| |
| <para>Links an entity to an external data set</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp EntityLinker model < sentences |
| |
| ]]> |
| </screen> |
| </section> |
| |
| </section> |
| |
| <section id='tools.cli.languagemodel'> |
| |
| <title>Languagemodel</title> |
| |
| <section id='tools.cli.languagemodel.NGramLanguageModel'> |
| |
| <title>NGramLanguageModel</title> |
| |
| <para>Gives the probability and most probable next token(s) of a sequence of tokens in a language model</para> |
| |
| <screen> |
| <![CDATA[ |
| Usage: opennlp NGramLanguageModel model |
| |
| ]]> |
| </screen> |
| </section> |
| |
| </section> |
| |
| |
| |
| </chapter> |