<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="opennlp">
<title>Introduction</title>
<section id="intro.description">
<title>Description</title>
<para>
The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
It supports the most common NLP tasks, such as tokenization, sentence segmentation,
part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
These tasks are usually required to build more advanced text processing services.
OpenNLP also includes maximum entropy and perceptron based machine learning.
</para>
<para>
The goal of the OpenNLP project is to create a mature toolkit for the tasks mentioned above.
An additional goal is to provide a large number of pre-built models for a variety of languages, as
well as the annotated text resources that those models are derived from.
</para>
</section>
<section id="intro.general.library.structure">
<title>General Library Structure</title>
<para>The Apache OpenNLP library contains several components, which together enable one
to build a full natural language processing pipeline. These components
include: sentence detector, tokenizer,
name finder, document categorizer, part-of-speech tagger, chunker, parser, and
coreference resolver. Each component provides facilities to execute the
respective natural language processing task, to train a model, and often also to
evaluate a model. Each of these facilities is accessible via its application program
interface (API). In addition, a command line interface (CLI) is provided for convenient
experimentation and training.
</para>
</section>
<section id="intro.api">
<title>Application Program Interface (API). Generic Example</title>
<para>
OpenNLP components have similar APIs. Normally, to execute a task,
one should provide a model and an input.
</para>
<para>
A model is usually loaded by passing an InputStream that reads the model file to the
constructor of the model class:
<programlisting language="java">
<![CDATA[
try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
SomeModel model = new SomeModel(modelIn);
}
]]>
</programlisting>
</para>
<para>
After the model is loaded, the tool itself can be instantiated:
<programlisting language="java">
<![CDATA[
ToolName toolName = new ToolName(model);]]>
</programlisting>
After the tool is instantiated, the processing task can be executed. The input and the
output formats are specific to the tool, but often the input is a String or an array
of Strings, and the output is an array of Strings.
<programlisting language="java">
<![CDATA[
String[] output = toolName.executeTask("This is a sample text.");]]>
</programlisting>
</para>
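<para>
As a concrete illustration, the generic pattern above can be instantiated with the
tokenizer component; the model file name en-token.bin is only an example, and any
trained tokenizer model can be substituted:
<programlisting language="java">
	<![CDATA[
// Load the tokenizer model (the file name is an example placeholder)
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
    TokenizerModel model = new TokenizerModel(modelIn);
    // Instantiate the tool with the loaded model
    Tokenizer tokenizer = new TokenizerME(model);
    // Execute the task: input is a String, output is an array of Strings
    String[] tokens = tokenizer.tokenize("This is a sample text.");
}
]]>
</programlisting>
The try-with-resources statement ensures the model input stream is closed after loading.
</para>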
</section>
<section id="intro.cli">
<title>Command line interface (CLI)</title>
<section id="intro.cli.description">
<title>Description</title>
<para>
OpenNLP provides a command line script serving as a single entry point to all
included tools. The script is located in the bin directory of the OpenNLP binary
distribution. Two versions are included: opennlp.bat for Windows, and opennlp for
Linux or compatible systems.
</para>
</section>
<section id="intro.cli.toolslist">
<title>List of tools</title>
<para>
The list of command line tools for Apache OpenNLP <?eval ${project.version}?>,
as well as a description of their arguments, is available in section <xref linkend="tools.cli"/>.
</para>
</section>
<section id="intro.cli.setup">
<title>Setting up</title>
<para>
The OpenNLP script uses the JAVA_CMD and JAVA_HOME environment variables to determine
which command to use to execute the Java virtual machine.
</para>
<para>
The OpenNLP script uses the OPENNLP_HOME environment variable to determine the location
of the OpenNLP binary distribution. It is recommended to point this variable to the
binary distribution of the current OpenNLP version and to update the PATH variable
to include $OPENNLP_HOME/bin (or %OPENNLP_HOME%\bin on Windows).
</para>
<para>
This configuration allows OpenNLP to be called conveniently. The examples below
assume this configuration has been done.
</para>
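<para>
For example, on Linux or compatible systems the variables might be set as follows;
the installation path shown is a placeholder and should be adjusted to the actual
location of the distribution:
<screen>
	<![CDATA[
$ export OPENNLP_HOME=/path/to/apache-opennlp
$ export PATH=$PATH:$OPENNLP_HOME/bin]]>
</screen>
</para>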
</section>
<section id="intro.cli.generic">
<title>Generic Example</title>
<para>
Apache OpenNLP provides a common command line script to access all its tools:
<screen>
<![CDATA[
$ opennlp]]>
</screen>
This script prints the current version of the library and lists all available tools:
<screen>
<![CDATA[
OpenNLP <VERSION>. Usage: opennlp TOOL
where TOOL is one of:
Doccat learnable document categorizer
DoccatTrainer trainer for the learnable document categorizer
DoccatConverter converts leipzig data format to native OpenNLP format
DictionaryBuilder builds a new dictionary
SimpleTokenizer character class tokenizer
TokenizerME learnable tokenizer
TokenizerTrainer trainer for the learnable tokenizer
TokenizerMEEvaluator evaluator for the learnable tokenizer
TokenizerCrossValidator K-fold cross validator for the learnable tokenizer
TokenizerConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format
DictionaryDetokenizer
SentenceDetector learnable sentence detector
SentenceDetectorTrainer trainer for the learnable sentence detector
SentenceDetectorEvaluator evaluator for the learnable sentence detector
SentenceDetectorCrossValidator K-fold cross validator for the learnable sentence detector
SentenceDetectorConverter converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format
TokenNameFinder learnable name finder
TokenNameFinderTrainer trainer for the learnable name finder
TokenNameFinderEvaluator Measures the performance of the NameFinder model with the reference data
TokenNameFinderCrossValidator K-fold cross validator for the learnable Name Finder
TokenNameFinderConverter converts foreign data formats (bionlp2004,conll03,conll02,ad) to native OpenNLP format
CensusDictionaryCreator Converts 1990 US Census names into a dictionary
POSTagger learnable part of speech tagger
POSTaggerTrainer trains a model for the part-of-speech tagger
POSTaggerEvaluator Measures the performance of the POS tagger model with the reference data
POSTaggerCrossValidator K-fold cross validator for the learnable POS tagger
POSTaggerConverter converts conllx data format to native OpenNLP format
ChunkerME learnable chunker
ChunkerTrainerME trainer for the learnable chunker
ChunkerEvaluator Measures the performance of the Chunker model with the reference data
ChunkerCrossValidator K-fold cross validator for the chunker
ChunkerConverter converts ad data format to native OpenNLP format
Parser performs full syntactic parsing
ParserTrainer trains the learnable parser
ParserEvaluator Measures the performance of the Parser model with the reference data
BuildModelUpdater trains and updates the build model in a parser model
CheckModelUpdater trains and updates the check model in a parser model
TaggerModelReplacer replaces the tagger model in a parser model
All tools print help when invoked with help parameter
Example: opennlp SimpleTokenizer help
]]>
</screen>
</para>
<para>OpenNLP tools have a similar command line structure and options. To discover the
options of a tool, run it with no parameters:
<screen>
<![CDATA[
$ opennlp ToolName]]>
</screen>
The tool will output two blocks of help.
</para>
<para>
The first block describes the general structure of this tool command line:
<screen>
<![CDATA[
Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ... -model modelFile ...]]>
</screen>
The general structure of this tool command line includes the obligatory tool name
(TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]),
the optional parameters ([-abbDict path] ...), and the obligatory parameters
(-model modelFile ...).
</para>
<para>
The format parameters enable direct processing of non-native data without conversion.
Each format might have its own parameters, which are displayed if the tool is
executed with no parameters or with the help parameter:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer.conllx help]]>
</screen>
<screen>
<![CDATA[
Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt isAlphaNumOpt] ...
Arguments description:
-abbDict path
abbreviation dictionary in XML format.
...]]>
</screen>
To switch the tool to a specific format, add a dot and the format name after
the tool name:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer.conllx -model en-pos.bin ...]]>
</screen>
</para>
<para>
The second block of the help message describes the individual arguments:
<screen>
<![CDATA[
Arguments description:
-type maxent|perceptron|perceptron_sequence
The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
-dict dictionaryPath
The XML tag dictionary file
...]]>
</screen>
</para>
<para>
Most tools for processing need to be provided with at least a model:
<screen>
<![CDATA[
$ opennlp ToolName lang-model-name.bin]]>
</screen>
When a tool is executed this way, the model is loaded and the tool waits for
input from standard input. This input is processed and the result printed to
standard output.
</para>
<para>Alternatively, and more commonly, console input and output redirection is used
to provide the input and output files:
<screen>
<![CDATA[
$ opennlp ToolName lang-model-name.bin < input.txt > output.txt]]>
</screen>
</para>
<para>
Most tools for model training need to be provided with a model name first,
optionally some training options (such as the model type or the number of iterations),
and then the data.
</para>
<para>
A model name is just a file name.
</para>
<para>
Training options often include the number of iterations, the cutoff, an abbreviations
dictionary, and so on. Sometimes it is possible to provide these options via a
training options file; in this case the command line options are ignored and the
ones from the file are used.
</para>
<para>
For the data, one has to specify its location (a file name) and often also the
language and encoding.
</para>
<para>
A generic example of a command line to launch a tool trainer might be:
<screen>
<![CDATA[
$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8]]>
</screen>
or with a format:
<screen>
<![CDATA[
$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train \
-types per -encoding UTF-8]]>
</screen>
</para>
<para>Most tools for model evaluation are similar to those for task execution, and
need to be provided with a model name first, optionally some evaluation options (such
as whether to print misclassified samples), and then the test data. A generic
example of a command line to launch an evaluation tool might be:
<screen>
<![CDATA[
$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8]]>
</screen>
</para>
</section>
</section>
</chapter>