blob: b67a7fdc76f7ef3f5dcc0dae89445d48ed2e73ca [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="tools.chunker">
<title>Chunker</title>
<section id="tools.parser.chunking">
<title>Chunking</title>
<para>
Text chunking consists of dividing a text in syntactically correlated parts of words,
like noun groups, verb groups, but does not specify their internal structure, nor their role in the main sentence.
</para>
<section id="tools.parser.chunking.cmdline">
<title>Chunker Tool</title>
<para>
The easiest way to try out the Chunker is the command line tool. The tool is only intended
for demonstration and testing.
</para>
<para>
Download the english maxent chunker model from the website and start the Chunker Tool with this command:
</para>
<para>
<screen>
<![CDATA[
$ opennlp ChunkerME en-chunker.bin]]>
</screen>
The Chunker now reads a pos tagged sentence per line from stdin.
Copy these two sentences to the console:
<screen>
<![CDATA[
Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD
a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP
to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.
Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD
additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]>
</screen>
The Chunker will now echo the sentences grouped tokens to the console:
<screen>
<![CDATA[
[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ]
[NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ]
[NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ]
[NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
[NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ]
[NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ]
[PP for_IN ] [NP the_DT planes_NNS ] ._.]]>
</screen>
The tag set used by the english pos model is the <ulink url="http://www.cis.upenn.edu/~treebank/">Penn Treebank tag set</ulink>.
</para>
</section>
<section id="tools.parser.chunking.api">
<title>Chunking API</title>
<para>
The Chunker can be embedded into an application via its API.
First the chunker model must be loaded into memory from disk or an other source.
In the sample below its loaded from disk.
<programlisting language="java">
<![CDATA[
InputStream modelIn = null;
ChunkerModel model = null;
try (modelIn = new FileInputStream("en-chunker.bin")){
model = new ChunkerModel(modelIn);
}]]>
</programlisting>
After the model is loaded a Chunker can be instantiated.
<programlisting language="java">
<![CDATA[
ChunkerME chunker = new ChunkerME(model);]]>
</programlisting>
The Chunker instance is now ready to tag data. It expects a tokenized sentence
as input, which is represented as a String array, each String object in the array
is one token, and the POS tags associated with each token.
</para>
<para>
The following code shows how to determine the most likely chunk tag sequence for a sentence.
<programlisting language="java">
<![CDATA[
String sent[] = new String[] { "Rockwell", "International", "Corp.", "'s",
"Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement",
"extending", "its", "contract", "with", "Boeing", "Co.", "to",
"provide", "structural", "parts", "for", "Boeing", "'s", "747",
"jetliners", "." };
String pos[] = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
"VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN",
"NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
"." };
String tag[] = chunker.chunk(sent, pos);]]>
</programlisting>
The tags array contains one chunk tag for each token in the input array. The corresponding
tag can be found at the same index as the token has in the input array.
The confidence scores for the returned tags can be easily retrieved from
a ChunkerME with the following method call:
<programlisting language="java">
<![CDATA[
double probs[] = chunker.probs();]]>
</programlisting>
The call to probs is stateful and will always return the probabilities of the last
tagged sentence. The probs method should only be called when the tag method
was called before, otherwise the behavior is undefined.
</para>
<para>
Some applications need to retrieve the n-best chunk tag sequences and not
only the best sequence.
The topKSequences method is capable of returning the top sequences.
It can be called in a similar way as chunk.
<programlisting language="java">
<![CDATA[
Sequence topSequences[] = chunk.topKSequences(sent, pos);]]>
</programlisting>
Each Sequence object contains one sequence. The sequence can be retrieved
via Sequence.getOutcomes() which returns a tags array
and Sequence.getProbs() returns the probability array for this sequence.
</para>
</section>
</section>
<section id="tools.chunker.training">
<title>Chunker Training</title>
<para>
The pre-trained models might not be available for a desired language,
can not detect important entities or the performance is not good enough outside the news domain.
</para>
<para>
These are the typical reason to do custom training of the chunker on a ne
corpus or on a corpus which is extended by private training data taken from the data which should be analyzed.
</para>
<para>
The training data can be converted to the OpenNLP chunker training format,
that is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
Other formats may also be available.
The train data consist of three columns separated one single space. Each word has been put on a
separate line and there is an empty line after each sentence. The first column contains
the current word, the second its part-of-speech tag and the third its chunk tag.
The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words
and I-VP for verb phrase words. Most chunk types have two types of chunk tags,
B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
Here is an example of the file format:
</para>
<para>
Sample sentence of the training data:
<screen>
<![CDATA[
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O]]>
</screen>
Note that for improved visualization the example above uses tabs instead of a single space as column separator.
</para>
<section id="tools.chunker.training.tool">
<title>Training Tool</title>
<para>
OpenNLP has a command line tool which is used to train the models available from the
model download page on various corpora.
</para>
<para>
Usage of the tool:
<screen>
<![CDATA[
$ opennlp ChunkerTrainerME
Usage: opennlp ChunkerTrainerME[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \
-model modelFile -lang language -data sampleData [-encoding charsetName]
Arguments description:
-params paramsFile
training parameters file.
-iterations num
number of training iterations, ignored if -params is used.
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
-model modelFile
output model file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.]]>
</screen>
Its now assumed that the english chunker model should be trained from a file called
en-chunker.train which is encoded as UTF-8. The following command will train the
name finder and write the model to en-chunker.bin:
<screen>
<![CDATA[
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8]]>
</screen>
Additionally its possible to specify the number of iterations, the cutoff and to overwrite
all types in the training data with a single type.
</para>
</section>
<section id="tools.chunker.training.api">
<title>Training API</title>
<para>
The Chunker offers an API to train a new chunker model. The following sample code
illustrates how to do it:
<programlisting language="java">
<![CDATA[
ObjectStream<String> lineStream =
new PlainTextByLineStream(new FileInputStream("en-chunker.train"), StandardCharsets.UTF_8);
ChunkerModel model;
try(ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream)) {
model = ChunkerME.train("en", sampleStream,
new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
}
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
model.serialize(modelOut);
}]]>
</programlisting>
</para>
</section>
</section>
<section id="tools.chunker.evaluation">
<title>Chunker Evaluation</title>
<para>
The built in evaluation can measure the chunker performance. The performance is either
measured on a test dataset or via cross validation.
</para>
<section id="tools.chunker.evaluation.tool">
<title>Chunker Evaluation Tool</title>
<para>
The following command shows how the tool can be run:
<screen>
<![CDATA[
$ opennlp ChunkerEvaluator
Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] \
[-detailedF true|false] -lang language -data sampleData [-encoding charsetName]]]>
</screen>
A sample of the command considering you have a data sample named en-chunker.eval
and you trained a model called en-chunker.bin:
<screen>
<![CDATA[
$ opennlp ChunkerEvaluator -model en-chunker.bin -data en-chunker.eval -encoding UTF-8]]>
</screen>
and here is a sample output:
<screen>
<![CDATA[
Precision: 0.9255923572240226
Recall: 0.9220610430991112
F-Measure: 0.9238233255623465]]>
</screen>
You can also use the tool to perform 10-fold cross validation of the Chunker.
he following command shows how the tool can be run:
<screen>
<![CDATA[
$ opennlp ChunkerCrossValidator
Usage: opennlp ChunkerCrossValidator[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \
[-misclassified true|false] [-folds num] [-detailedF true|false] \
-lang language -data sampleData [-encoding charsetName]
Arguments description:
-params paramsFile
training parameters file.
-iterations num
number of training iterations, ignored if -params is used.
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
-misclassified true|false
if true will print false negatives and false positives.
-folds num
number of folds, default is 10.
-detailedF true|false
if true will print detailed FMeasure results.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.]]>
</screen>
It is not necessary to pass a model. The tool will automatically split the data to train and evaluate:
<screen>
<![CDATA[
$ opennlp ChunkerCrossValidator -lang pt -data en-chunker.cross -encoding UTF-8]]>
</screen>
</para>
</section>
</section>
</chapter>