 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 <chapter id="tools.chunker">

 	<title>Chunker</title>

 	<section id="tools.parser.chunking">
 		<title>Chunking</title>
 		<para>
 		Text chunking consists of dividing a text into syntactically correlated groups of words,
 		such as noun groups and verb groups, but it does not specify their internal structure or their role in the main sentence.
 		</para>

 		<section id="tools.parser.chunking.cmdline">
 		<title>Chunker Tool</title>
 		<para>
 		The easiest way to try out the Chunker is the command line tool. The tool is only intended
 		for demonstration and testing.
 		</para>
 		<para>
 		Download the English maxent chunker model from the website and start the Chunker Tool with this command:
 		</para>
 		<para>
         <screen>
 				<![CDATA[
 $ opennlp ChunkerME en-chunker.bin]]>
 		</screen>
 		The Chunker now reads one POS-tagged sentence per line from stdin.
 		Copy these two sentences to the console:
 		<screen>
 				<![CDATA[
 Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD
     a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP
     to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.
 Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD
     additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]>
 		</screen>
 		The Chunker will now echo the sentences to the console with the tokens grouped into chunks:
 				<screen>
 				<![CDATA[
 [NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ]
     [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ]
     [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ]
     [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
 [NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ]
     [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ]
     [PP for_IN ] [NP the_DT planes_NNS ] ._.]]>
 		</screen>
 		The tag set used by the English POS model is the <ulink url="http://www.cis.upenn.edu/~treebank/">Penn Treebank tag set</ulink>.
 		</para>
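 		<para>
 		The input format above joins each token and its POS tag with an underscore. As a minimal
 		sketch of how such a line could be separated into a token array and a tag array, the following
 		plain Java helper splits each item at its last underscore. The class name TaggedLineSplitter
 		and its split method are hypothetical and not part of OpenNLP.
 		<programlisting language="java">
 				<![CDATA[
import java.util.Arrays;

public class TaggedLineSplitter {

  /** Returns {tokens, tags} for a line of whitespace-separated word_TAG items. */
  public static String[][] split(String line) {
    String[] items = line.trim().split("\\s+");
    String[] tokens = new String[items.length];
    String[] tags = new String[items.length];
    for (int i = 0; i < items.length; i++) {
      // Split at the last underscore so tokens that contain '_' stay intact.
      int sep = items[i].lastIndexOf('_');
      if (sep < 0) {
        throw new IllegalArgumentException("Missing POS tag: " + items[i]);
      }
      tokens[i] = items[i].substring(0, sep);
      tags[i] = items[i].substring(sep + 1);
    }
    return new String[][] { tokens, tags };
  }

  public static void main(String[] args) {
    String[][] result = split("Rockwell_NNP said_VBD the_DT agreement_NN ._.");
    System.out.println(Arrays.toString(result[0]));
    System.out.println(Arrays.toString(result[1]));
  }
}]]>
 		</programlisting>
 		</para>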
 		</section>
 		<section id="tools.parser.chunking.api">
 		<title>Chunking API</title>
 		<para>
 		    The Chunker can be embedded into an application via its API.
 			First, the chunker model must be loaded into memory from disk or another source.
 			In the sample below it is loaded from disk.
 			<programlisting language="java">
 				<![CDATA[
 ChunkerModel model;

 // try-with-resources declares the stream and closes it automatically
 try (InputStream modelIn = new FileInputStream("en-chunker.bin")) {
   model = new ChunkerModel(modelIn);
 }]]>
 			</programlisting>
 			After the model is loaded, a Chunker can be instantiated.
 			<programlisting language="java">
 				<![CDATA[
 ChunkerME chunker = new ChunkerME(model);]]>
 			</programlisting>
 			The Chunker instance is now ready to tag data. It expects a tokenized sentence
 			as input, represented as a String array in which each String is one token,
 			together with a second String array that holds the POS tag of each token.
 	   </para>
 	   <para>
 	   The following code shows how to determine the most likely chunk tag sequence for a sentence.
 	   	<programlisting language="java">
 		  <![CDATA[
 String sent[] = new String[] { "Rockwell", "International", "Corp.", "'s",
     "Tulsa", "unit", "said", "it", "signed", "a", "tentative", "agreement",
     "extending", "its", "contract", "with", "Boeing", "Co.", "to",
     "provide", "structural", "parts", "for", "Boeing", "'s", "747",
     "jetliners", "." };

 String pos[] = new String[] { "NNP", "NNP", "NNP", "POS", "NNP", "NN",
     "VBD", "PRP", "VBD", "DT", "JJ", "NN", "VBG", "PRP$", "NN", "IN",
     "NNP", "NNP", "TO", "VB", "JJ", "NNS", "IN", "NNP", "POS", "CD", "NNS",
     "." };

 String tag[] = chunker.chunk(sent, pos);]]>
 			</programlisting>
 			The tag array contains one chunk tag for each token in the input array. The tag for a
 			token is found at the same index as that token in the input array.
 			The confidence scores for the returned tags can be retrieved from
 			a ChunkerME with the following method call:
 				   	<programlisting language="java">
 		  <![CDATA[
 double probs[] = chunker.probs();]]>
 			</programlisting>
 			The call to probs is stateful and always returns the probabilities of the last
 			chunked sentence. The probs method should only be called after the chunk method
 			has been called; otherwise the behavior is undefined.
 			</para>
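 			<para>
 			The per-token chunk tags can be turned back into bracketed phrases like the ones the
 			command line tool prints. The following is a minimal sketch in plain Java, assuming
 			well-formed B-TYPE, I-TYPE and O tags; the class ChunkTagGrouper and its group method
 			are hypothetical helpers and not part of the OpenNLP API.
 			<programlisting language="java">
 		  <![CDATA[
import java.util.ArrayList;
import java.util.List;

public class ChunkTagGrouper {

  /** Groups tokens into "[TYPE tok tok ]" phrases; tokens tagged O are emitted as-is. */
  public static List<String> group(String[] tokens, String[] chunkTags) {
    List<String> phrases = new ArrayList<>();
    StringBuilder current = null;   // the chunk being built, or null outside a chunk
    String currentType = null;      // type of the open chunk, e.g. "NP"
    for (int i = 0; i < tokens.length; i++) {
      String tag = chunkTags[i];
      String type = tag.equals("O") ? null : tag.substring(2);
      // An I- tag continues the open chunk only if the chunk type matches.
      boolean continues = tag.startsWith("I-") && type.equals(currentType);
      if (!continues) {
        if (current != null) {
          phrases.add(current.append(" ]").toString());
          current = null;
        }
        currentType = type;
        if (type != null) {
          current = new StringBuilder("[" + type);
        }
      }
      if (current != null) {
        current.append(' ').append(tokens[i]);
      } else {
        phrases.add(tokens[i]);
      }
    }
    if (current != null) {
      phrases.add(current.append(" ]").toString());
    }
    return phrases;
  }
}]]>
 			</programlisting>
 			With the arrays from the sample above, group(sent, tag) would return one entry per chunk
 			and one entry for each token outside any chunk.
 			</para>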
 			<para>
 			Some applications need to retrieve the n-best chunk tag sequences and not
 			only the best sequence.
 			The topKSequences method returns the top sequences.
 			It is called in the same way as chunk.
 			<programlisting language="java">
 		  <![CDATA[
 Sequence topSequences[] = chunker.topKSequences(sent, pos);]]>
 			</programlisting>
 			Each Sequence object contains one sequence. The tags can be retrieved
 			via Sequence.getOutcomes(), which returns the tags of the sequence,
 			and Sequence.getProbs() returns the probability array for this sequence.
 	  		 </para>
 		</section>
 	</section>
 	<section id="tools.chunker.training">
 		<title>Chunker Training</title>
 		<para>
 		The pre-trained models might not be available for a desired language,
 		might fail to detect important entities, or might not perform well enough outside the news domain.
 		</para>
 		<para>
 		These are the typical reasons to train the chunker on a new
 	    corpus, or on a corpus that is extended with private training data taken from the data to be analyzed.
 		</para>
 		<para>
 		The training data can be converted to the OpenNLP chunker training format,
 		which is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>.
         Other formats may also be available.
 		The training data consists of three columns separated by a single space. Each word is on a
 		separate line, and there is an empty line after each sentence. The first column contains
 		the current word, the second its part-of-speech tag, and the third its chunk tag.
 		The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words
 		and I-VP for verb phrase words. Most chunk types have two types of chunk tags:
 		B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
 		</para>
 		<para>
 		Sample sentence of the training data:
 		<screen>
 				<![CDATA[
 He        PRP  B-NP
 reckons   VBZ  B-VP
 the       DT   B-NP
 current   JJ   I-NP
 account   NN   I-NP
 deficit   NN   I-NP
 will      MD   B-VP
 narrow    VB   I-VP
 to        TO   B-PP
 only      RB   B-NP
 #         #    I-NP
 1.8       CD   I-NP
 billion   CD   I-NP
 in        IN   B-PP
 September NNP  B-NP
 .         .    O]]>
 		</screen>
 		Note that, for readability, the example above pads the columns with extra whitespace instead of using a single space as the column separator.
 		</para>
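 		<para>
 		To make the three-column layout concrete, the following plain Java sketch reads lines in
 		this format and groups them into sentences at each empty line. It is an illustration only:
 		the class ChunkTrainingDataReader and its nested Sentence type are hypothetical and are
 		not how OpenNLP itself reads training data. For leniency the sketch splits on any run
 		of whitespace, which is an assumption that goes beyond the single-space rule.
 		<programlisting language="java">
 				<![CDATA[
import java.util.ArrayList;
import java.util.List;

public class ChunkTrainingDataReader {

  /** One sentence: parallel lists of words, POS tags and chunk tags. */
  public static final class Sentence {
    public final List<String> words = new ArrayList<>();
    public final List<String> posTags = new ArrayList<>();
    public final List<String> chunkTags = new ArrayList<>();
  }

  public static List<Sentence> parse(List<String> lines) {
    List<Sentence> sentences = new ArrayList<>();
    Sentence current = new Sentence();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        // An empty line ends the current sentence.
        if (!current.words.isEmpty()) {
          sentences.add(current);
          current = new Sentence();
        }
        continue;
      }
      String[] cols = line.trim().split("\\s+");
      if (cols.length != 3) {
        throw new IllegalArgumentException("Expected 3 columns: " + line);
      }
      current.words.add(cols[0]);
      current.posTags.add(cols[1]);
      current.chunkTags.add(cols[2]);
    }
    // The last sentence may not be followed by an empty line.
    if (!current.words.isEmpty()) {
      sentences.add(current);
    }
    return sentences;
  }
}]]>
 		</programlisting>
 		</para>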
 		<section id="tools.chunker.training.tool">
 		<title>Training Tool</title>
 		<para>
 		OpenNLP has a command line tool which is used to train the models available from the
 		model download page on various corpora.
 		</para>
 		<para>
 		    Usage of the tool:
             <screen>
 				<![CDATA[
 $ opennlp ChunkerTrainerME
 Usage: opennlp ChunkerTrainerME[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \
                -model modelFile -lang language -data sampleData [-encoding charsetName]

 Arguments description:
         -params paramsFile
                 training parameters file.
         -iterations num
                 number of training iterations, ignored if -params is used.
         -cutoff num
                 minimal number of times a feature must be seen, ignored if -params is used.
         -model modelFile
                 output model file.
         -lang language
                 language which is being processed.
         -data sampleData
                 data to be used, usually a file name.
         -encoding charsetName
                 encoding for reading and writing text, if absent the system default is used.]]>
 		</screen>
 		It is now assumed that the English chunker model should be trained from a file called
 		en-chunker.train, which is encoded as UTF-8. The following command will train the
 		chunker and write the model to en-chunker.bin:
 		<screen>
 		<![CDATA[
 $ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8]]>
 		</screen>
 		Additionally, it is possible to specify the number of iterations and the cutoff,
 		as listed in the usage above.
 		</para>
 		</section>
         <section id="tools.chunker.training.api">
 		<title>Training API</title>
             <para>
                 The Chunker offers an API to train a new chunker model. The following sample code
                 illustrates how to do it:
                 <programlisting language="java">
                     <![CDATA[
 ObjectStream<String> lineStream =
     new PlainTextByLineStream(new FileInputStream("en-chunker.train"), StandardCharsets.UTF_8);

 ChunkerModel model;

 try(ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream)) {
   model = ChunkerME.train("en", sampleStream,
       new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
 }

 try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
   model.serialize(modelOut);
 }]]>
                 </programlisting>
             </para>
 		</section>
 	</section>

 	<section id="tools.chunker.evaluation">
 		<title>Chunker Evaluation</title>
 		<para>
 		The built-in evaluation can measure the chunker performance. The performance is measured
 		either on a test dataset or via cross validation.
 		</para>
 		<section id="tools.chunker.evaluation.tool">
 		<title>Chunker Evaluation Tool</title>
 		<para>
 		    The following command shows how the tool can be run:
             <screen>
 				<![CDATA[
 $ opennlp ChunkerEvaluator
 Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] \
                [-detailedF true|false] -lang language -data sampleData [-encoding charsetName]]]>
 		</screen>
 		For example, given an evaluation data set named en-chunker.eval
 		and a trained model called en-chunker.bin, the command looks like this:
             <screen>
 				<![CDATA[
 $ opennlp ChunkerEvaluator -model en-chunker.bin -data en-chunker.eval -encoding UTF-8]]>
 		</screen>
 		Here is a sample output:
 		<screen>
 		<![CDATA[
 Precision: 0.9255923572240226
 Recall: 0.9220610430991112
 F-Measure: 0.9238233255623465]]>
 		</screen>
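 		The figures in the sample output are consistent with the F-Measure being the balanced harmonic
 		mean of precision and recall. The following plain Java sketch recomputes it under that
 		assumption; the class FMeasureSketch is hypothetical and is not the evaluator's own code.
 		<programlisting language="java">
 		<![CDATA[
public class FMeasureSketch {

  /** Balanced F-measure: harmonic mean of precision and recall. */
  public static double fMeasure(double precision, double recall) {
    if (precision + recall == 0.0) {
      return 0.0; // avoid division by zero when nothing was predicted correctly
    }
    return 2 * precision * recall / (precision + recall);
  }

  public static void main(String[] args) {
    // Precision and recall taken from the sample output above.
    System.out.println(fMeasure(0.9255923572240226, 0.9220610430991112));
  }
}]]>
 		</programlisting>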
 		You can also use the tool to perform 10-fold cross validation of the Chunker.
 		The following command shows how the tool can be run:
         <screen>
 				<![CDATA[
 $ opennlp ChunkerCrossValidator
 Usage: opennlp ChunkerCrossValidator[.ad] [-params paramsFile] [-iterations num] [-cutoff num] \
                [-misclassified true|false] [-folds num] [-detailedF true|false] \
                -lang language -data sampleData [-encoding charsetName]

 Arguments description:
         -params paramsFile
                 training parameters file.
         -iterations num
                 number of training iterations, ignored if -params is used.
         -cutoff num
                 minimal number of times a feature must be seen, ignored if -params is used.
         -misclassified true|false
                 if true will print false negatives and false positives.
         -folds num
                 number of folds, default is 10.
         -detailedF true|false
                 if true will print detailed FMeasure results.
         -lang language
                 language which is being processed.
         -data sampleData
                 data to be used, usually a file name.
         -encoding charsetName
                 encoding for reading and writing text, if absent the system default is used.]]>
 		</screen>
 		It is not necessary to pass a model. The tool will automatically split the data to train and evaluate:
         <screen>
             <![CDATA[
 $ opennlp ChunkerCrossValidator -lang en -data en-chunker.cross -encoding UTF-8]]>
 		</screen>
 		</para>
 		</section>
 	</section>
 </chapter>