| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <chapter id="tools.parser"> |
| |
| <title>Parser</title> |
| |
| <section id="tools.parser.parsing"> |
| <title>Parsing</title> |
| <para> |
| A parser returns a parse tree from a sentence according to a phrase structure grammar. A parse tree specifies |
| the internal structure of a sentence. For example, the following image represents a parse tree for |
| the sentence 'The cellphone was broken in two days': |
| |
| <imagedata fileref="images/parsetree1.png" scalefit="1"/> |
| |
| A parse tree can be used to determine the role of subtrees or constituents in the sentence. For example, it is possible to |
| know that 'The cellphone' is the subject of the sentence and the verb (action) is 'was broken.' |
| </para> |
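<para>
As a sketch of how such constituents can be inspected programmatically, the
bracketed representation above can be turned into a Parse object and its
subtrees traversed via the Parse API (Parse.parseParse, getType, and
getCoveredText are part of opennlp-tools; the class name ConstituentDemo is
chosen here for illustration only):
<programlisting language="java">
<![CDATA[
import opennlp.tools.parser.Parse;

public class ConstituentDemo {

  public static void main(String[] args) {
    // the example parse from above, in Penn Treebank bracket notation
    String bracketed =
        "(TOP (S (NP (DT The) (NN cellphone)) (VP (VBD was) (VP (VBN broken) "
        + "(PP (IN in) (NP (CD two) (NNS days))))) (. .)))";

    Parse topParse = Parse.parseParse(bracketed);

    // the S node is the child of TOP; iterate over its immediate constituents
    for (Parse constituent : topParse.getChildren()[0].getChildren()) {
      System.out.println(constituent.getType() + ": " + constituent.getCoveredText());
    }
  }
}]]>
</programlisting>
This prints each top-level constituent label of the sentence together with its
covered text, for example the subject NP 'The cellphone'.
</para>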
| |
| <section id="tools.parser.parsing.cmdline"> |
| <title>Parser Tool</title> |
| <para> |
| The easiest way to try out the Parser is the command line tool. |
| The tool is only intended for demonstration and testing. |
Download the English chunking parser model from our website and start the Parser
Tool with the following command:
| <screen> |
| <![CDATA[ |
| $ opennlp Parser en-parser-chunking.bin]]> |
| </screen> |
Loading the large parser model can take several seconds; be patient.
Copy this sample sentence to the console:
| <screen> |
| <![CDATA[ |
| The cellphone was broken in two days .]]> |
| </screen> |
| The parser should now print the following to the console. |
| <screen> |
| <![CDATA[ |
| (TOP (S (NP (DT The) (NN cellphone)) (VP (VBD was) (VP (VBN broken) (PP (IN in) (NP (CD two) (NNS days))))) (. .)))]]> |
| </screen> |
With the following command the input can be read from a file and written to an output file:
| <screen> |
| <![CDATA[ |
$ opennlp Parser en-parser-chunking.bin < article-tokenized.txt > article-parsed.txt]]>
| </screen> |
The article-tokenized.txt file must contain one sentence per line,
tokenized with the English tokenizer model from our website.
See the Tokenizer documentation for further details.
| </para> |
| </section> |
| <section id="tools.parser.parsing.api"> |
| <title>Parsing API</title> |
| <para> |
The Parser can easily be integrated into an application via its API.
To instantiate a Parser, the parser model must be loaded first.
| <programlisting language="java"> |
| <![CDATA[ |
ParserModel model = null;

InputStream modelIn = null;
try {
  modelIn = new FileInputStream("en-parser-chunking.bin");
  model = new ParserModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // failing to close the stream is not a problem here
    }
  }
}]]>
| </programlisting> |
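On Java 7 and later, the same model loading can be written more compactly with a
try-with-resources statement, which closes the stream automatically (a sketch of
the same logic as above):
<programlisting language="java">
<![CDATA[
try (InputStream modelIn = new FileInputStream("en-parser-chunking.bin")) {
  ParserModel model = new ParserModel(modelIn);
  // use the model here
}
catch (IOException e) {
  e.printStackTrace();
}]]>
</programlisting>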
Unlike the other components, the Parser should not be instantiated directly
via the new operator; a factory method should be used instead.
A parser model is trained either for the chunking parser or for the tree
insert parser, and the matching parser implementation must be chosen.
| The factory method will read a type parameter from the model and create |
| an instance of the corresponding parser implementation. |
| <programlisting language="java"> |
| <![CDATA[ |
| Parser parser = ParserFactory.create(model);]]> |
| </programlisting> |
Right now the tree insert parser is still experimental, and there is no pre-trained model for it.
The parser expects a whitespace-tokenized sentence. A utility method from the command
line tool can parse the sentence String. The following code shows how the parser can be called:
| <programlisting language="java"> |
| <![CDATA[ |
| String sentence = "The quick brown fox jumps over the lazy dog ."; |
| Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);]]> |
| </programlisting> |
| |
The topParses array contains only one parse because the number of parses is set to 1.
The Parse object contains the parse tree.
To display the parse tree, call the show method. Similar to Exception.printStackTrace,
it either prints the parse to the console or writes it into a provided StringBuffer.
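A minimal sketch, continuing from the snippet above (Parse.show() and
Parse.show(StringBuffer) are part of the opennlp-tools API):
<programlisting language="java">
<![CDATA[
Parse topParse = topParses[0];

// print the parse tree to the console
topParse.show();

// or collect the bracketed representation in a StringBuffer
StringBuffer sb = new StringBuffer();
topParse.show(sb);
String parseAsString = sb.toString();]]>
</programlisting>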
| </para> |
| <para> |
| TODO: Extend this section with more information about the Parse object. |
| </para> |
| </section> |
| </section> |
| <section id="tools.parser.training"> |
| <title>Parser Training</title> |
| <para> |
OpenNLP offers two different parser implementations, the chunking parser and the
treeinsert parser. The latter is still experimental and not recommended for production use.
(TODO: Add a section which explains the two different approaches)
The training can either be done with the command line tool or with the training API.
In the first case the training data must be available in the OpenNLP format, which is
the Penn Treebank format with the restriction of one sentence per line.
| <programlisting> |
| <![CDATA[ |
| (TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) )) |
| (TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))]]> |
| </programlisting> |
| Penn Treebank annotation guidelines can be found on the |
| <ulink url="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank home page</ulink>. |
A parser model also contains a POS tagger model. Depending on the amount of available
training data, it is recommended to replace that tagger model with one which
was trained on a larger corpus. The pre-trained parser model provided on the website
does this to achieve better performance. (TODO: On which data is the model on
the website trained, and say on which data the tagger model is trained)
| </para> |
| <section id="tools.parser.training.tool"> |
| <title>Training Tool</title> |
| <para> |
OpenNLP has a command line tool which is used to train the models available from the
model download page on various corpora. The data must be converted to the OpenNLP parser
training format, which is briefly described above.
| To train the parser a head rules file is also needed. (TODO: Add documentation about the head rules file) |
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserTrainer |
| Usage: opennlp ParserTrainer -headRules headRulesFile [-parserType CHUNKING|TREEINSERT] \ |
| [-params paramsFile] [-iterations num] [-cutoff num] \ |
| -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -headRules headRulesFile |
| head rules file. |
| -parserType CHUNKING|TREEINSERT |
| one of CHUNKING or TREEINSERT, default is CHUNKING. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -format formatName |
| data format, might have its own parameters. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
| The model on the website was trained with the following command: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserTrainer -model en-parser-chunking.bin -parserType CHUNKING \ |
| -headRules head_rules \ |
| -lang en -data train.all -encoding ISO-8859-1 |
| ]]> |
| </screen> |
It is also possible to specify the cutoff and the number of iterations; these parameters
are used for all trained models. The -parserType parameter is optional; to use the
tree insertion parser, specify TREEINSERT as the type. The TaggerModelReplacer
tool replaces the tagger model inside the parser model with a new one.
| </para> |
| <para> |
| Note: The original parser model will be overwritten with the new parser model which |
| contains the replaced tagger model. |
| <screen> |
| <![CDATA[ |
| $ opennlp TaggerModelReplacer en-parser-chunking.bin en-pos-maxent.bin]]> |
| </screen> |
Additionally there are tools to retrain only the build model or the check model.
| </para> |
| </section> |
| |
| <section id="tools.parser.training.api"> |
| <title>Training API</title> |
| <para> |
| The Parser training API supports the training of a new parser model. |
| Four steps are necessary to train it: |
| <itemizedlist> |
| <listitem> |
| <para>A HeadRules class needs to be instantiated: currently EnglishHeadRules and AncoraSpanishHeadRules are available.</para> |
| </listitem> |
| <listitem> |
| <para>The application must open a sample data stream.</para> |
| </listitem> |
| <listitem> |
| <para>Call a Parser train method: This can be either the CHUNKING or the TREEINSERT parser.</para> |
| </listitem> |
| <listitem> |
| <para>Save the ParseModel to a file</para> |
| </listitem> |
| </itemizedlist> |
| The following code snippet shows how to instantiate the HeadRules: |
| <programlisting language="java"> |
| <![CDATA[ |
| static HeadRules createHeadRules(TrainerToolParams params) throws IOException { |
| |
| ArtifactSerializer headRulesSerializer = null; |
| |
| if (params.getHeadRulesSerializerImpl() != null) { |
| headRulesSerializer = ExtensionLoader.instantiateExtension(ArtifactSerializer.class, |
| params.getHeadRulesSerializerImpl()); |
| } |
| else { |
| if ("en".equals(params.getLang())) { |
| headRulesSerializer = new opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| else if ("es".equals(params.getLang())) { |
| headRulesSerializer = new opennlp.tools.parser.lang.es.AncoraSpanishHeadRules.HeadRulesSerializer(); |
| } |
| else { |
| // default for now, this case should probably cause an error ... |
| headRulesSerializer = new opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| } |
| |
| Object headRulesObject = headRulesSerializer.create(new FileInputStream(params.getHeadRules())); |
| |
| if (headRulesObject instanceof HeadRules) { |
| return (HeadRules) headRulesObject; |
| } |
| else { |
| throw new TerminateToolException(-1, "HeadRules Artifact Serializer must create an object of type HeadRules!"); |
| } |
| }]]> |
| </programlisting> |
The following code illustrates the three other steps: opening the data, training
the model, and saving the ParserModel to an output file.
| <programlisting language="java"> |
| <![CDATA[ |
ParserModel model = null;
File modelOutFile = params.getModel();
CmdLineUtil.checkOutputFile("parser model", modelOutFile);

ObjectStream<Parse> sampleStream = null;
try {
  HeadRules rules = createHeadRules(params);

  // default training parameters; these can also be loaded from a params file
  TrainingParameters mlParams = ModelUtil.createDefaultTrainingParameters();

  InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("parsing.train"));
  ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
  sampleStream = new ParseSampleStream(stringStream);

  ParserType type = parseParserType(params.getParserType());
  if (ParserType.CHUNKING.equals(type)) {
    model = opennlp.tools.parser.chunking.Parser.train(
        params.getLang(), sampleStream, rules, mlParams);
  } else if (ParserType.TREEINSERT.equals(type)) {
    model = opennlp.tools.parser.treeinsert.Parser.train(
        params.getLang(), sampleStream, rules, mlParams);
  }
}
catch (IOException e) {
  throw new TerminateToolException(-1, "IO error while reading training data or indexing data: "
      + e.getMessage(), e);
}
finally {
  if (sampleStream != null) {
    try {
      sampleStream.close();
    }
    catch (IOException e) {
      // closing the sample stream failed, nothing we can do about it
    }
  }
}
CmdLineUtil.writeModel("parser", modelOutFile, model);
| ]]> |
| </programlisting> |
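The last step can also be performed directly with the model's serialize method,
which is essentially what CmdLineUtil.writeModel does internally (a sketch,
reusing the model and modelOutFile variables from the snippet above; error
handling omitted):
<programlisting language="java">
<![CDATA[
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
  model.serialize(modelOut);
}]]>
</programlisting>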
| </para> |
| </section> |
| </section> |
| <section id="tools.parser.evaluation"> |
| <title>Parser Evaluation</title> |
| <para> |
The built-in evaluation can measure the parser performance. The performance is
measured on a test dataset.
| </para> |
| <section id="tools.parser.evaluation.tool"> |
| <title>Parser Evaluation Tool</title> |
| <para> |
| The following command shows how the tool can be run: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserEvaluator |
| Usage: opennlp ParserEvaluator[.ontonotes|frenchtreebank] [-misclassified true|false] -model model \ |
| -data sampleData [-encoding charsetName]]]> |
| </screen> |
A sample of the command, assuming you have a data sample named en-parser-chunking.eval
and you trained a model called en-parser-chunking.bin:
| <screen> |
| <![CDATA[ |
| $ opennlp ParserEvaluator -model en-parser-chunking.bin -data en-parser-chunking.eval -encoding UTF-8]]> |
| </screen> |
| and here is a sample output: |
| <screen> |
| <![CDATA[ |
| Precision: 0.9009744742967609 |
| Recall: 0.8962012400910446 |
| F-Measure: 0.8985815184245214]]> |
| </screen> |
| </para> |
| <para> |
| The Parser Evaluation tool reimplements the PARSEVAL scoring method |
| as implemented by the |
| <ulink url="http://nlp.cs.nyu.edu/evalb/">EVALB</ulink> |
| script, which is the most widely used evaluation |
tool for constituent parsing. Note, however, that currently the Parser
Evaluation tool does not allow exceptions to be made in the constituents being
evaluated, in the way Collins or Bikel usually do. Any contributions are very
welcome. If you want to contribute, please contact us on the mailing list or
comment on the jira issue
| <ulink url="https://issues.apache.org/jira/browse/OPENNLP-688">OPENNLP-688</ulink>. |
| </para> |
| </section> |
| <section id="tools.parser.evaluation.api"> |
| <title>Evaluation API</title> |
| <para> |
The evaluation can be performed on a pre-trained model and a test dataset, or via cross validation.
In the first case the model must be loaded and a Parse ObjectStream must be created (see the code samples above).
Assuming these two objects exist, the following code shows how to perform the evaluation:
| <programlisting language="java"> |
| <![CDATA[ |
| Parser parser = ParserFactory.create(model); |
| ParserEvaluator evaluator = new ParserEvaluator(parser); |
| evaluator.evaluate(sampleStream); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString());]]> |
| </programlisting> |
In the cross validation case all the training arguments must be
provided (see the Training API section above).
To perform cross validation, the ObjectStream must be resettable.
| <programlisting language="java"> |
| <![CDATA[ |
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("parsing.train"));
ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
ObjectStream<Parse> sampleStream = new ParseSampleStream(stringStream);
ParserCrossValidator evaluator = new ParserCrossValidator("en", trainParameters, headRules,
    parserType, listeners.toArray(new ParserEvaluationMonitor[listeners.size()]));
| evaluator.evaluate(sampleStream, 10); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString());]]> |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| </chapter> |