| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <chapter id="tools.parser"> |
| |
| <title>Parser</title> |
| |
| <section id="tools.parser.parsing"> |
| <title>Parsing</title> |
| <para> |
| A parser returns a parse tree from a sentence according to a phrase structure grammar. A parse tree specifies |
| the internal structure of a sentence. For example, the following image represents a parse tree for |
| the sentence 'The cellphone was broken in two days': |
| |
| <imagedata fileref="images/parsetree1.png" scalefit="1"/> |
| |
| A parse tree can be used to determine the role of subtrees or constituents in the sentence. For example, it is possible to |
| know that 'The cellphone' is the subject of the sentence and the verb (action) is 'was broken.' |
| </para> |
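<para>
As a sketch of how such constituents can be inspected programmatically, the
bracketed representation above can be turned into a Parse object and its
subtrees traversed via the Parse API (Parse.parseParse, getType, and
getCoveredText are part of opennlp-tools; the class name ConstituentDemo is
chosen here for illustration only):
<programlisting language="java">
<![CDATA[
import opennlp.tools.parser.Parse;

public class ConstituentDemo {

  public static void main(String[] args) {
    // the example parse from above, in Penn Treebank bracket notation
    String bracketed =
        "(TOP (S (NP (DT The) (NN cellphone)) (VP (VBD was) (VP (VBN broken) "
        + "(PP (IN in) (NP (CD two) (NNS days))))) (. .)))";

    Parse topParse = Parse.parseParse(bracketed);

    // the S node is the child of TOP; iterate over its immediate constituents
    for (Parse constituent : topParse.getChildren()[0].getChildren()) {
      System.out.println(constituent.getType() + ": " + constituent.getCoveredText());
    }
  }
}]]>
</programlisting>
This prints each top-level constituent label of the sentence together with its
covered text, for example the subject NP 'The cellphone'.
</para>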
| |
| <section id="tools.parser.parsing.cmdline"> |
| <title>Parser Tool</title> |
| <para> |
| The easiest way to try out the Parser is the command line tool. |
| The tool is only intended for demonstration and testing. |
Download the English chunking parser model from our website and start the Parser
Tool with the following command:
| <screen> |
| <![CDATA[ |
| $ opennlp Parser en-parser-chunking.bin]]> |
| </screen> |
Loading the large parser model can take several seconds; be patient.
Copy this sample sentence to the console:
| <screen> |
| <![CDATA[ |
| The cellphone was broken in two days .]]> |
| </screen> |
| The parser should now print the following to the console. |
| <screen> |
| <![CDATA[ |
| (TOP (S (NP (DT The) (NN cellphone)) (VP (VBD was) (VP (VBN broken) (PP (IN in) (NP (CD two) (NNS days))))) (. .)))]]> |
| </screen> |
With the following command the input can be read from a file and written to an output file:
| <screen> |
| <![CDATA[ |
$ opennlp Parser en-parser-chunking.bin < article-tokenized.txt > article-parsed.txt]]>
| </screen> |
The article-tokenized.txt file must contain one sentence per line,
tokenized with the English tokenizer model from our website.
See the Tokenizer documentation for further details.
| </para> |
| </section> |
| <section id="tools.parser.parsing.api"> |
| <title>Parsing API</title> |
| <para> |
The Parser can easily be integrated into an application via its API.
To instantiate a Parser, the parser model must be loaded first.
| <programlisting language="java"> |
| <![CDATA[ |
ParserModel model = null;

InputStream modelIn = null;
try {
  modelIn = new FileInputStream("en-parser-chunking.bin");
  model = new ParserModel(modelIn);
}
catch (IOException e) {
  e.printStackTrace();
}
finally {
  if (modelIn != null) {
    try {
      modelIn.close();
    }
    catch (IOException e) {
      // failing to close the stream is not a problem here
    }
  }
}]]>
| </programlisting> |
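On Java 7 and later, the same model loading can be written more compactly with a
try-with-resources statement, which closes the stream automatically (a sketch of
the same logic as above):
<programlisting language="java">
<![CDATA[
try (InputStream modelIn = new FileInputStream("en-parser-chunking.bin")) {
  ParserModel model = new ParserModel(modelIn);
  // use the model here
}
catch (IOException e) {
  e.printStackTrace();
}]]>
</programlisting>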
Unlike the other components, the Parser should not be instantiated directly
via the new operator; a factory method should be used instead.
A parser model is trained either for the chunking parser or for the tree
insert parser, and the matching parser implementation must be chosen.
| The factory method will read a type parameter from the model and create |
| an instance of the corresponding parser implementation. |
| <programlisting language="java"> |
| <![CDATA[ |
| Parser parser = ParserFactory.create(model);]]> |
| </programlisting> |
Right now the tree insert parser is still experimental, and there is no pre-trained model for it.
The parser expects a whitespace-tokenized sentence. A utility method from the command
line tool can parse the sentence String. The following code shows how the parser can be called:
| <programlisting language="java"> |
| <![CDATA[ |
| String sentence = "The quick brown fox jumps over the lazy dog ."; |
| Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);]]> |
| </programlisting> |
| |
The topParses array contains only one parse because the number of parses is set to 1.
The Parse object contains the parse tree.
To display the parse tree, call the show method. Similar to Exception.printStackTrace,
it either prints the parse to the console or writes it into a provided StringBuffer.
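A minimal sketch, continuing from the snippet above (Parse.show() and
Parse.show(StringBuffer) are part of the opennlp-tools API):
<programlisting language="java">
<![CDATA[
Parse topParse = topParses[0];

// print the parse tree to the console
topParse.show();

// or collect the bracketed representation in a StringBuffer
StringBuffer sb = new StringBuffer();
topParse.show(sb);
String parseAsString = sb.toString();]]>
</programlisting>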
| </para> |
| <para> |
| TODO: Extend this section with more information about the Parse object. |
| </para> |
| </section> |
| </section> |
| <section id="tools.parser.training"> |
| <title>Parser Training</title> |
| <para> |
OpenNLP offers two different parser implementations, the chunking parser and the
treeinsert parser. The latter is still experimental and not recommended for production use.
(TODO: Add a section which explains the two different approaches)
The training can either be done with the command line tool or with the training API.
In the first case the training data must be available in the OpenNLP format, which is
the Penn Treebank format with the restriction of one sentence per line.
| <programlisting> |
| <![CDATA[ |
| (TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) )) |
| (TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))]]> |
| </programlisting> |
| Penn Treebank annotation guidelines can be found on the |
| <ulink url="https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html">Penn Treebank home page</ulink>. |
A parser model also contains a POS tagger model. Depending on the amount of available
training data, it is recommended to replace that tagger model with one which
was trained on a larger corpus. The pre-trained parser model provided on the website
does this to achieve better performance. (TODO: On which data is the model on
the website trained, and say on which data the tagger model is trained)
| </para> |
| <section id="tools.parser.training.tool"> |
| <title>Training Tool</title> |
| <para> |
OpenNLP has a command line tool which is used to train the models available from the
model download page on various corpora. The data must be converted to the OpenNLP parser
training format, which is briefly described above.
| To train the parser a head rules file is also needed. (TODO: Add documentation about the head rules file) |
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserTrainer |
| Usage: opennlp ParserTrainer -headRules headRulesFile [-parserType CHUNKING|TREEINSERT] \ |
| [-params paramsFile] [-iterations num] [-cutoff num] \ |
| -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -headRules headRulesFile |
| head rules file. |
| -parserType CHUNKING|TREEINSERT |
| one of CHUNKING or TREEINSERT, default is CHUNKING. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -format formatName |
| data format, might have its own parameters. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
| The model on the website was trained with the following command: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserTrainer -model en-parser-chunking.bin -parserType CHUNKING \ |
| -headRules head_rules \ |
| -lang en -data train.all -encoding ISO-8859-1 |
| ]]> |
| </screen> |
It is also possible to specify the cutoff and the number of iterations; these parameters
are used for all trained models. The -parserType parameter is optional; to use the
tree insertion parser, specify TREEINSERT as the type. The TaggerModelReplacer
tool replaces the tagger model inside the parser model with a new one.
| </para> |
| <para> |
| Note: The original parser model will be overwritten with the new parser model which |
| contains the replaced tagger model. |
| <screen> |
| <![CDATA[ |
| $ opennlp TaggerModelReplacer en-parser-chunking.bin en-pos-maxent.bin]]> |
| </screen> |
Additionally there are tools to retrain only the build model or the check model.
| </para> |
| </section> |
| |
| <section id="tools.parser.training.api"> |
| <title>Training API</title> |
| <para> |
| The Parser training API supports the training of a new parser model. |
| Four steps are necessary to train it: |
| <itemizedlist> |
| <listitem> |
| <para>A HeadRules class needs to be instantiated: currently EnglishHeadRules and AncoraSpanishHeadRules are available.</para> |
| </listitem> |
| <listitem> |
| <para>The application must open a sample data stream.</para> |
| </listitem> |
| <listitem> |
| <para>Call a Parser train method: This can be either the CHUNKING or the TREEINSERT parser.</para> |
| </listitem> |
| <listitem> |
| <para>Save the ParseModel to a file</para> |
| </listitem> |
| </itemizedlist> |
| The following code snippet shows how to instantiate the HeadRules: |
| <programlisting language="java"> |
| <![CDATA[ |
| static HeadRules createHeadRules(TrainerToolParams params) throws IOException { |
| |
| ArtifactSerializer headRulesSerializer = null; |
| |
| if (params.getHeadRulesSerializerImpl() != null) { |
| headRulesSerializer = ExtensionLoader.instantiateExtension(ArtifactSerializer.class, |
| params.getHeadRulesSerializerImpl()); |
| } |
| else { |
| if ("en".equals(params.getLang())) { |
| headRulesSerializer = new opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| else if ("es".equals(params.getLang())) { |
| headRulesSerializer = new opennlp.tools.parser.lang.es.AncoraSpanishHeadRules.HeadRulesSerializer(); |
| } |
| else { |
| // default for now, this case should probably cause an error ... |
| headRulesSerializer = new opennlp.tools.parser.lang.en.HeadRules.HeadRulesSerializer(); |
| } |
| } |
| |
| Object headRulesObject = headRulesSerializer.create(new FileInputStream(params.getHeadRules())); |
| |
| if (headRulesObject instanceof HeadRules) { |
| return (HeadRules) headRulesObject; |
| } |
| else { |
| throw new TerminateToolException(-1, "HeadRules Artifact Serializer must create an object of type HeadRules!"); |
| } |
| }]]> |
| </programlisting> |
The following code illustrates the three other steps: opening the data, training
the model, and saving the ParserModel to an output file.
| <programlisting language="java"> |
| <![CDATA[ |
ParserModel model = null;
File modelOutFile = params.getModel();
CmdLineUtil.checkOutputFile("parser model", modelOutFile);

ObjectStream<Parse> sampleStream = null;
try {
  HeadRules rules = createHeadRules(params);

  // default training parameters; these can also be loaded from a params file
  TrainingParameters mlParams = ModelUtil.createDefaultTrainingParameters();

  InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("parsing.train"));
  ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
  sampleStream = new ParseSampleStream(stringStream);

  ParserType type = parseParserType(params.getParserType());
  if (ParserType.CHUNKING.equals(type)) {
    model = opennlp.tools.parser.chunking.Parser.train(
        params.getLang(), sampleStream, rules, mlParams);
  } else if (ParserType.TREEINSERT.equals(type)) {
    model = opennlp.tools.parser.treeinsert.Parser.train(
        params.getLang(), sampleStream, rules, mlParams);
  }
}
catch (IOException e) {
  throw new TerminateToolException(-1, "IO error while reading training data or indexing data: "
      + e.getMessage(), e);
}
finally {
  if (sampleStream != null) {
    try {
      sampleStream.close();
    }
    catch (IOException e) {
      // closing the sample stream failed, nothing we can do about it
    }
  }
}
CmdLineUtil.writeModel("parser", modelOutFile, model);
| ]]> |
| </programlisting> |
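The last step can also be performed directly with the model's serialize method,
which is essentially what CmdLineUtil.writeModel does internally (a sketch,
reusing the model and modelOutFile variables from the snippet above; error
handling omitted):
<programlisting language="java">
<![CDATA[
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelOutFile))) {
  model.serialize(modelOut);
}]]>
</programlisting>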
| </para> |
| </section> |
| </section> |
| <section id="tools.parser.evaluation"> |
| <title>Parser Evaluation</title> |
| <para> |
The built-in evaluation can measure the parser performance. The performance is
measured on a test dataset.
| </para> |
| <section id="tools.parser.evaluation.tool"> |
| <title>Parser Evaluation Tool</title> |
| <para> |
| The following command shows how the tool can be run: |
| <screen> |
| <![CDATA[ |
| $ opennlp ParserEvaluator |
| Usage: opennlp ParserEvaluator[.ontonotes|frenchtreebank] [-misclassified true|false] -model model \ |
| -data sampleData [-encoding charsetName]]]> |
| </screen> |
A sample of the command, assuming you have a data sample named en-parser-chunking.eval
and you trained a model called en-parser-chunking.bin:
| <screen> |
| <![CDATA[ |
| $ opennlp ParserEvaluator -model en-parser-chunking.bin -data en-parser-chunking.eval -encoding UTF-8]]> |
| </screen> |
| and here is a sample output: |
| <screen> |
| <![CDATA[ |
| Precision: 0.9009744742967609 |
| Recall: 0.8962012400910446 |
| F-Measure: 0.8985815184245214]]> |
| </screen> |
| </para> |
| <para> |
| The Parser Evaluation tool reimplements the PARSEVAL scoring method |
| as implemented by the |
| <ulink url="http://nlp.cs.nyu.edu/evalb/">EVALB</ulink> |
| script, which is the most widely used evaluation |
tool for constituent parsing. Note, however, that currently the Parser
Evaluation tool does not allow exceptions to be made in the constituents being
evaluated, in the way Collins or Bikel usually do. Any contributions are very
welcome. If you want to contribute, please contact us on the mailing list or
comment on the jira issue
| <ulink url="https://issues.apache.org/jira/browse/OPENNLP-688">OPENNLP-688</ulink>. |
| </para> |
| </section> |
| <section id="tools.parser.evaluation.api"> |
| <title>Evaluation API</title> |
| <para> |
The evaluation can be performed on a pre-trained model and a test dataset, or via cross validation.
In the first case the model must be loaded and a Parse ObjectStream must be created (see the code samples above).
Assuming these two objects exist, the following code shows how to perform the evaluation:
| <programlisting language="java"> |
| <![CDATA[ |
| Parser parser = ParserFactory.create(model); |
| ParserEvaluator evaluator = new ParserEvaluator(parser); |
| evaluator.evaluate(sampleStream); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString());]]> |
| </programlisting> |
In the cross validation case all the training arguments must be
provided (see the Training API section above).
To perform cross validation, the ObjectStream must be resettable.
| <programlisting language="java"> |
| <![CDATA[ |
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("parsing.train"));
ObjectStream<String> stringStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
ObjectStream<Parse> sampleStream = new ParseSampleStream(stringStream);
ParserCrossValidator evaluator = new ParserCrossValidator("en", trainParameters, headRules,
    parserType, listeners.toArray(new ParserEvaluationMonitor[listeners.size()]));
| evaluator.evaluate(sampleStream, 10); |
| |
| FMeasure result = evaluator.getFMeasure(); |
| |
| System.out.println(result.toString());]]> |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| </chapter> |