| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| <chapter id="tools.sentdetect"> |
| |
| <title>Sentence Detector</title> |
| |
| <section id="tools.sentdetect.detection"> |
| <title>Sentence Detection</title> |
| <para> |
| The OpenNLP Sentence Detector can detect whether a punctuation character |
| marks the end of a sentence. In this sense a sentence is defined |
| as the longest whitespace-trimmed character sequence between two punctuation |
| marks. The first and last sentences are exceptions to this rule: the first |
| non-whitespace character is assumed to be the beginning of a sentence, and the |
| last non-whitespace character is assumed to be the end of a sentence. |
| The sample text below should be segmented into its sentences. |
| <screen> |
| <![CDATA[ |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is |
| chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years |
| old and former chairman of Consolidated Gold Fields PLC, was named a director of this |
| British industrial conglomerate.]]> |
| </screen> |
| After detecting the sentence boundaries, each sentence is written on its own line. |
| <screen> |
| <![CDATA[ |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. |
| Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. |
| Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, |
| was named a director of this British industrial conglomerate.]]> |
| </screen> |
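| The abbreviations in this sample show why a trained model is needed. The following is a minimal sketch, not part of OpenNLP, of a hypothetical naive splitter that ends a sentence at every '.', '!' or '?' followed by whitespace. The class NaiveSplitterSketch and its split method are assumptions made for illustration. Applied to the second sample sentence, it wrongly cuts the text after "Mr.". |
| <programlisting language="java"> |
| <![CDATA[ |
```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSplitterSketch {

  // Ends a sentence at every '.', '!' or '?' that is followed by whitespace
  // or by the end of the text. No model is involved.
  static List<String> split(String text) {
    List<String> sentences = new ArrayList<>();
    int start = 0;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      boolean punctuation = c == '.' || c == '!' || c == '?';
      boolean boundary = i + 1 == text.length()
          || Character.isWhitespace(text.charAt(i + 1));
      if (punctuation && boundary) {
        addTrimmed(sentences, text.substring(start, i + 1));
        start = i + 1;
      }
    }
    addTrimmed(sentences, text.substring(start)); // text after the last mark
    return sentences;
  }

  // Adds the whitespace-trimmed candidate unless it is empty.
  private static void addTrimmed(List<String> sentences, String candidate) {
    String trimmed = candidate.trim();
    if (!trimmed.isEmpty()) {
      sentences.add(trimmed);
    }
  }

  public static void main(String[] args) {
    String text = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.";
    for (String sentence : split(text)) {
      System.out.println(sentence);
    }
  }
}
```
| ]]> |
| </programlisting> |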
| Sentence detection is usually done before the text is tokenized, and the pre-trained models on the web site are trained that way, |
| but it is also possible to tokenize first and let the Sentence Detector process the already tokenized text. |
| The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence of an article, where the title is mistakenly identified as the first part of the first sentence. |
| Most components in OpenNLP expect input that is segmented into sentences. |
| </para> |
| |
| <section id="tools.sentdetect.detection.cmdline"> |
| <title>Sentence Detection Tool</title> |
| <para> |
| The easiest way to try out the Sentence Detector is the command line tool. The tool is intended only for demonstration and testing. |
| Download the English sentence detector model and start the Sentence Detector Tool with this command: |
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetector en-sent.bin]]> |
| </screen> |
| Copy the sample text from above to the console. The Sentence Detector will read it and echo one sentence per line to the console. |
| Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command: |
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt]]> |
| </screen> |
| For the English sentence model from the website, the input text should not be tokenized. |
| </para> |
| </section> |
| <section id="tools.sentdetect.detection.api"> |
| <title>Sentence Detection API</title> |
| <para> |
| The Sentence Detector can be easily integrated into an application via its API. |
| To instantiate the Sentence Detector the sentence model must be loaded first. |
| <programlisting language="java"> |
| <![CDATA[ |
| |
| // Declared outside the try block so the model stays in scope afterwards. |
| SentenceModel model; |
| |
| try (InputStream modelIn = new FileInputStream("en-sent.bin")) { |
|   model = new SentenceModel(modelIn); |
| }]]> |
| </programlisting> |
| After the model is loaded the SentenceDetectorME can be instantiated. |
| <programlisting language="java"> |
| <![CDATA[ |
| SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);]]> |
| </programlisting> |
| The Sentence Detector can output an array of Strings, where each String is one sentence. |
| <programlisting language="java"> |
| <![CDATA[ |
| String[] sentences = sentenceDetector.sentDetect("  First sentence. Second sentence. ");]]> |
| </programlisting> |
| The result array now contains two entries. The first String is "First sentence." and the |
| second String is "Second sentence." Whitespace before, between and after the sentences is removed. |
| The API also offers a method which returns the spans of the sentences in the input string. |
| <programlisting language="java"> |
| <![CDATA[ |
| Span[] sentences = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");]]> |
| </programlisting> |
| The result array again contains two entries. The first span begins at index 2 and ends at |
| 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which covers only the characters in the span. |
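| As a minimal sketch of how these offsets relate to the input, the following stand-alone example uses only String.substring instead of the OpenNLP API. The class SpanOffsetsSketch, the helper coveredText and the hard-coded offsets are assumptions made for illustration. The begin offset is inclusive and the end offset is exclusive, and the input is assumed to start with two spaces and end with one. |
| <programlisting language="java"> |
| <![CDATA[ |
```java
public class SpanOffsetsSketch {

  // Returns the characters from start (inclusive) to end (exclusive).
  static String coveredText(String text, int start, int end) {
    return text.substring(start, end);
  }

  public static void main(String[] args) {
    // Two leading spaces, one space between the sentences, one trailing space.
    String text = "  First sentence. Second sentence. ";

    // Offsets taken from the spans described in the text above.
    System.out.println("[" + coveredText(text, 2, 17) + "]");
    System.out.println("[" + coveredText(text, 18, 34) + "]");
  }
}
```
| ]]> |
| </programlisting> |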
| </para> |
| </section> |
| </section> |
| <section id="tools.sentdetect.training"> |
| <title>Sentence Detector Training</title> |
| <para/> |
| <section id="tools.sentdetect.training.tool"> |
| <title>Training Tool</title> |
| <para> |
| OpenNLP has a command line tool which is used to train the models available from the model |
| download page on various corpora. The data must be converted to the OpenNLP Sentence Detector |
| training format, which is one sentence per line, exactly like the output in the sample above. |
| An empty line indicates a document boundary. If the document boundaries are unknown, it is |
| recommended to insert an empty line every few tens of sentences. |
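| The training format can be sketched in a few lines. The following stand-alone example is an illustration only and does not use the OpenNLP API. The class TrainingFormatSketch, the helper toTrainingFormat and the grouping of the sample sentences into two documents are assumptions. |
| <programlisting language="java"> |
| <![CDATA[ |
```java
import java.util.Arrays;
import java.util.List;

public class TrainingFormatSketch {

  // Writes one sentence per line and an empty line between documents.
  static String toTrainingFormat(List<List<String>> documents) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < documents.size(); i++) {
      if (i > 0) {
        out.append('\n'); // empty line marks a document boundary
      }
      for (String sentence : documents.get(i)) {
        out.append(sentence.trim()).append('\n');
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Hypothetical split of the sample text into two documents.
    List<List<String>> documents = Arrays.asList(
        Arrays.asList(
            "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.",
            "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group."),
        Arrays.asList(
            "Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, "
                + "was named a director of this British industrial conglomerate."));
    System.out.print(toTrainingFormat(documents));
  }
}
```
| ]]> |
| </programlisting> |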
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetectorTrainer |
| Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ |
| [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \ |
| -lang language -data sampleData [-encoding charsetName] |
| |
| Arguments description: |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
| To train an English sentence detector, use the following command: |
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8 |
| ]]> |
| </screen> |
| It should produce output similar to the following: |
| <screen> |
| <![CDATA[ |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 4883 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 4883 events to 2945. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 2945 |
| Number of Outcomes: 2 |
| Number of Predicates: 467 |
| ...done. |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: .. loglikelihood=-3384.6376826743144 0.38951464263772273 |
| 2: .. loglikelihood=-2191.9266688597672 0.9397911120212984 |
| 3: .. loglikelihood=-1645.8640771555981 0.9643661683391358 |
| 4: .. loglikelihood=-1340.386303774519 0.9739913987302887 |
| 5: .. loglikelihood=-1148.4141548519624 0.9748105672742167 |
| |
| ...<skipping a bunch of iterations>... |
| |
| 95: .. loglikelihood=-288.25556805874436 0.9834118369854598 |
| 96: .. loglikelihood=-287.2283680343481 0.9834118369854598 |
| 97: .. loglikelihood=-286.2174830344526 0.9834118369854598 |
| 98: .. loglikelihood=-285.222486981048 0.9834118369854598 |
| 99: .. loglikelihood=-284.24296917223916 0.9834118369854598 |
| 100: .. loglikelihood=-283.2785335773966 0.9834118369854598 |
| Wrote sentence detector model. |
| Path: en-sent.bin |
| ]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.sentdetect.training.api"> |
| <title>Training API</title> |
| <para> |
| The Sentence Detector also offers an API to train a new sentence detection model. |
| Three steps are necessary to train it: |
| <itemizedlist> |
| <listitem> |
| <para>Open a sample data stream</para> |
| </listitem> |
| <listitem> |
| <para>Call the SentenceDetectorME.train method</para> |
| </listitem> |
| <listitem> |
| <para>Save the SentenceModel to a file or use it directly</para> |
| </listitem> |
| </itemizedlist> |
| The following sample code illustrates these steps: |
| <programlisting language="java"> |
| <![CDATA[ |
| ObjectStream<String> lineStream = |
| new PlainTextByLineStream(new FileInputStream("en-sent.train"), StandardCharsets.UTF_8); |
| |
| SentenceModel model; |
| |
| try (ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) { |
| model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams()); |
| } |
| |
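| // modelFile is the destination of the serialized model and must be defined by the application. |
| try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) { |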
| model.serialize(modelOut); |
| }]]> |
| </programlisting> |
| </para> |
| </section> |
| </section> |
| <section id="tools.sentdetect.eval"> |
| <title>Evaluation</title> |
| <para> |
| </para> |
| <section id="tools.sentdetect.eval.tool"> |
| <title>Evaluation Tool</title> |
| <para> |
| The following command shows how to run the evaluator tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8 |
| |
| Loading model ... done |
| Evaluating ... done |
| |
| Precision: 0.9465737514518002 |
| Recall: 0.9095982142857143 |
| F-Measure: 0.9277177006260672]]> |
| </screen> |
| The en-sent.eval file has the same format as the training data. |
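| The reported F-Measure is consistent with the harmonic mean of precision and recall. The following stand-alone sketch, which does not use the OpenNLP API, recomputes it from the two values printed above. The class FMeasureSketch and its fMeasure helper are assumptions made for illustration. |
| <programlisting language="java"> |
| <![CDATA[ |
```java
public class FMeasureSketch {

  // Harmonic mean of precision and recall; returns 0 when both are 0.
  static double fMeasure(double precision, double recall) {
    double sum = precision + recall;
    return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
  }

  public static void main(String[] args) {
    // Values copied from the evaluator output shown above.
    double precision = 0.9465737514518002;
    double recall = 0.9095982142857143;
    System.out.println("F-Measure: " + fMeasure(precision, recall));
  }
}
```
| ]]> |
| </programlisting> |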
| </para> |
| </section> |
| </section> |
| </chapter> |