 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 ]>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->

 <chapter id="tools.sentdetect">

 	<title>Sentence Detector</title>

 	<section id="tools.sentdetect.detection">
 		<title>Sentence Detection</title>
 		<para>
 		The OpenNLP Sentence Detector can detect whether a punctuation character
 		marks the end of a sentence. In this sense a sentence is defined
 		as the longest whitespace-trimmed character sequence between two punctuation
 		marks. The first and last sentences are exceptions to this rule: the first
 		non-whitespace character is assumed to be the beginning of a sentence, and the
 		last non-whitespace character is assumed to be the end of a sentence.
 		The sample text below should be segmented into its sentences.
 		<screen>
 				<![CDATA[
 Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is
 chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
 old and former chairman of Consolidated Gold Fields PLC, was named a director of this
 British industrial conglomerate.]]>
 		</screen>
 		After the sentence boundaries are detected, each sentence is written on its own line.
 		<screen>
 				<![CDATA[
 Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
 Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
 Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
     was named a director of this British industrial conglomerate.]]>
 		</screen>
 		Usually sentence detection is done before the text is tokenized, and the pre-trained models on the web site are trained that way,
 		but it is also possible to tokenize first and let the Sentence Detector process the already tokenized text.
 		The OpenNLP Sentence Detector cannot identify sentence boundaries based on the contents of the sentence. A prominent example is the first sentence of an article, where the title is mistakenly identified as the first part of the first sentence.
 		Most components in OpenNLP expect input that is segmented into sentences.
 		</para>

 		<section id="tools.sentdetect.detection.cmdline">
 		<title>Sentence Detection Tool</title>
 		<para>
 		The easiest way to try out the Sentence Detector is the command line tool. The tool is only intended for demonstration and testing.
 		Download the English sentence detector model and start the Sentence Detector Tool with this command:
         <screen>
         <![CDATA[
 $ opennlp SentenceDetector en-sent.bin]]>
 		</screen>
 		Copy the sample text from above into the console. The Sentence Detector will read it and echo one sentence per line to the console.
 		Usually the input is read from a file and the output is redirected to another file. This can be achieved with the following command:
 		<screen>
 				<![CDATA[
 $ opennlp SentenceDetector en-sent.bin < input.txt > output.txt]]>
 		</screen>
 		For the English sentence model from the website, the input text should not be tokenized.
 		</para>
 		</section>
 		<section id="tools.sentdetect.detection.api">
 		<title>Sentence Detection API</title>
 		<para>
 		The Sentence Detector can be easily integrated into an application via its API.
 		To instantiate the Sentence Detector the sentence model must be loaded first.
 		<programlisting language="java">
 				<![CDATA[

 // Declared outside the try block so the model stays in scope after the stream is closed.
 SentenceModel model;

 try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
   model = new SentenceModel(modelIn);
 }]]>
 		</programlisting>
 		After the model is loaded the SentenceDetectorME can be instantiated.
 		<programlisting language="java">
 				<![CDATA[
 SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);]]>
 		</programlisting>
 		The Sentence Detector can output an array of Strings, where each String is one sentence.
 				<programlisting language="java">
 				<![CDATA[
 String sentences[] = sentenceDetector.sentDetect("  First sentence. Second sentence. ");]]>
 		</programlisting>
 		The result array now contains two entries. The first String is "First sentence." and the
         second String is "Second sentence." The whitespace before, between and after the sentences is removed.
 		The API also offers a method which returns only the spans of the sentences in the input string.
 		<programlisting language="java">
 				<![CDATA[
 Span sentences[] = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");]]>
 		</programlisting>
 		The result array again contains two entries. The first span begins at index 2 and ends at
             17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which covers only the characters in the span.
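 		The offsets above can be checked with plain Java. The following is a minimal sketch that does not use OpenNLP: the class name SpanOffsets and its covered method are hypothetical, and String.substring stands in for Span.getCoveredText under the assumption that the end offset of a span is exclusive.
 		<programlisting language="java">
 				<![CDATA[
 // Stand-alone check of the span offsets; no OpenNLP classes are used.
 class SpanOffsets {

   // Returns the text covered by [begin, end), with end exclusive (assumption).
   static String covered(String text, int begin, int end) {
     return text.substring(begin, end);
   }

   public static void main(String[] args) {
     String input = "  First sentence. Second sentence. ";
     System.out.println(covered(input, 2, 17));  // First sentence.
     System.out.println(covered(input, 18, 34)); // Second sentence.
   }
 }]]>
 		</programlisting>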
 		</para>
 		</section>
 	</section>
 	<section id="tools.sentdetect.training">
 		<title>Sentence Detector Training</title>
 		<para/>
 		<section id="tools.sentdetect.training.tool">
 		<title>Training Tool</title>
 		<para>
 		OpenNLP has a command line tool which is used to train the models available from the model
 		download page on various corpora. The data must be converted to the OpenNLP Sentence Detector
 		training format, which is one sentence per line. An empty line indicates a document boundary.
 		If the document boundaries are unknown, it is recommended to insert an empty line after every few tens of
 		sentences. The format is exactly like the output in the sample above.
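 		The training format can be illustrated with a small stand-alone sketch. It is not part of OpenNLP and is not how the trainer reads its data; the class TrainingFormat and its documents method are hypothetical names that only show how one sentence per line and empty-line document boundaries fit together.
 		<programlisting language="java">
 				<![CDATA[
 import java.util.ArrayList;
 import java.util.List;

 // Hypothetical helper, not part of OpenNLP: groups training lines into documents.
 class TrainingFormat {

   // Each non-empty line is one sentence; an empty line closes the current document.
   static List<List<String>> documents(String data) {
     List<List<String>> docs = new ArrayList<>();
     List<String> current = new ArrayList<>();
     for (String line : data.split("\n", -1)) {
       if (line.trim().isEmpty()) {
         if (!current.isEmpty()) {
           docs.add(current);
           current = new ArrayList<>();
         }
       } else {
         current.add(line.trim());
       }
     }
     if (!current.isEmpty()) {
       docs.add(current);
     }
     return docs;
   }

   public static void main(String[] args) {
     String data =
         "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.\n"
         + "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.\n"
         + "\n"
         + "First sentence.\n"
         + "Second sentence.\n";
     for (List<String> doc : documents(data)) {
       System.out.println(doc.size() + " sentences in document");
     }
   }
 }]]>
 		</programlisting>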
 		Usage of the tool:
 		<screen>
 				<![CDATA[
 $ opennlp SentenceDetectorTrainer
 Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
                [-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \
                -lang language -data sampleData [-encoding charsetName]

 Arguments description:
         -abbDict path
                 abbreviation dictionary in XML format.
         -params paramsFile
                 training parameters file.
         -iterations num
                 number of training iterations, ignored if -params is used.
         -cutoff num
                 minimal number of times a feature must be seen, ignored if -params is used.
         -model modelFile
                 output model file.
         -lang language
                 language which is being processed.
         -data sampleData
                 data to be used, usually a file name.
         -encoding charsetName
                 encoding for reading and writing text, if absent the system default is used.]]>
 	</screen>
 		To train an English sentence detector use the following command:
         <screen>
 				<![CDATA[
 $ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8
                         ]]>
         </screen>
             It should produce the following output:
             <screen>
                 <![CDATA[
 Indexing events using cutoff of 5

 	Computing event counts...  done. 4883 events
 	Indexing...  done.
 Sorting and merging events... done. Reduced 4883 events to 2945.
 Done indexing.
 Incorporating indexed data for training...
 done.
 	Number of Event Tokens: 2945
 	    Number of Outcomes: 2
 	  Number of Predicates: 467
 ...done.
 Computing model parameters...
 Performing 100 iterations.
   1:  .. loglikelihood=-3384.6376826743144	0.38951464263772273
   2:  .. loglikelihood=-2191.9266688597672	0.9397911120212984
   3:  .. loglikelihood=-1645.8640771555981	0.9643661683391358
   4:  .. loglikelihood=-1340.386303774519	0.9739913987302887
   5:  .. loglikelihood=-1148.4141548519624	0.9748105672742167

  ...<skipping a bunch of iterations>...

  95:  .. loglikelihood=-288.25556805874436	0.9834118369854598
  96:  .. loglikelihood=-287.2283680343481	0.9834118369854598
  97:  .. loglikelihood=-286.2174830344526	0.9834118369854598
  98:  .. loglikelihood=-285.222486981048	0.9834118369854598
  99:  .. loglikelihood=-284.24296917223916	0.9834118369854598
 100:  .. loglikelihood=-283.2785335773966	0.9834118369854598
 Wrote sentence detector model.
 Path: en-sent.bin
 ]]>
 		</screen>
 		</para>
 		</section>
 		<section id="tools.sentdetect.training.api">
 		<title>Training API</title>
 		<para>
 		The Sentence Detector also offers an API to train a new sentence detection model.
 		Basically three steps are necessary to train it:
 		<itemizedlist>
 				<listitem>
 					<para>The application must open a sample data stream</para>
 				</listitem>
 				<listitem>
 					<para>Call the SentenceDetectorME.train method</para>
 				</listitem>
 				<listitem>
 					<para>Save the SentenceModel to a file or directly use it</para>
 				</listitem>
 			</itemizedlist>
 			The following sample code illustrates these steps:
 					<programlisting language="java">
 				<![CDATA[
 ObjectStream<String> lineStream =
   new PlainTextByLineStream(new FileInputStream("en-sent.train"), StandardCharsets.UTF_8);

 SentenceModel model;

 try (ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) {
   model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
 }

 try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
   model.serialize(modelOut);
 }]]>
 		</programlisting>
 		</para>
 		</section>
 	</section>
 	<section id="tools.sentdetect.eval">
 		<title>Evaluation</title>
 		<para>
 		</para>
 		<section id="tools.sentdetect.eval.tool">
 			<title>Evaluation Tool</title>
 			<para>
                 The following command shows how to run the evaluator tool:
                 <screen>
 				<![CDATA[
 $ opennlp SentenceDetectorEvaluator -model en-sent.bin -data en-sent.eval -encoding UTF-8

 Loading model ... done
 Evaluating ... done

 Precision: 0.9465737514518002
 Recall: 0.9095982142857143
 F-Measure: 0.9277177006260672]]>
                 </screen>
                 The en-sent.eval file has the same format as the training data.
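                 The three numbers are related. Assuming the tool reports the balanced F-measure, that is, the harmonic mean of precision and recall (which is consistent with the values above), the relationship can be sketched as follows. The class FMeasureCheck is a hypothetical name and does not use the OpenNLP evaluation classes.
                 <programlisting language="java">
 				<![CDATA[
 // Stand-alone sketch of the balanced F-measure; not the OpenNLP implementation.
 class FMeasureCheck {

   // Harmonic mean of precision and recall; defined as 0 when both are 0.
   static double fMeasure(double precision, double recall) {
     if (precision + recall == 0.0) {
       return 0.0;
     }
     return 2 * precision * recall / (precision + recall);
   }

   public static void main(String[] args) {
     double precision = 0.9465737514518002;
     double recall = 0.9095982142857143;
     // Prints a value close to the F-Measure line reported by the evaluator.
     System.out.println(fMeasure(precision, recall));
   }
 }]]>
                 </programlisting>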
 			</para>
 		</section>
 	</section>
 </chapter>