<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<chapter id="tools.tokenizer">
<title>Tokenizer</title>
<section id="tools.tokenizer.introduction">
<title>Tokenization</title>
<para>
The OpenNLP Tokenizers segment an input character sequence into
tokens. Tokens are usually
words, punctuation, numbers, etc.
<screen>
<![CDATA[
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields
PLC, was named a director of this British industrial conglomerate.
]]>
</screen>
The following result shows the individual tokens in a whitespace-separated
representation.
<screen>
<![CDATA[
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC ,
was named a nonexecutive director of this British industrial conglomerate .
A form of asbestos once used to make Kent cigarette filters has caused a high
percentage of cancer deaths among a group of workers exposed to it more than 30 years ago ,
researchers reported .
]]>
</screen>
OpenNLP offers multiple tokenizer implementations:
<itemizedlist>
<listitem>
<para>Whitespace Tokenizer - A whitespace tokenizer; non-whitespace
sequences are identified as tokens</para>
</listitem>
<listitem>
<para>Simple Tokenizer - A character class tokenizer; sequences of
the same character class are tokens (see the example below)</para>
</listitem>
<listitem>
<para>Learnable Tokenizer - A maximum entropy tokenizer; detects
token boundaries based on a probability model</para>
</listitem>
</itemizedlist>
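For example, the difference between the first two can be seen on a
short sample (a sketch; the shared tokenizer instances are covered
in the API section below):
<programlisting language="java">
<![CDATA[
String input = "Hi. How are you?";

// The whitespace tokenizer splits on whitespace only:
// "Hi.", "How", "are", "you?"
String whitespaceTokens[] = WhitespaceTokenizer.INSTANCE.tokenize(input);

// The simple tokenizer additionally splits where the character class changes:
// "Hi", ".", "How", "are", "you", "?"
String simpleTokens[] = SimpleTokenizer.INSTANCE.tokenize(input);]]>
</programlisting>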
Most part-of-speech taggers, parsers, and so on work with text
tokenized in this manner. It is important to ensure that your
tokenizer produces tokens of the type expected by your later text
processing components.
</para>
<para>
With OpenNLP (as with many systems), tokenization is a two-stage
process:
first, sentence boundaries are identified, then tokens within
each
sentence are identified.
</para>
<section id="tools.tokenizer.cmdline">
<title>Tokenizer Tools</title>
<para>The easiest way to try out the tokenizers is via the command line
tools. The tools are only intended for demonstration and testing.
</para>
<para>There are two tools, one for the Simple Tokenizer and one for
the learnable tokenizer. A command line tool for the Whitespace
Tokenizer does not exist, because the whitespace-separated output
would be identical to the input.</para>
<para>
The following command shows how to use the Simple Tokenizer Tool.
<screen>
<![CDATA[
$ opennlp SimpleTokenizer]]>
</screen>
To use the learnable tokenizer, download the English token model from
our website.
<screen>
<![CDATA[
$ opennlp TokenizerME en-token.bin]]>
</screen>
To test the tokenizer, copy the sample text from above into the console.
The whitespace-separated tokens will be written back to the
console.
</para>
<para>
Usually the input is read from a file and the output is written to another file.
<screen>
<![CDATA[
$ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt]]>
</screen>
The Simple Tokenizer can be used in the same way.
</para>
<para>
Since most text comes truly raw, without marked sentence boundaries,
it is possible to create a pipe which first performs sentence
boundary detection and then tokenization. The following sample illustrates
that.
<screen>
<![CDATA[
$ opennlp SentenceDetector sentdetect.model < article.txt | opennlp TokenizerME tokenize.model | more
Loading model ... Loading model ... done
done
Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500.
Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 .
Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 .
Marubeni advanced 11 to 890 .
London share prices were bolstered largely by continued gains on Wall Street and technical
factors affecting demand for London 's blue-chip stocks .
...etc...]]>
</screen>
Of course this is all on the command line. Many people use the models
directly in their Java code by creating SentenceDetector and
Tokenizer objects and calling their methods as appropriate. The
following section explains how the Tokenizers can be used
directly from Java.
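As a preview, such a pipeline might look like the following sketch
(the model file names are placeholders; the sentence detector API is
covered in its own chapter):
<programlisting language="java">
<![CDATA[
String rawText = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.";

try (InputStream sentModelIn = new FileInputStream("en-sent.bin");
     InputStream tokenModelIn = new FileInputStream("en-token.bin")) {

  SentenceDetector sentenceDetector = new SentenceDetectorME(new SentenceModel(sentModelIn));
  Tokenizer tokenizer = new TokenizerME(new TokenizerModel(tokenModelIn));

  // First detect the sentence boundaries, then tokenize each sentence.
  for (String sentence : sentenceDetector.sentDetect(rawText)) {
    System.out.println(String.join(" ", tokenizer.tokenize(sentence)));
  }
}]]>
</programlisting>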
</para>
</section>
<section id="tools.tokenizer.api">
<title>Tokenizer API</title>
<para>
The Tokenizers can be integrated into an application via the defined
API.
The shared instance of the WhitespaceTokenizer can be retrieved from the
static field WhitespaceTokenizer.INSTANCE. The shared instance of the
SimpleTokenizer can be retrieved in the same way from
SimpleTokenizer.INSTANCE.
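For example, tokenizing with the shared SimpleTokenizer instance could
look like this (the WhitespaceTokenizer is used in the same way via
WhitespaceTokenizer.INSTANCE):
<programlisting language="java">
<![CDATA[
Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
String tokens[] = tokenizer.tokenize("An input sample sentence.");]]>
</programlisting>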
To instantiate the TokenizerME (the learnable tokenizer), a TokenizerModel
must be created first. The following code sample shows how a model
can be loaded.
<programlisting language="java">
<![CDATA[
TokenizerModel model;

// Load the model inside try-with-resources so the stream is closed;
// the model itself remains usable afterwards.
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
  model = new TokenizerModel(modelIn);
}]]>
</programlisting>
After the model is loaded the TokenizerME can be instantiated.
<programlisting language="java">
<![CDATA[
Tokenizer tokenizer = new TokenizerME(model);]]>
</programlisting>
The tokenizer offers two tokenize methods; both expect an input
String object which contains the untokenized text. If possible it
should be a sentence, but depending on the training of the learnable
tokenizer this is not required. The first method returns an array of
Strings, where each String is one token.
<programlisting language="java">
<![CDATA[
String tokens[] = tokenizer.tokenize("An input sample sentence.");]]>
</programlisting>
The output will be an array with these tokens.
<programlisting>
<![CDATA[
"An", "input", "sample", "sentence", "."]]>
</programlisting>
The second method, tokenizePos, returns an array of Spans. Each Span
contains the begin and end character offsets of the token in the input
String.
<programlisting language="java">
<![CDATA[
Span tokenSpans[] = tokenizer.tokenizePos("An input sample sentence.");]]>
</programlisting>
The tokenSpans array now contains 5 elements. To get the text for a
span, call Span.getCoveredText, which takes the input text and returns the covered characters.
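For example, printing each token together with its offsets might look
like this (same sample input as above):
<programlisting language="java">
<![CDATA[
String sentence = "An input sample sentence.";
Span tokenSpans[] = tokenizer.tokenizePos(sentence);

for (Span span : tokenSpans) {
  // getCoveredText returns the characters between the begin and end offset.
  System.out.println(span.getCoveredText(sentence) + " [" + span.getStart() + ".." + span.getEnd() + ")");
}]]>
</programlisting>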
The TokenizerME is able to output the probabilities for the detected
tokens. The getTokenProbabilities method must be called directly
after one of the tokenize methods was called.
<programlisting language="java">
<![CDATA[
TokenizerME tokenizer = ...
String tokens[] = tokenizer.tokenize(...);
double tokenProbs[] = tokenizer.getTokenProbabilities();]]>
</programlisting>
The tokenProbs array now contains one double value per token; the
value is between 0 and 1, where 1 is the highest possible probability
and 0 the lowest possible probability.
</para>
</section>
</section>
<section id="tools.tokenizer.training">
<title>Tokenizer Training</title>
<section id="tools.tokenizer.training.tool">
<title>Training Tool</title>
<para>
OpenNLP has a command line tool which is used to train the models
available from the model download page on various corpora. The data
can be converted to the OpenNLP Tokenizer training format or used directly.
The OpenNLP format contains one sentence per line. Tokens are either separated by
whitespace or by a special &lt;SPLIT&gt; tag. Tokens are split automatically on whitespace,
and at least one &lt;SPLIT&gt; tag must be present in the training text.
The following sample shows the data from above in the correct format.
<screen>
<![CDATA[
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>,
was named a nonexecutive director of this British industrial conglomerate<SPLIT>.]]>
</screen>
Usage of the tool:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer
Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
[-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations num] \
[-cutoff num] -model modelFile -lang language -data sampleData \
[-encoding charsetName]
Arguments description:
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-iterations num
number of training iterations, ignored if -params is used.
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
-model modelFile
output model file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.]]>
</screen>
To train the English tokenizer use the following command:
<screen>
<![CDATA[
$ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt -lang en -data en-token.train -encoding UTF-8
Indexing events using cutoff of 5
Computing event counts... done. 262271 events
Indexing... done.
Sorting and merging events... done. Reduced 262271 events to 59060.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 59060
Number of Outcomes: 2
Number of Predicates: 15695
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-181792.40419263614 0.9614292087192255
2: .. loglikelihood=-34208.094253153664 0.9629238459456059
3: .. loglikelihood=-18784.123872910015 0.9729211388220581
4: .. loglikelihood=-13246.88162585859 0.9856103038460219
5: .. loglikelihood=-10209.262670265718 0.9894422181636552
...<skipping a bunch of iterations>...
95: .. loglikelihood=-769.2107474529454 0.999511955191386
96: .. loglikelihood=-763.8891914534009 0.999511955191386
97: .. loglikelihood=-758.6685383254891 0.9995157680414533
98: .. loglikelihood=-753.5458314695236 0.9995157680414533
99: .. loglikelihood=-748.5182305519613 0.9995157680414533
100: .. loglikelihood=-743.5830058068038 0.9995157680414533
Wrote tokenizer model.
Path: en-token.bin]]>
</screen>
</para>
</section>
<section id="tools.tokenizer.training.api">
<title>Training API</title>
<para>
The Tokenizer offers an API to train a new tokenization model. Basically three steps
are necessary to train it:
<itemizedlist>
<listitem>
<para>The application must open a sample data stream</para>
</listitem>
<listitem>
<para>Call the TokenizerME.train method</para>
</listitem>
<listitem>
<para>Save the TokenizerModel to a file or directly use it</para>
</listitem>
</itemizedlist>
The following sample code illustrates these steps:
<programlisting language="java">
<![CDATA[
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-token.train"),
StandardCharsets.UTF_8);
ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);
TokenizerModel model;
try {
model = TokenizerME.train("en", sampleStream, true, TrainingParameters.defaultParams());
}
finally {
sampleStream.close();
}
// Serialize the model; try-with-resources closes the stream in all cases.
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
  model.serialize(modelOut);
}]]>
</programlisting>
</para>
</section>
</section>
<section id="tools.tokenizer.detokenizing">
<title>Detokenizing</title>
<para>
Detokenizing is simply the opposite of tokenization: the original non-tokenized string is
reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization
of training data for the tokenizer. It can also be used to undo the tokenization of the output of such a trained
tokenizer. The implementation is strictly rule based and defines how tokens should be attached
to a sentence-wise character sequence.
</para>
<para>
The rule dictionary assigns to every token an operation which describes how it should be attached
to one continuous character sequence.
</para>
<para>
The following rules can be assigned to a token:
<itemizedlist>
<listitem>
<para>MERGE_TO_LEFT - Merges the token to the left side.</para>
</listitem>
<listitem>
<para>MERGE_TO_RIGHT - Merges the token to the right side.</para>
</listitem>
<listitem>
<para>MERGE_BOTH - Merges the token to both sides, attaching it directly
to the previous and the next token.</para>
</listitem>
<listitem>
<para>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence
and to the left side on second occurrence.</para>
</listitem>
</itemizedlist>
The following sample illustrates how the detokenizer works with a small
rule dictionary (illustration format, not the XML data format):
<programlisting>
<![CDATA[
. MERGE_TO_LEFT
" RIGHT_LEFT_MATCHING]]>
</programlisting>
The dictionary should be used to detokenize the following whitespace-tokenized sentence:
<programlisting>
<![CDATA[
He said " This is a test " .]]>
</programlisting>
The tokens would get these tags based on the dictionary:
<programlisting>
<![CDATA[
He -> NO_OPERATION
said -> NO_OPERATION
" -> MERGE_TO_RIGHT
This -> NO_OPERATION
is -> NO_OPERATION
a -> NO_OPERATION
test -> NO_OPERATION
" -> MERGE_TO_LEFT
. -> MERGE_TO_LEFT]]>
</programlisting>
That will result in the following character sequence:
<programlisting>
<![CDATA[
He said "This is a test".]]>
</programlisting>
</para>
<section id="tools.tokenizer.detokenizing.api">
<title>Detokenizing API</title>
<para>
The Detokenizer can be used to detokenize tokens into a String.
To instantiate the DictionaryDetokenizer (a rule-based detokenizer),
a DetokenizationDictionary (the rule dictionary) must be created first.
The following code sample shows how a rule dictionary can be loaded.
<programlisting language="java">
<![CDATA[
DetokenizationDictionary dict;

// Load the rule dictionary inside try-with-resources so the stream is closed.
try (InputStream dictIn = new FileInputStream("latin-detokenizer.xml")) {
  dict = new DetokenizationDictionary(dictIn);
}]]>
</programlisting>
After the rule dictionary is loaded, the DictionaryDetokenizer can be instantiated.
<programlisting language="java">
<![CDATA[
Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]>
</programlisting>
The detokenizer offers two detokenize methods. The first detokenizes the input tokens into a String.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."};
String sentence = detokenizer.detokenize(tokens, null);
Assert.assertEquals("A co-worker helped.", sentence);]]>
</programlisting>
Tokens which are connected without a space in between can be separated by a split marker.
<programlisting language="java">
<![CDATA[
String sentence = detokenizer.detokenize(tokens, "<SPLIT>");
Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]>
</programlisting>
The API also offers a method which simply returns the detokenization operation for each token in the input array.
<programlisting language="java">
<![CDATA[
DetokenizationOperation[] operations = detokenizer.detokenize(tokens);
for (DetokenizationOperation operation : operations) {
System.out.println(operation);
}]]>
</programlisting>
Output:
<programlisting>
<![CDATA[
NO_OPERATION
NO_OPERATION
MERGE_BOTH
NO_OPERATION
NO_OPERATION
MERGE_TO_LEFT]]>
</programlisting>
</para>
</section>
<section id="tools.tokenizer.detokenizing.dict">
<title>Detokenizer Dictionary</title>
<para>
The DetokenizationDictionary is the rule dictionary used by the detokenizer.
It is created from two parallel arrays:
tokens - an array of tokens that should be detokenized according to an operation;
operations - an array of operations, where the operation at a given index
applies to the token at the same index.
The following code sample shows how a rule dictionary can be created.
<programlisting language="java">
<![CDATA[
String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};

Operation[] operations = new Operation[]{
    Operation.MERGE_TO_LEFT,
    Operation.MERGE_TO_LEFT,
    Operation.MERGE_TO_RIGHT,
    Operation.MERGE_TO_LEFT,
    Operation.RIGHT_LEFT_MATCHING,
    Operation.MERGE_BOTH};

DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
</programlisting>
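Assuming the rules above, the resulting dictionary could then back a
DictionaryDetokenizer as shown in the Detokenizing API section; a minimal sketch:
<programlisting language="java">
<![CDATA[
Detokenizer detokenizer = new DictionaryDetokenizer(dict);

// The first quote merges to the right, "!" and the second quote merge to the left.
String text = detokenizer.detokenize(new String[]{"\"", "Hi", "!", "\""}, null);
// text is now: "Hi!" (quotes included)]]>
</programlisting>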
</para>
</section>
</section>
</chapter>