| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| ]> |
| <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE file distributed with this work for additional |
| information regarding copyright ownership. The ASF licenses this file to |
| you under the Apache License, Version 2.0 (the "License"); you may not use |
| this file except in compliance with the License. You may obtain a copy of |
| the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required |
| by applicable law or agreed to in writing, software distributed under the |
| License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| OF ANY KIND, either express or implied. See the License for the specific |
| language governing permissions and limitations under the License. --> |
| |
| <chapter id="tools.tokenizer"> |
| |
| <title>Tokenizer</title> |
| |
| <section id="tools.tokenizer.introduction"> |
| <title>Tokenization</title> |
| <para> |
| The OpenNLP Tokenizers segment an input character sequence into |
| tokens. Tokens are usually |
| words, punctuation, numbers, etc. |
| |
| <screen> |
| <![CDATA[ |
| Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. |
| Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group. |
| Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields |
| PLC, was named a director of this British industrial conglomerate. |
| ]]> |
| </screen> |
| |
| The following result shows the individual tokens in a whitespace |
| separated representation. |
| |
| <screen> |
| <![CDATA[ |
| Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . |
| Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group . |
| Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , |
was named a director of this British industrial conglomerate .
| ]]> |
| </screen> |
| |
| OpenNLP offers multiple tokenizer implementations: |
| <itemizedlist> |
| <listitem> |
          <para>Whitespace Tokenizer - A whitespace tokenizer; maximal
            non-whitespace sequences are identified as tokens</para>
| </listitem> |
| <listitem> |
          <para>Simple Tokenizer - A character class tokenizer; sequences of
            the same character class form a token</para>
| </listitem> |
| <listitem> |
          <para>Learnable Tokenizer - A maximum entropy tokenizer; detects
            token boundaries based on a probability model</para>
| </listitem> |
| </itemizedlist> |
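      The difference between the whitespace and the character class approach can be
      sketched in plain Java (a conceptual illustration only, not the actual OpenNLP
      implementation):
      <programlisting language="java">
        <![CDATA[
String text = "Mr. Vinken is 61.";

// Whitespace tokenization: maximal non-whitespace runs become tokens.
String[] whitespaceTokens = text.trim().split("\\s+");
// "Mr." "Vinken" "is" "61."

// Character class tokenization: a token ends whenever the character
// class (letter, digit, other) changes.
List<String> simpleTokens = new ArrayList<>();
StringBuilder current = new StringBuilder();
int prevClass = -1;
for (char c : text.toCharArray()) {
    int cls = Character.isWhitespace(c) ? 0
            : Character.isLetter(c) ? 1
            : Character.isDigit(c) ? 2 : 3;
    if (cls != prevClass && current.length() > 0) {
        simpleTokens.add(current.toString());
        current.setLength(0);
    }
    if (cls != 0) {
        current.append(c);
    }
    prevClass = cls;
}
if (current.length() > 0) {
    simpleTokens.add(current.toString());
}
// "Mr" "." "Vinken" "is" "61" "."]]>
      </programlisting>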
| |
      Most part-of-speech taggers, parsers, and so on work with text
      tokenized in this manner. It is important to ensure that your
      tokenizer produces tokens of the type expected by your downstream
      text processing components.
| </para> |
| |
| <para> |
| With OpenNLP (as with many systems), tokenization is a two-stage |
| process: |
| first, sentence boundaries are identified, then tokens within |
| each |
| sentence are identified. |
| </para> |
| |
| <section id="tools.tokenizer.cmdline"> |
| <title>Tokenizer Tools</title> |
      <para>The easiest way to try out the tokenizers is via the command line
        tools, which are intended only for demonstration and testing.
| </para> |
      <para>There are two tools, one for the Simple Tokenizer and one for
        the learnable tokenizer. There is no command line tool for the Whitespace
        Tokenizer, because its whitespace separated output would be identical
        to the input.</para>
| <para> |
| The following command shows how to use the Simple Tokenizer Tool. |
| |
| <screen> |
| <![CDATA[ |
| $ opennlp SimpleTokenizer]]> |
| </screen> |
        To use the learnable tokenizer, download the English token model from
        our website.
| <screen> |
| <![CDATA[ |
| $ opennlp TokenizerME en-token.bin]]> |
| </screen> |
        To test the tokenizer, copy the sample from above to the console. The
| whitespace separated tokens will be written back to the |
| console. |
| </para> |
| <para> |
| Usually the input is read from a file and written to a file. |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenizerME en-token.bin < article.txt > article-tokenized.txt]]> |
| </screen> |
        The Simple Tokenizer can be used in the same way.
| </para> |
| <para> |
        Since most raw text does not come with sentence boundaries marked,
        it is possible to create a pipe which first performs sentence
        boundary detection and then tokenization. The following sample
        illustrates that.
| <screen> |
| <![CDATA[ |
| $ opennlp SentenceDetector sentdetect.model < article.txt | opennlp TokenizerME tokenize.model | more |
| Loading model ... Loading model ... done |
| done |
| Showa Shell gained 20 to 1,570 and Mitsubishi Oil rose 50 to 1,500. |
| Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 . |
| Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 . |
| Marubeni advanced 11 to 890 . |
| London share prices were bolstered largely by continued gains on Wall Street and technical |
| factors affecting demand for London 's blue-chip stocks . |
| ...etc...]]> |
| </screen> |
| Of course this is all on the command line. Many people use the models |
| directly in their Java code by creating SentenceDetector and |
| Tokenizer objects and calling their methods as appropriate. The |
        following section explains how the tokenizers can be used
        directly from Java.
| </para> |
| </section> |
| |
| <section id="tools.tokenizer.api"> |
| <title>Tokenizer API</title> |
| <para> |
        The tokenizers can be integrated into an application via the defined
        API.
        The shared instance of the WhitespaceTokenizer can be retrieved from the
        static field WhitespaceTokenizer.INSTANCE. The shared instance of the
        SimpleTokenizer can be retrieved in the same way from
        SimpleTokenizer.INSTANCE.
        To instantiate the TokenizerME (the learnable tokenizer) a TokenizerModel
        must be loaded first. The following code sample shows how a model
        can be loaded.
| <programlisting language="java"> |
| <![CDATA[ |
| |
TokenizerModel model;
try (InputStream modelIn = new FileInputStream("en-token.bin")) {
    model = new TokenizerModel(modelIn);
}]]>
| </programlisting> |
| After the model is loaded the TokenizerME can be instantiated. |
| <programlisting language="java"> |
| <![CDATA[ |
| Tokenizer tokenizer = new TokenizerME(model);]]> |
| </programlisting> |
        The tokenizer offers two tokenize methods, both expecting an input
        String object which contains the untokenized text. If possible it
        should be a sentence, but depending on the training of the learnable
        tokenizer this is not required. The first method returns an array of
        Strings, where each String is one token.
| <programlisting language="java"> |
| <![CDATA[ |
String[] tokens = tokenizer.tokenize("An input sample sentence.");]]>
| </programlisting> |
| The output will be an array with these tokens. |
| <programlisting> |
| <![CDATA[ |
| "An", "input", "sample", "sentence", "."]]> |
| </programlisting> |
        The second method, tokenizePos, returns an array of Spans. Each Span
        contains the begin and end character offsets of the token in the input
        String.
| <programlisting language="java"> |
| <![CDATA[ |
Span[] tokenSpans = tokenizer.tokenizePos("An input sample sentence.");]]>
| </programlisting> |
        The tokenSpans array now contains 5 elements. To get the text for one
        span, call Span.getCoveredText with the input text.
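        For illustration, the mapping from spans to token text can be sketched with
        plain String arithmetic, independent of the Span class (the offsets below are
        what tokenizePos would be expected to produce for this sentence):
        <programlisting language="java">
          <![CDATA[
String text = "An input sample sentence.";
// Begin/end character offsets of the five tokens in the text above.
int[][] offsets = {{0, 2}, {3, 8}, {9, 15}, {16, 24}, {24, 25}};
for (int[] s : offsets) {
    System.out.println(text.substring(s[0], s[1]));
}]]>
        </programlisting>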
| |
| The TokenizerME is able to output the probabilities for the detected |
        tokens. The getTokenProbabilities method must be called directly
        after one of the tokenize methods has been called.
| <programlisting language="java"> |
| <![CDATA[ |
| TokenizerME tokenizer = ... |
| |
String[] tokens = tokenizer.tokenize(...);
double[] tokenProbs = tokenizer.getTokenProbabilities();]]>
| </programlisting> |
        The tokenProbs array now contains one double value per token. Each
        value is between 0 and 1, where 1 is the highest possible probability
        and 0 the lowest.
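        One common use of these probabilities is to flag uncertain tokens, for
        example for manual review. A minimal sketch (the tokens and probability
        values below are invented for illustration):
        <programlisting language="java">
          <![CDATA[
// Parallel arrays as returned by tokenize() and getTokenProbabilities();
// the values here are made up for illustration.
String[] tokens = {"Mr.", "Vinken", "is", "chairman", "."};
double[] tokenProbs = {0.99, 0.98, 0.99, 0.62, 0.97};

double threshold = 0.9;
for (int i = 0; i < tokens.length; i++) {
    if (tokenProbs[i] < threshold) {
        System.out.println("low confidence: " + tokens[i]);
    }
}]]>
        </programlisting>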
| </para> |
| </section> |
| </section> |
| |
| <section id="tools.tokenizer.training"> |
| <title>Tokenizer Training</title> |
| |
| <section id="tools.tokenizer.training.tool"> |
| <title>Training Tool</title> |
| <para> |
| OpenNLP has a command line tool which is used to train the models |
| available from the model download page on various corpora. The data |
| can be converted to the OpenNLP Tokenizer training format or used directly. |
        The OpenNLP format contains one sentence per line. Tokens are either separated by
        whitespace or by a special &lt;SPLIT&gt; tag. Tokens are split automatically on whitespace,
        and at least one &lt;SPLIT&gt; tag must be present in the training text.

        The following sample shows the sentences from above in the correct format.
| <screen> |
| <![CDATA[ |
| Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>. |
| Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>. |
| Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, |
| was named a nonexecutive director of this British industrial conglomerate<SPLIT>.]]> |
| </screen> |
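        A training line in this format can be reconstructed from an original sentence
        and its token offsets: wherever two tokens are adjacent in the original text a
        &lt;SPLIT&gt; tag is emitted, otherwise a space. The following sketch (not part
        of the OpenNLP API) illustrates this:
        <programlisting language="java">
          <![CDATA[
String text = "Pierre Vinken, 61 years old.";
// Begin/end character offsets of the tokens in the text above.
int[][] offsets = {{0, 6}, {7, 13}, {13, 14}, {15, 17}, {18, 23}, {24, 27}, {27, 28}};

StringBuilder line = new StringBuilder();
for (int i = 0; i < offsets.length; i++) {
    if (i > 0) {
        // Adjacent tokens were split by the tokenizer, not by whitespace.
        line.append(offsets[i - 1][1] == offsets[i][0] ? "<SPLIT>" : " ");
    }
    line.append(text.substring(offsets[i][0], offsets[i][1]));
}
System.out.println(line);
// Pierre Vinken<SPLIT>, 61 years old<SPLIT>.]]>
        </programlisting>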
| Usage of the tool: |
| <screen> |
| <![CDATA[ |
| $ opennlp TokenizerTrainer |
| Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] \ |
| [-alphaNumOpt isAlphaNumOpt] [-params paramsFile] [-iterations num] \ |
| [-cutoff num] -model modelFile -lang language -data sampleData \ |
| [-encoding charsetName] |
| |
| Arguments description: |
| -abbDict path |
| abbreviation dictionary in XML format. |
| -alphaNumOpt isAlphaNumOpt |
| Optimization flag to skip alpha numeric tokens for further tokenization |
| -params paramsFile |
| training parameters file. |
| -iterations num |
| number of training iterations, ignored if -params is used. |
| -cutoff num |
| minimal number of times a feature must be seen, ignored if -params is used. |
| -model modelFile |
| output model file. |
| -lang language |
| language which is being processed. |
| -data sampleData |
| data to be used, usually a file name. |
| -encoding charsetName |
| encoding for reading and writing text, if absent the system default is used.]]> |
| </screen> |
        To train the English tokenizer use the following command:
| <screen> |
| <![CDATA[ |
| $ opennlp TokenizerTrainer -model en-token.bin -alphaNumOpt -lang en -data en-token.train -encoding UTF-8 |
| |
| Indexing events using cutoff of 5 |
| |
| Computing event counts... done. 262271 events |
| Indexing... done. |
| Sorting and merging events... done. Reduced 262271 events to 59060. |
| Done indexing. |
| Incorporating indexed data for training... |
| done. |
| Number of Event Tokens: 59060 |
| Number of Outcomes: 2 |
| Number of Predicates: 15695 |
| ...done. |
| Computing model parameters... |
| Performing 100 iterations. |
| 1: .. loglikelihood=-181792.40419263614 0.9614292087192255 |
| 2: .. loglikelihood=-34208.094253153664 0.9629238459456059 |
| 3: .. loglikelihood=-18784.123872910015 0.9729211388220581 |
| 4: .. loglikelihood=-13246.88162585859 0.9856103038460219 |
| 5: .. loglikelihood=-10209.262670265718 0.9894422181636552 |
| |
| ...<skipping a bunch of iterations>... |
| |
| 95: .. loglikelihood=-769.2107474529454 0.999511955191386 |
| 96: .. loglikelihood=-763.8891914534009 0.999511955191386 |
| 97: .. loglikelihood=-758.6685383254891 0.9995157680414533 |
| 98: .. loglikelihood=-753.5458314695236 0.9995157680414533 |
| 99: .. loglikelihood=-748.5182305519613 0.9995157680414533 |
| 100: .. loglikelihood=-743.5830058068038 0.9995157680414533 |
| Wrote tokenizer model. |
| Path: en-token.bin]]> |
| </screen> |
| </para> |
| </section> |
| <section id="tools.tokenizer.training.api"> |
| <title>Training API</title> |
| <para> |
        The Tokenizer offers an API to train a new tokenization model. Three basic steps
        are necessary to train it:
| <itemizedlist> |
| <listitem> |
| <para>The application must open a sample data stream</para> |
| </listitem> |
| <listitem> |
| <para>Call the TokenizerME.train method</para> |
| </listitem> |
| <listitem> |
| <para>Save the TokenizerModel to a file or directly use it</para> |
| </listitem> |
| </itemizedlist> |
| The following sample code illustrates these steps: |
| <programlisting language="java"> |
| <![CDATA[ |
ObjectStream<String> lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"),
    StandardCharsets.UTF_8);

TokenizerModel model;

try (ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream)) {
  model = TokenizerME.train("en", sampleStream, true, TrainingParameters.defaultParams());
}

try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream(modelFile))) {
  model.serialize(modelOut);
}]]>
| </programlisting> |
| </para> |
| </section> |
| </section> |
| |
| <section id="tools.tokenizer.detokenizing"> |
| <title>Detokenizing</title> |
| <para> |
      Detokenizing is simply the opposite of tokenization: the original non-tokenized string should
      be reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization
      of training data for the tokenizer. It can also be used to undo the tokenization produced by such a trained
      tokenizer. The implementation is strictly rule based and defines how tokens should be attached
      to a sentence-wise character sequence.
| </para> |
| <para> |
      The rule dictionary assigns to every token an operation which describes how
      it should be attached to one continuous character sequence.
| </para> |
| <para> |
| The following rules can be assigned to a token: |
| <itemizedlist> |
| <listitem> |
| <para>MERGE_TO_LEFT - Merges the token to the left side.</para> |
| </listitem> |
| <listitem> |
| <para>MERGE_TO_RIGHT - Merges the token to the right side.</para> |
| </listitem> |
        <listitem>
          <para>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence
            and to the left side on second occurrence.</para>
        </listitem>
        <listitem>
          <para>MERGE_BOTH - Merges the token to both the left and the right side.</para>
        </listitem>
      </itemizedlist>
| |
      The following sample illustrates how the detokenizer works with a small
      rule dictionary (shown in an illustrative format, not the XML data format):
| <programlisting> |
| <![CDATA[ |
| . MERGE_TO_LEFT |
| " RIGHT_LEFT_MATCHING]]> |
| </programlisting> |
| The dictionary should be used to de-tokenize the following whitespace tokenized sentence: |
| <programlisting> |
| <![CDATA[ |
| He said " This is a test " .]]> |
| </programlisting> |
| The tokens would get these tags based on the dictionary: |
| <programlisting> |
| <![CDATA[ |
| He -> NO_OPERATION |
| said -> NO_OPERATION |
| " -> MERGE_TO_RIGHT |
| This -> NO_OPERATION |
| is -> NO_OPERATION |
| a -> NO_OPERATION |
| test -> NO_OPERATION |
| " -> MERGE_TO_LEFT |
| . -> MERGE_TO_LEFT]]> |
| </programlisting> |
| That will result in the following character sequence: |
| <programlisting> |
| <![CDATA[ |
| He said "This is a test".]]> |
| </programlisting> |
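      The merge logic shown above can be written out as follows (a simplified
      illustration only, not the actual DictionaryDetokenizer implementation):
      <programlisting language="java">
        <![CDATA[
String[] tokens = {"He", "said", "\"", "This", "is", "a", "test", "\"", "."};
// Operations after resolving RIGHT_LEFT_MATCHING to a concrete side.
String[] ops = {"NO_OPERATION", "NO_OPERATION", "MERGE_TO_RIGHT",
                "NO_OPERATION", "NO_OPERATION", "NO_OPERATION",
                "NO_OPERATION", "MERGE_TO_LEFT", "MERGE_TO_LEFT"};

StringBuilder sentence = new StringBuilder();
for (int i = 0; i < tokens.length; i++) {
    // A space is inserted unless this token merges to the left or the
    // previous token merges to the right.
    if (i > 0 && !ops[i].equals("MERGE_TO_LEFT")
              && !ops[i - 1].equals("MERGE_TO_RIGHT")) {
        sentence.append(' ');
    }
    sentence.append(tokens[i]);
}
System.out.println(sentence);
// He said "This is a test".]]>
      </programlisting>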
| </para> |
| <section id="tools.tokenizer.detokenizing.api"> |
| <title>Detokenizing API</title> |
| <para> |
        The Detokenizer can be used to detokenize tokens back into a String.
        To instantiate the DictionaryDetokenizer (the rule based detokenizer)
        a DetokenizationDictionary (the rule dictionary) must be created first.
        The following code sample shows how a rule dictionary can be loaded.
| <programlisting language="java"> |
| <![CDATA[ |
DetokenizationDictionary dict;
try (InputStream dictIn = new FileInputStream("latin-detokenizer.xml")) {
  dict = new DetokenizationDictionary(dictIn);
}]]>
| </programlisting> |
        After the rule dictionary is loaded the DictionaryDetokenizer can be instantiated.
| <programlisting language="java"> |
| <![CDATA[ |
| Detokenizer detokenizer = new DictionaryDetokenizer(dict);]]> |
| </programlisting> |
        The detokenizer offers two detokenize methods; the first detokenizes the input tokens into a String.
| <programlisting language="java"> |
| <![CDATA[ |
| String[] tokens = new String[]{"A", "co", "-", "worker", "helped", "."}; |
| String sentence = detokenizer.detokenize(tokens, null); |
| Assert.assertEquals("A co-worker helped.", sentence);]]> |
| </programlisting> |
        Tokens which are connected without a space in between can be separated by a split marker.
| <programlisting language="java"> |
| <![CDATA[ |
| String sentence = detokenizer.detokenize(tokens, "<SPLIT>"); |
| Assert.assertEquals("A co<SPLIT>-<SPLIT>worker helped<SPLIT>.", sentence);]]> |
| </programlisting> |
        The API also offers a method which simply returns the detokenization operations for the tokens in the input array.
| <programlisting language="java"> |
| <![CDATA[ |
| DetokenizationOperation[] operations = detokenizer.detokenize(tokens); |
| for (DetokenizationOperation operation : operations) { |
| System.out.println(operation); |
| }]]> |
| </programlisting> |
| Output: |
| <programlisting> |
| <![CDATA[ |
| NO_OPERATION |
| NO_OPERATION |
| MERGE_BOTH |
| NO_OPERATION |
| NO_OPERATION |
| MERGE_TO_LEFT]]> |
| </programlisting> |
| </para> |
| </section> |
| <section id="tools.tokenizer.detokenizing.dict"> |
| <title>Detokenizer Dictionary</title> |
| <para> |
        The DetokenizationDictionary is the rule dictionary used by the detokenizer.
        It is created from two parallel arrays:
        <itemizedlist>
          <listitem>
            <para>tokens - an array of tokens that should be detokenized according to an operation.</para>
          </listitem>
          <listitem>
            <para>operations - an array of operations, one per token, which specifies how the
              corresponding token should be attached.</para>
          </listitem>
        </itemizedlist>
        The following code sample shows how a rule dictionary can be created.
| <programlisting language="java"> |
| <![CDATA[ |
// Operation is the nested enum DetokenizationDictionary.Operation
String[] tokens = new String[]{".", "!", "(", ")", "\"", "-"};
Operation[] operations = new Operation[]{
    Operation.MERGE_TO_LEFT,
    Operation.MERGE_TO_LEFT,
    Operation.MERGE_TO_RIGHT,
    Operation.MERGE_TO_LEFT,
    Operation.RIGHT_LEFT_MATCHING,
    Operation.MERGE_BOTH};
DetokenizationDictionary dict = new DetokenizationDictionary(tokens, operations);]]>
| </programlisting> |
| </para> |
| </section> |
| </section> |
| </chapter> |