| <?xml version="1.0" encoding="UTF-8"?>
|
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" |
| "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [ |
| <!ENTITY imgroot "./images/" >
|
| <!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod">
|
| %xinclude; |
| ]> |
| <!--
|
| Licensed to the Apache Software Foundation (ASF) under one
|
| or more contributor license agreements. See the NOTICE file
|
| distributed with this work for additional information
|
| regarding copyright ownership. The ASF licenses this file
|
| to you under the Apache License, Version 2.0 (the
|
| "License"); you may not use this file except in compliance
|
| with the License. You may obtain a copy of the License at
|
|
|
| http://www.apache.org/licenses/LICENSE-2.0
|
|
|
| Unless required by applicable law or agreed to in writing,
|
| software distributed under the License is distributed on an
|
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
| KIND, either express or implied. See the License for the
|
| specific language governing permissions and limitations
|
| under the License.
|
| -->
|
|
|
| <book lang="en">
|
|
|
| <title>
|
| Apache UIMA Dictionary Annotator Documentation
|
| </title>
|
|
|
| <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
|
| href="../../../SandboxDocs/src/docbook/book_info.xml" />
|
|
|
| <preface>
|
| <title>Introduction</title>
|
| <para>
|
| The DictionaryAnnotator is an
|
| Apache UIMA analysis engine that annotates words based on
|
| dictionary entries. For each word in the document text
|
| that is available in the dictionary a new annotation is created.
|
|
|
| The annotator can be configured with one or more independent
|
| dictionaries. The dictionaries can easily be created with the
|
| dictionary creator command line tooling. For advanced usage
|
| of the annotator the matching can also be improved by specifying
|
| multi word capabilities, match input type properties and input type
|
| filter settings.
|
| </para>
|
| </preface>
|
|
|
| <chapter id="sandbox.dictAnnotator.processingOverview">
|
| <title>Processing Overview</title>
|
| <para>
|
| To use the DictionaryAnnotator, a dictionary must first be
|
| created, because the annotator does not ship with any dictionaries.
|
| The creation of a dictionary is very simple when using the
|
| dictionary creator command line tooling. The tooling takes as input a list of
|
| words that should be added to the dictionary.
|
| The output of the dictionary creator is the created dictionary as an XML file,
|
| which can be used to configure the annotator. For each dictionary additional
|
| meta data like the annotation output type for the created
|
| annotation can be set. The dictionary and the DictionaryAnnotator can be
|
| configured to work with single word dictionary entries like "Apache" or with
|
| multi word entries like "Apache UIMA".
|
| </para>
|
| <para>
|
| After the annotator is configured with the created dictionary the lookup
|
| strategy settings must be defined. The dictionary lookup inside the annotator
|
| works with tokens. A token is a word or an arbitrary text fragment that is used for
|
| the dictionary lookup. If a token matches a dictionary entry, an annotation is created.
|
| The kind of tokens that are used for the lookup can be configured and enhanced with
|
| filter capabilities. To improve the dictionary lookup it is recommended that the
|
| tokenization for the dictionary entries and the tokenization for the document text are the same.
|
| This can be achieved when using the dictionary creator with some advanced settings.
|
| </para>
|
| <para>
|
| During the annotator processing for each token in the document text
|
| that is available in the dictionary a new annotation with the dictionary
|
| output type is created. These annotations can be used in a succeeding step to do
|
| some further processing.
|
| </para>
|
| </chapter>
|
|
|
| <chapter id="sandbox.dictAnnotator.dictionaryCreation">
|
| <title>Dictionary Creation</title>
|
| <para>
|
| To automatically create a dictionary, the DictionaryCreator command line tooling is provided.
|
| </para>
|
| <section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryCreator">
|
| <title>Dictionary Creator</title>
|
| <para>
|
| The DictionaryCreator command line tool should be used to create the
|
| DictionaryAnnotator dictionaries. The input for the DictionaryCreator
|
| is a text file that contains the dictionary entries, one entry per line. The output
|
| is the created dictionary as XML file.
|
| </para>
|
| <para>
|
| The usage below shows all possible command line parameters.
|
| <programlisting><![CDATA[java
|
| -cp uimaj-an-dictionary.jar
|
| org.apache.uima.annotator.dict_annot.dictionary.impl.DictionaryCreator
|
| -input <InputFile>
|
| -encoding <InputFileEncoding>
|
| -output <OutputFile>
|
| [-tokenizer <TokenizerPearFile> -tokenType <tokenType>]
|
| [-separator <separatorChar>]
|
| [-lang <dictionaryLanguage>]]]></programlisting>
|
| </para>
|
| <para>
|
| When just using the mandatory settings the input content for the dictionary
|
| is tokenized/separated by using the whitespace character. This means that if
|
| the line contains a whitespace character as in "Apache UIMA"
|
| the dictionary entry is treated as a multi word entry where the multi word
|
| consists of the two tokens "Apache" and "UIMA". If the line just contains "DictionaryAnnotator"
|
| the dictionary entry is treated as a single word entry and has only one token
|
| called "DictionaryAnnotator".
|
| </para>
|
| <para>
|
| A sample XML dictionary file is shown below.
|
| <programlisting><![CDATA[<dictionary
|
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
| xsi:noNamespaceSchemaLocation="dictionary.xsd">
|
| <typeCollection>
|
| <dictionaryMetaData
|
| caseNormalization="true"
|
| multiWordEntries="true"
|
| multiWordSeparator=" "/>
|
| <typeDescription>
|
| <typeName> ADD DICTIONARY OUTPUT TYPE HERE</typeName>
|
| </typeDescription>
|
| <entries>
|
| <entry>
|
| <key>DictionaryAnnotator</key>
|
| </entry>
|
| <entry>
|
| <key>Apache UIMA</key>
|
| </entry>
|
| </entries>
|
| </typeCollection>
|
| </dictionary>]]></programlisting>
|
| </para>
|
| <para>
|
| In addition to the default creation, the DictionaryCreator can be configured with
|
| additional parameters.
|
| </para>
|
| <para>
|
| These are:
|
| <itemizedlist>
|
| <listitem>
|
| <para>
|
| <code>tokenizer &lt;TokenizerPearFile&gt;</code> -
|
| To use an Apache UIMA tokenizer annotator PEAR that tokenizes
|
| the input instead of the simple whitespace tokenization that is done
|
| by default. When using a special tokenizer
|
| the <code>tokenType &lt;tokenType&gt;</code> parameter must also be set.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>tokenType &lt;tokenType&gt;</code> -
|
| Specifies the token type to get the tokens created by the tokenizer.
|
| These tokens are used to create the single or multi word dictionary entries
|
| for each line of the input.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>lang &lt;languageCode&gt;</code> -
|
| In some cases it is necessary to specify the language for the created dictionary
|
| and for the used tokenization.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>separator &lt;separatorChar&gt;</code> -
|
| If no special tokenizer is used for the tokenization of the input dictionary content,
|
| by default the whitespace character is used to tokenize the content. If another
|
| separator character should be used instead, it can be specified by using this parameter.
|
| </para>
|
| </listitem>
|
| </itemizedlist>
|
| </para>
|
| <para>
|
| After the dictionary is created, it is necessary to update the created dictionary
|
| with some additional meta data. The most important one that must be set is the
|
| <code>typeName</code> entry. The <code>typeName</code> entry after the creation looks like
|
| <code>&lt;typeName&gt; ADD DICTIONARY OUTPUT TYPE HERE&lt;/typeName&gt;</code> and must
|
| be updated with the UIMA type that should be used if the DictionaryAnnotator creates
|
| an annotation for a word based on this dictionary. For more details about the other
|
| meta data entries of the dictionary, please refer to
|
| <xref linkend="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat"/>.
|
| </para>
|
| </section>
|
|
|
| <section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat">
|
| <title>Dictionary XML Format</title>
|
| <para>
|
| The Dictionary XML Format is shown with an example below:
|
| </para>
|
| <para>
|
| <programlisting><![CDATA[<dictionary
|
| xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
| xsi:noNamespaceSchemaLocation="dictionary.xsd">
|
| <typeCollection>
|
| <dictionaryMetaData
|
| caseNormalization="true"
|
| multiWordEntries="true"
|
| multiWordSeparator=" "/>
|
| <languageId>en</languageId>
|
| <typeDescription>
|
| <typeName>org.apache.uima.DictionaryEntry</typeName>
|
| </typeDescription>
|
| <entries>
|
| <entry>
|
| <key>DictionaryAnnotator</key>
|
| </entry>
|
| <entry>
|
| <key>Apache UIMA</key>
|
| </entry>
|
| </entries>
|
| </typeCollection>
|
| </dictionary>]]></programlisting>
|
| </para>
|
| <para>
|
| The <code>&lt;dictionaryMetaData&gt;</code> element specifies how the dictionary is used
|
| inside the DictionaryAnnotator. The attributes for the element are:
|
| <itemizedlist>
|
| <listitem>
|
| <para>
|
| <code>caseNormalization</code> -
|
| If this parameter is set to <code>true</code> all dictionary entries are treated
|
| case normalized. This means that the dictionary matching is not case sensitive.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>multiWordSeparator</code> -
|
| Specifies the multi word separator character that is used in the XML document
|
| for multi words. If the DictionaryCreator creates the dictionary files this is by default
|
| the "|" character.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>multiWordEntries</code> -
|
| If this parameter is <code>true</code> the dictionary is treated as
|
| multi word dictionary. This means that dictionary entries that are separated by
|
| the <code>multiWordSeparator</code> are treated as multi word entries. So for example
|
| "Apache|UIMA" is treated as multi word entry and the document text must
|
| have after the tokenization two tokens "Apache" and "UIMA" to match the dictionary
|
| entry.
|
| </para>
|
| </listitem>
|
| </itemizedlist>
|
| </para>
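|
| <para>
|
| As a minimal sketch of how these attributes work together, the fragment below combines a
|
| <code>&lt;dictionaryMetaData&gt;</code> element that uses "|" as the multi word separator with a
|
| matching multi word entry. The attribute values are assumptions chosen for illustration
|
| and are not taken from a shipped dictionary.
|
| <programlisting><![CDATA[<!-- illustrative values only -->
|
| <dictionaryMetaData
|
| caseNormalization="true"
|
| multiWordEntries="true"
|
| multiWordSeparator="|"/>
|
| <!-- other elements omitted -->
|
| <entry>
|
| <key>Apache|UIMA</key>
|
| </entry>]]></programlisting>
|
| With these settings the entry is expected to match the two consecutive tokens "Apache" and
|
| "UIMA", and because <code>caseNormalization</code> is <code>true</code> the match should
|
| not depend on upper or lower case.
|
| </para>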
|
| <para>
|
| The <code>&lt;languageId&gt;</code> element specifies the language of the current dictionary if all
|
| entries have the same language. This setting is optional and can be omitted.
|
| </para>
|
| <para>
|
| The <code>&lt;typeName&gt;</code> element specifies the output type that is used if
|
| an annotation is created for a dictionary entry.
|
| </para>
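|
| <para>
|
| As with any UIMA annotation type, the output type presumably has to be declared in a type
|
| system that is visible to the annotator. The following is a sketch of such a declaration in a
|
| UIMA type system descriptor. It assumes the example type name
|
| <code>org.apache.uima.DictionaryEntry</code> from the listing above and assumes
|
| <code>uima.tcas.Annotation</code> as the supertype; the description text is made up for
|
| this example.
|
| <programlisting><![CDATA[<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
|
| <types>
|
| <typeDescription>
|
| <!-- assumed output type; must equal the dictionary's typeName value -->
|
| <name>org.apache.uima.DictionaryEntry</name>
|
| <description>Annotation created for a dictionary match</description>
|
| <supertypeName>uima.tcas.Annotation</supertypeName>
|
| </typeDescription>
|
| </types>
|
| </typeSystemDescription>]]></programlisting>
|
| </para>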
|
| <para>
|
| The <code>&lt;key&gt;</code> elements specify the dictionary entries. Each entry
|
| has its own <code>&lt;key&gt;</code> element.
|
| </para>
|
| </section>
|
| </chapter>
|
|
|
| <chapter id="sandbox.dictAnnotator.annotatorConfiguration">
|
| <title>Annotator Configuration</title>
|
|
|
| <para>
|
| To use the DictionaryAnnotator it must be configured with at least one dictionary
|
| and with the input match type settings - the tokens - that the
|
| annotator will use to do the lookup. In addition to these mandatory settings
|
| it is possible to define input match type filters to filter the used annotations
|
| before they are used for the lookup. The following paragraphs will
|
| explain in detail how the configuration is done.
|
| </para>
|
| <section id="sandbox.dictAnnotator.annotatorConfiguration.DictionaryFiles">
|
| <title>Dictionary Files</title>
|
| <para>
|
| To specify the annotator dictionary files there is a configuration parameter
|
| definition in the annotator descriptor that looks like:
|
| </para>
|
| <para>
|
| <programlisting><![CDATA[<configurationParameter>
|
| <name>DictionaryFiles</name>
|
| <description>
|
| list of dictionary files to configure the annotator
|
| </description>
|
| <type>String</type>
|
| <multiValued>true</multiValued>
|
| <mandatory>true</mandatory>
|
| </configurationParameter>]]></programlisting>
|
| </para>
|
| <para>
|
| This parameter is mandatory and multi valued. This means that the setting must
|
| be available and one or more dictionary files can be specified with the same parameter.
|
| A sample setting for two dictionary files can look like:
|
| </para>
|
| <para>
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>DictionaryFiles</name>
|
| <value>
|
| <array>
|
| <string>dictionary1.xml</string>
|
| <string>http://localhost/mydict/dictionary.xml</string>
|
| </array>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>
|
| <para>
|
| The specified dictionary file names must be available in the classpath or in the UIMA datapath.
|
| Additionally it is possible to specify an HTTP URL to load the dictionary file.
|
| </para>
|
| </section>
|
|
|
| <section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType">
|
| <title>Input Match Type</title>
|
| <para>
|
| The <code>InputMatchType</code> parameter defines the annotation type that is used
|
| for the dictionary lookup. All annotations of type <code>InputMatchType</code> are used for the
|
| lookup in the dictionary. In most cases this type should be the output type of the tokenizer
|
| annotator component. If the dictionary was created by using the DictionaryCreator configured with
|
| a tokenizer, it is recommended that the same tokenizer is also used in the annotator flow. Beyond
|
| that the <code>InputMatchType</code> should be the same as the tokenType used for the
|
| dictionary creation.
|
| </para>
|
| <para>
|
| The parameter that defines the input match type is:
|
| <programlisting><![CDATA[<configurationParameter>
|
| <name>InputMatchType</name>
|
| <description></description>
|
| <type>String</type>
|
| <multiValued>false</multiValued>
|
| <mandatory>true</mandatory>
|
| </configurationParameter>]]></programlisting>
|
| </para>
|
| <para>
|
| The parameter setting is mandatory and single valued. A sample setting for
|
| the <code>InputMatchType</code> looks like:
|
| </para>
|
| <para>
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>InputMatchType</name>
|
| <value>
|
| <string>org.apache.uima.TokenAnnotation</string>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>
|
| <section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType.FeaturePath">
|
| <title>Input Match Type Feature Path</title>
|
| <para>
|
| In some special cases it may be necessary to use a feature value or a
|
| featurePath value of the <code>InputMatchType</code> for the dictionary lookup. In that case
|
| not the covered text of the <code>InputMatchType</code> annotation is used for the lookup but the
|
| specified feature or featurePath value.
|
| </para>
|
| <para>
|
| To define a feature or featurePath that is used for the lookup the following
|
| parameter must be used:
|
| <programlisting><![CDATA[<configurationParameter>
|
| <name>InputMatchFeaturePath</name>
|
| <description></description>
|
| <type>String</type>
|
| <multiValued>false</multiValued>
|
| <mandatory>false</mandatory>
|
| </configurationParameter>]]></programlisting>
|
| </para>
|
| <para>
|
| The parameter is optional. However, if the parameter
|
| is used, the defined feature or featurePath must be valid for the <code>InputMatchType</code>.
|
| A sample configuration with a feature called <code>baseFormToken</code> is shown below:
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>InputMatchFeaturePath</name>
|
| <value>
|
| <string>baseFormToken</string>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>
|
| <para>
|
| If a featurePath is specified the path separator for the feature is "/".
|
| </para>
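|
| <para>
|
| A sketch of a featurePath setting is shown below. The feature names <code>lemma</code> and
|
| <code>baseForm</code> are hypothetical: the example assumes that the
|
| <code>InputMatchType</code> has a feature <code>lemma</code> whose value is a feature
|
| structure with a String feature <code>baseForm</code>.
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>InputMatchFeaturePath</name>
|
| <value>
|
| <!-- hypothetical features, separated by "/" -->
|
| <string>lemma/baseForm</string>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>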
|
| </section>
|
| </section>
|
|
|
| <section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenFilter">
|
| <title>Input Match Type Filters</title>
|
| <para>
|
| If not all <code>InputMatchType</code> annotations should be used for the dictionary lookup it
|
| is possible to define filters to filter the used annotations. To define a filter three
|
| settings are necessary. The first one is the <code>InputMatchFilterFeaturePath</code>
|
| that specifies the feature or featurePath that should be used for the filtering. The
|
| second parameter is the <code>FilterConditionOperator</code> that defines the filter condition
|
| operator. The last parameter is <code>FilterConditionValue</code> that defines the condition
|
| value for the comparison.
|
| </para>
|
| <para>
|
| The parameter definition for all three parameters looks like:
|
| <programlisting><![CDATA[<configurationParameter>
|
| <name>InputMatchFilterFeaturePath</name>
|
| <description></description>
|
| <type>String</type>
|
| <multiValued>false</multiValued>
|
| <mandatory>false</mandatory>
|
| </configurationParameter>
|
|
|
| <configurationParameter>
|
| <name>FilterConditionOperator</name>
|
| <description></description>
|
| <type>String</type>
|
| <multiValued>false</multiValued>
|
| <mandatory>false</mandatory>
|
| </configurationParameter>
|
|
|
| <configurationParameter>
|
| <name>FilterConditionValue</name>
|
| <description></description>
|
| <type>String</type>
|
| <multiValued>false</multiValued>
|
| <mandatory>false</mandatory>
|
| </configurationParameter>]]></programlisting>
|
| </para>
|
| <para>
|
| For the <code>InputMatchFilterFeaturePath</code> the same rules apply as for the
|
| <code>InputMatchFeaturePath</code>. The specified feature or featurePath must be valid
|
| for the <code>InputMatchType</code> definition. If a featurePath is specified, the features
|
| are separated by "/".
|
| </para>
|
| <para>
|
| The value for the <code>FilterConditionOperator</code> can be one of:
|
| <itemizedlist>
|
| <listitem>
|
| <para>
|
| <code>NULL</code> -
|
| <code>InputMatchFilterFeaturePath</code> value must be NULL. No <code>FilterConditionValue</code>
|
| needs to be specified.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>NOT_NULL</code> -
|
| <code>InputMatchFilterFeaturePath</code> value must be set, that is, it must not be NULL. No
|
| <code>FilterConditionValue</code> needs to be specified.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>EQUALS</code> -
|
| <code>InputMatchFilterFeaturePath</code> value must be equal to
|
| the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>NOT_EQUALS</code> -
|
| <code>InputMatchFilterFeaturePath</code> value is not equal
|
| to the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>LESS</code> -
|
| <code>InputMatchFilterFeaturePath</code> value is less than
|
| the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>LESS_EQ</code> -
|
| <code>InputMatchFilterFeaturePath</code> value is less than or equal to
|
| the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>GREATER</code> -
|
| <code>InputMatchFilterFeaturePath</code> value is greater than
|
| the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| <listitem>
|
| <para>
|
| <code>GREATER_EQ</code> -
|
| <code>InputMatchFilterFeaturePath</code> value is greater than or equal to
|
| the <code>FilterConditionValue</code>.
|
| </para>
|
| </listitem>
|
| </itemizedlist>
|
| </para>
|
| <para>
|
| A sample configuration for a filter that only uses noun tokens for the dictionary lookup
|
| is shown below:
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>InputMatchFilterFeaturePath</name>
|
| <value>
|
| <string>partOfSpeach</string>
|
| </value>
|
| </nameValuePair>
|
|
|
| <nameValuePair>
|
| <name>FilterConditionOperator</name>
|
| <value>
|
| <string>EQUALS</string>
|
| </value>
|
| </nameValuePair>
|
|
|
| <nameValuePair>
|
| <name>FilterConditionValue</name>
|
| <value>
|
| <string>noun</string>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>
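|
| <para>
|
| For the <code>NULL</code> and <code>NOT_NULL</code> operators only two of the three settings
|
| are needed. The sketch below keeps only those tokens for the lookup that have a value for a
|
| given feature. It reuses the <code>baseFormToken</code> feature name from the earlier
|
| <code>InputMatchFeaturePath</code> sample and assumes that this feature exists on the
|
| <code>InputMatchType</code>.
|
| <programlisting><![CDATA[<nameValuePair>
|
| <name>InputMatchFilterFeaturePath</name>
|
| <value>
|
| <!-- assumed feature on the InputMatchType -->
|
| <string>baseFormToken</string>
|
| </value>
|
| </nameValuePair>
|
|
|
| <nameValuePair>
|
| <name>FilterConditionOperator</name>
|
| <value>
|
| <string>NOT_NULL</string>
|
| </value>
|
| </nameValuePair>]]></programlisting>
|
| </para>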
|
| </section>
|
| </chapter>
|
| </book> |