 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
 <!ENTITY imgroot "./images/" >
 <!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod">
   %xinclude;
 ]>
 <!--
 	Licensed to the Apache Software Foundation (ASF) under one
 	or more contributor license agreements.  See the NOTICE file
 	distributed with this work for additional information
 	regarding copyright ownership.  The ASF licenses this file
 	to you under the Apache License, Version 2.0 (the
 	"License"); you may not use this file except in compliance
 	with the License.  You may obtain a copy of the License at

 	http://www.apache.org/licenses/LICENSE-2.0

 	Unless required by applicable law or agreed to in writing,
 	software distributed under the License is distributed on an
 	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 	KIND, either express or implied.  See the License for the
 	specific language governing permissions and limitations
 	under the License.
 -->

 <book lang="en">

 	<title>
 		Apache UIMA Dictionary Annotator Documentation
 	</title>

 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
 		href="../../../SandboxDocs/src/docbook/book_info.xml" />

 	<preface>
 		<title>Introduction</title>
 		<para>
 			The DictionaryAnnotator is an
 			Apache UIMA analysis engine that annotates words based on
 			dictionary entries. A new annotation is created for each word
 			in the document text that is found in the dictionary.
 		</para>
 		<para>
 			The annotator can be configured with one or more independent
 			dictionaries. The dictionaries can easily be created with the
 			DictionaryCreator command line tool. For advanced usage,
 			the matching can be refined by specifying
 			multi word capabilities, match input type properties and input type
 			filter settings.
 		</para>
 	</preface>

 	<chapter id="sandbox.dictAnnotator.processingOverview">
 		<title>Processing Overview</title>
 		<para>
 			Before the DictionaryAnnotator can be used, a dictionary must be
 			created, because the annotator does not yet provide any dictionaries.
 			Creating a dictionary is simple with the
 			DictionaryCreator command line tool. The tool takes as input a list
 			of words that should be added to the dictionary.
 			Its output is the created dictionary as an XML file,
 			which can then be used to configure the annotator. For each dictionary, additional
 			meta data such as the output type for the created
 			annotations can be set. The dictionary and the DictionaryAnnotator can be
 			configured to work with single word dictionary entries like "Apache" or with
 			multi word entries like "Apache UIMA".
 		</para>
 		<para>
 			After the annotator is configured with the created dictionary, the lookup
 			strategy settings must be defined. The dictionary lookup inside the annotator
 			works with tokens. A token is a word or an arbitrary text fragment that is used for
 			the dictionary lookup. If a token matches a dictionary entry, an annotation is created.
 			The kind of tokens that are used for the lookup can be configured and narrowed down with
 			filters. To improve the dictionary lookup, it is recommended that the
 			dictionary entries and the document text are tokenized in the same way.
 			This can be achieved by running the DictionaryCreator with a tokenizer and a token type.
 		</para>
 		<para>
 		    During processing, the annotator creates a new annotation with the dictionary
 		    output type for each token in the document text that is found in the
 		    dictionary. These annotations can be used by subsequent processing
 		    steps.
 		</para>
 	</chapter>

 	<chapter id="sandbox.dictAnnotator.dictionaryCreation">
 		<title>Dictionary Creation</title>
 		<para>
 		  To automatically create a dictionary, the DictionaryCreator command line tool is provided.
 		</para>
 		<section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryCreator">
 			<title>Dictionary Creator</title>
 			<para>
 				The DictionaryCreator command line tool should be used to create the
 				DictionaryAnnotator dictionaries. The input for the DictionaryCreator
 				is a text file that contains the dictionary entries, one entry per line. The output
 				is the created dictionary as an XML file.
 		    </para>
 			<para>
 			    The usage below shows all possible command line parameters.
 			<programlisting><![CDATA[java
   -cp uimaj-an-dictionary.jar
   org.apache.uima.annotator.dict_annot.dictionary.impl.DictionaryCreator
   -input <InputFile>
   -encoding <InputFileEncoding>
   -output <OutputFile>
   [-tokenizer <TokenizerPearFile> -tokenType <tokenType>]
   [-separator <separatorChar>]
   [-lang <dictionaryLanguage>]]]></programlisting>
 			</para>
 			<para>
 			    When only the mandatory settings are used, the input content for the dictionary
 			    is tokenized by splitting it at whitespace characters. This means that if
 			    a line contains a whitespace character, as in "Apache UIMA",
 			    the dictionary entry is treated as a multi word entry that
 			    consists of the two tokens "Apache" and "UIMA". If the line contains just "DictionaryAnnotator",
 			    the dictionary entry is treated as a single word entry with only one token,
 			    "DictionaryAnnotator".
 			</para>
 			<para>
 			    A sample XML dictionary file is shown below.
 			   <programlisting><![CDATA[<dictionary
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="dictionary.xsd">
    <typeCollection>
       <dictionaryMetaData
          caseNormalization="true"
          multiWordEntries="true"
          multiWordSeparator=" "/>
       <typeDescription>
          <typeName> ADD DICTIONARY OUTPUT TYPE HERE</typeName>
       </typeDescription>
       <entries>
          <entry>
             <key>DictionaryAnnotator</key>
          </entry>
          <entry>
             <key>Apache UIMA</key>
          </entry>
       </entries>
    </typeCollection>
 </dictionary>]]></programlisting>
 			</para>
 			<para>
 			    In addition to the default creation, the DictionaryCreator can be configured with
 			    additional parameters.
 			</para>
 			<para>
 			    These are:
 			    <itemizedlist>
 						<listitem>
 							<para>
 								<code>tokenizer &lt;TokenizerPearFile></code> -
 								Specifies an Apache UIMA tokenizer annotator PEAR that tokenizes
 								the input instead of the simple whitespace tokenization that is done
 								by default. When a special tokenizer is used,
 								the <code>tokenType &lt;tokenType></code> parameter must also be set.
 							</para>
 						</listitem>
 						<listitem>
 							<para>
 								<code>tokenType &lt;tokenType></code> -
 								Specifies the annotation type of the tokens created by the tokenizer.
 								These tokens are used to create the single or multi word dictionary entries
 								for each line of the input.
 							</para>
 						</listitem>
 						<listitem>
 							<para>
 								<code>lang &lt;languageCode></code> -
 								In some cases it is necessary to specify the language for the created dictionary
 								and for the used tokenization.
 							</para>
 						</listitem>
 						<listitem>
 							<para>
 								<code>separator &lt;separatorChar></code> -
 								If no special tokenizer is used for the tokenization of the input dictionary content,
 								the whitespace character is used by default to tokenize the content. If another
 								separator character should be used instead, it can be specified with this parameter.
 							</para>
 						</listitem>
 			    </itemizedlist>
 			 </para>
 			 <para>
 			   After the dictionary is created, it must be updated
 			   with some additional meta data. The most important entry that must be set is the
 			   <code>typeName</code> entry. After the creation, the <code>typeName</code> entry looks like
 			   <code>&lt;typeName> ADD DICTIONARY OUTPUT TYPE HERE&lt;/typeName></code> and must
 			   be updated with the UIMA type that the DictionaryAnnotator should use when it creates
 			   an annotation for a word based on this dictionary. For more details about the other
 			   meta data entries of the dictionary, please refer to
 			   <xref linkend="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat"/>.
 			 </para>
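 			 <para>
 			   As an illustration only, assume the dictionary should produce annotations of the type
 			   <code>org.apache.uima.DictionaryEntry</code>. This is the type name used in the
 			   dictionary format example of this guide; any annotation type that is defined in your
 			   own type system could be used instead. The edited element would then look like this:
 			   <programlisting><![CDATA[<typeDescription>
    <typeName>org.apache.uima.DictionaryEntry</typeName>
 </typeDescription>]]></programlisting>
 			 </para>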
 		</section>

 		<section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat">
 			<title>Dictionary XML Format</title>
 			<para>
 				The Dictionary XML Format is shown with an example below:
 			</para>
 			<para>
 			   <programlisting><![CDATA[<dictionary
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="dictionary.xsd">
    <typeCollection>
       <dictionaryMetaData
          caseNormalization="true"
          multiWordEntries="true"
          multiWordSeparator=" "/>
       <languageId>en</languageId>
       <typeDescription>
          <typeName>org.apache.uima.DictionaryEntry</typeName>
       </typeDescription>
       <entries>
          <entry>
             <key>DictionaryAnnotator</key>
          </entry>
          <entry>
             <key>Apache UIMA</key>
          </entry>
       </entries>
    </typeCollection>
 </dictionary>]]></programlisting>
 			</para>
 			<para>
 			  The <code>&lt;dictionaryMetaData></code> element specifies how the dictionary is used
 			  inside the DictionaryAnnotator. The attributes for the element are:
 			  <itemizedlist>
 			     <listitem>
 			        <para>
 			           <code>caseNormalization</code> -
 						If this parameter is set to <code>true</code>, all dictionary entries are
 						case normalized. This means that the dictionary matching is not case sensitive.
 					</para>
 				 </listitem>
 			     <listitem>
 			        <para>
 			           <code>multiWordSeparator</code> -
 						Specifies the character that separates the tokens of a multi word entry
 						in the XML document. If the DictionaryCreator creates the dictionary files, this is by default
 						the "|" character.
 					</para>
 				 </listitem>
 			     <listitem>
 			        <para>
 			           <code>multiWordEntries</code> -
 						If this parameter is <code>true</code>, the dictionary is treated as a
 						multi word dictionary. This means that dictionary entries that contain
 						the <code>multiWordSeparator</code> are treated as multi word entries. For example,
 						"Apache|UIMA" is treated as a multi word entry, and after tokenization the document text must
 						contain the two tokens "Apache" and "UIMA" to match the dictionary
 						entry.
 					</para>
 				 </listitem>
 			  </itemizedlist>
 			</para>
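 			<para>
 			   As a sketch of how these attributes work together, the following fragment assumes a dictionary
 			   that uses "|" as the multi word separator. The entries are only illustrative. With these
 			   settings, the first entry is a multi word entry made up of the tokens "Apache" and "UIMA",
 			   the second entry remains a single word entry, and matching is not case sensitive:
 			   <programlisting><![CDATA[<dictionaryMetaData
    caseNormalization="true"
    multiWordEntries="true"
    multiWordSeparator="|"/>

 <entries>
    <entry>
       <key>Apache|UIMA</key>
    </entry>
    <entry>
       <key>DictionaryAnnotator</key>
    </entry>
 </entries>]]></programlisting>
 			</para>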
 			<para>
 			   The <code>&lt;languageId></code> element specifies the language of the current dictionary if all
 			   entries have the same language. This setting is not mandatory and can be omitted.
 			</para>
 			<para>
 			   The <code>&lt;typeName></code> element specifies the output type that is used if
 			   an annotation is created for a dictionary entry.
 			</para>
 			<para>
 			   The <code>&lt;key></code> elements specify the dictionary entries. A separate
 			   <code>&lt;key></code> element is used for each entry.
 			</para>
 		</section>
 	</chapter>

 	<chapter id="sandbox.dictAnnotator.annotatorConfiguration">
 		<title>Annotator Configuration</title>

 		<para>
 		  To use the DictionaryAnnotator, it must be configured with at least one dictionary
 		  and with the input match type (the tokens) that the
 		  annotator will use for the lookup. In addition to these mandatory settings,
 		  it is possible to define input match type filters that filter the annotations
 		  before they are used for the lookup. The following sections
 		  explain in detail how the configuration is done.
 		</para>
 		<section id="sandbox.dictAnnotator.annotatorConfiguration.DictionaryFiles">
 			<title>Dictionary Files</title>
 			<para>
               The annotator dictionary files are specified with a configuration parameter
               that is defined in the annotator descriptor as follows:
 			</para>
 			<para>
 			<programlisting><![CDATA[<configurationParameter>
    <name>DictionaryFiles</name>
    <description>
       list of dictionary files to configure the annotator
    </description>
    <type>String</type>
    <multiValued>true</multiValued>
    <mandatory>true</mandatory>
 </configurationParameter>]]></programlisting>
 			</para>
 			<para>
 			   This parameter is mandatory and multi valued. This means that the setting must
 			   be present and that one or more dictionary files can be specified with the same parameter.
 			   A sample setting for two dictionary files can look like:
 			</para>
 			<para>
 			  <programlisting><![CDATA[<nameValuePair>
    <name>DictionaryFiles</name>
    <value>
       <array>
          <string>dictionary1.xml</string>
          <string>http://localhost/mydict/dictionary.xml</string>
       </array>
    </value>
 </nameValuePair>]]></programlisting>
 			</para>
 			<para>
 			  The specified dictionary files must be available in the classpath or in the UIMA datapath.
 			  Alternatively, an HTTP URL can be specified to load a dictionary file.
 			</para>
 		</section>

 		<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType">
 			<title>Input Match Type</title>
 			<para>
 				The <code>InputMatchType</code> parameter defines the annotation type that is used
 				for the dictionary lookup. All annotations of type <code>InputMatchType</code> are used for the
 				lookup in the dictionary. In most cases this type should be the output type of the tokenizer
 				annotator component. If the dictionary was created with a DictionaryCreator that was configured with
 				a tokenizer, it is recommended that the same tokenizer is also used in the annotator flow. In addition,
 				the <code>InputMatchType</code> should be the same as the tokenType used for the
 				dictionary creation.
 			</para>
 			<para>
 			   The parameter that defines the input match type is:
 			   <programlisting><![CDATA[<configurationParameter>
    <name>InputMatchType</name>
    <description></description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>true</mandatory>
 </configurationParameter>]]></programlisting>
 			</para>
 			<para>
 			   The parameter setting is mandatory and single valued. A sample setting for
 			   the <code>InputMatchType</code> looks like:
 			</para>
 			<para>
 			  <programlisting><![CDATA[<nameValuePair>
    <name>InputMatchType</name>
    <value>
       <string>org.apache.uima.TokenAnnotation</string>
    </value>
 </nameValuePair>]]></programlisting>
 			</para>
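 			<para>
 			   Taken together, the two mandatory settings could appear in the annotator descriptor as
 			   sketched below. The sketch assumes the standard UIMA
 			   <code>&lt;configurationParameterSettings></code> wrapper element and reuses the sample
 			   values from this guide. The file name <code>dictionary1.xml</code> and the token type are
 			   placeholders that must be replaced with your own values.
 			   <programlisting><![CDATA[<configurationParameterSettings>
    <!-- placeholder dictionary file, resolved via classpath or datapath -->
    <nameValuePair>
       <name>DictionaryFiles</name>
       <value>
          <array>
             <string>dictionary1.xml</string>
          </array>
       </value>
    </nameValuePair>
    <!-- placeholder token type, normally the output type of the tokenizer -->
    <nameValuePair>
       <name>InputMatchType</name>
       <value>
          <string>org.apache.uima.TokenAnnotation</string>
       </value>
    </nameValuePair>
 </configurationParameterSettings>]]></programlisting>
 			</para>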
 			<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType.FeaturePath">
 				<title>Input Match Type Feature Path</title>
 				<para>
 					In some special cases it may be necessary to use a feature value or a
 					featurePath value of the <code>InputMatchType</code> for the dictionary lookup. In that case
 					the lookup uses the specified feature or featurePath value instead of the covered text
 					of the <code>InputMatchType</code> annotation.
 				</para>
 				<para>
 				    To define a feature or featurePath that is used for the lookup the following
 				    parameter must be used:
 				    <programlisting><![CDATA[<configurationParameter>
    <name>InputMatchFeaturePath</name>
    <description></description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
 </configurationParameter>]]></programlisting>
 				</para>
 				<para>
 				    The parameter is optional. If it
 				    is used, the defined feature or featurePath must be valid for the <code>InputMatchType</code>.
 				    A sample configuration with a feature called <code>baseFormToken</code> is shown below:
 				    <programlisting><![CDATA[<nameValuePair>
    <name>InputMatchFeaturePath</name>
    <value>
       <string>baseFormToken</string>
    </value>
 </nameValuePair>]]></programlisting>
 				</para>
 				<para>
 				   If a featurePath is specified, the features in the path are separated by "/".
 				</para>
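 				<para>
 				   As a sketch of a featurePath setting, the following fragment assumes a token type that
 				   has a feature <code>lemma</code> whose value is a feature structure with a feature
 				   <code>value</code>. Both feature names are hypothetical and only illustrate the use of
 				   the "/" separator:
 				   <programlisting><![CDATA[<nameValuePair>
    <name>InputMatchFeaturePath</name>
    <value>
       <!-- hypothetical feature names: lemma, value -->
       <string>lemma/value</string>
    </value>
 </nameValuePair>]]></programlisting>
 				</para>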
 			</section>
 		</section>

 		<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenFilter">
 			<title>Input Match Type Filters</title>
 			<para>
                 If not all <code>InputMatchType</code> annotations should be used for the dictionary lookup, it
                 is possible to define a filter that restricts the annotations used. A filter is defined with three
                 settings. The first one is the <code>InputMatchFilterFeaturePath</code>,
                 which specifies the feature or featurePath that should be used for the filtering. The
                 second parameter is the <code>FilterConditionOperator</code>, which defines the filter condition
                 operator. The last parameter is the <code>FilterConditionValue</code>, which defines the condition
                 value for the comparison.
 			</para>
 			<para>
 				The parameter definition for all three parameters looks like:
 		    <programlisting><![CDATA[<configurationParameter>
    <name>InputMatchFilterFeaturePath</name>
    <description></description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
 </configurationParameter>

 <configurationParameter>
    <name>FilterConditionOperator</name>
    <description></description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
 </configurationParameter>

 <configurationParameter>
    <name>FilterConditionValue</name>
    <description></description>
    <type>String</type>
    <multiValued>false</multiValued>
    <mandatory>false</mandatory>
 </configurationParameter>]]></programlisting>
 				</para>
 				<para>
 				    For the <code>InputMatchFilterFeaturePath</code> the same rules apply as for the
 				    <code>InputMatchFeaturePath</code>. The specified feature or featurePath must be valid
 				    for the <code>InputMatchType</code> definition. If a featurePath is specified, the features
 				    are separated by "/".
 				</para>
 				<para>
 				    The value for the <code>FilterConditionOperator</code> can be one of:
 					<itemizedlist>
 					   <listitem>
 					      <para>
 					         <code>NULL</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be NULL. No <code>FilterConditionValue</code>
 							 needs to be specified.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>NOT_NULL</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be set, that is, it must not be NULL. No
 							 <code>FilterConditionValue</code> needs to be specified.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>EQUALS</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be equal to
 							 the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>NOT_EQUALS</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must not be equal
 							 to the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>LESS</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be less than
 							 the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>LESS_EQ</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be less than or equal to
 							 the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>GREATER</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be greater than
 							 the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 					   <listitem>
 					      <para>
 					         <code>GREATER_EQ</code> -
 							 <code>InputMatchFilterFeaturePath</code> value must be greater than or equal to
 							 the <code>FilterConditionValue</code>.
 						  </para>
 				 	   </listitem>
 			        </itemizedlist>
 				</para>
 				<para>
 				   A sample configuration for a filter that uses only noun tokens for the dictionary lookup
 				   is shown below:
 				   <programlisting><![CDATA[<nameValuePair>
    <name>InputMatchFilterFeaturePath</name>
    <value>
       <string>partOfSpeach</string>
    </value>
 </nameValuePair>

 <nameValuePair>
    <name>FilterConditionOperator</name>
    <value>
       <string>EQUALS</string>
    </value>
 </nameValuePair>

 <nameValuePair>
    <name>FilterConditionValue</name>
    <value>
       <string>noun</string>
    </value>
 </nameValuePair>]]></programlisting>
 				</para>
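 				<para>
 				   For the operators that do not compare against a value, a filter could be sketched as
 				   follows. The fragment reuses the <code>partOfSpeach</code> feature name from the sample
 				   above purely for illustration and assumes that only tokens for which this feature is set
 				   should be used for the lookup. No <code>FilterConditionValue</code> is given:
 				   <programlisting><![CDATA[<nameValuePair>
    <name>InputMatchFilterFeaturePath</name>
    <value>
       <string>partOfSpeach</string>
    </value>
 </nameValuePair>

 <nameValuePair>
    <name>FilterConditionOperator</name>
    <value>
       <string>NOT_NULL</string>
    </value>
 </nameValuePair>]]></programlisting>
 				</para>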
 		</section>
 	</chapter>
 </book>
	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
	"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
	<!ENTITY imgroot "./images/" >
	<!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod">
	%xinclude;
	]>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<book lang="en">

	<title>
	Apache UIMA Dictionary Annotator Documentation
	</title>

	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
	href="../../../SandboxDocs/src/docbook/book_info.xml" />

	<preface>
	<title>Introduction</title>
	<para>
	The DictionaryAnnotator is an
	Apache UIMA analysis engine that annotates words based on
	dictionary entries. For each word in the document text
	that is available in the dictionary a new annotation is created.

	The annotator can be configured with one or more independent
	dictionaries. The dictionaries can easily be created with the
	dictionary creator command line tooling. For advanced usage
	of the annotator the matching can also be improved by specifying
	multi word capabilities, match input type properties and input type
	filter settings.
	</para>
	</preface>

	<chapter id="sandbox.dictAnnotator.processingOverview">
	<title>Processing Overview</title>
	<para>
	To use the DictionaryAnnotator at first a dictionary must be
	created because so far the annotator does not provide any dictionaries.
	The creation of a dictionary is very simple when using the
	dictionary creator command line tooling. The tooling takes as input a list
	words that should be added to the dictionary.
	The output of the dictionary creator is the created dictionary as XML file
	and can be used to configure the annotator. For each dictionary additional
	meta data like the annotation output type for the created
	annotation can be set. The dictionary and the DictionaryAnnotator can be
	configured to work with single word dictionary entries like "Apache" or with
	multi word entries like "Apache UIMA".
	</para>
	<para>
	After the annotator is configured with the created dictionary the lookup
	strategy settings must be defined. The dictionary lookup inside the annotator
	works with tokens. A token is a word or an arbitrary text fragment that is used for
	the dictionary lookup. If a token match a dictionary entry an annotation is created.
	The kind of tokens that are used for the lookup can be configured and enhanced with
	filter capabilities. To improve the dictionary lookup it is recommended that the
	tokenization for the dictionary entries and the tokenization for the document text is the same.
	This can be achieved when using the dictionary creator with some advanced settings.
	</para>
	<para>
	During the annotator processing for each token in the document text
	that is available in the dictionary a new annotation with the dictionary
	output type is created. These annotations can be used in a succeeding step to do
	some further processing.
	</para>
	</chapter>

	<chapter id="sandbox.dictAnnotator.dictionaryCreation">
	<title>Dictionary Creation</title>
	<para>
	To automatically create a dictionary, the DictionaryCreator command line tooling is provied.
	</para>
	<section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryCreator">
	<title>Dictionary Creator</title>
	<para>
	The DictionaryCreator command line tool should be used to create the
	DictionaryAnnotator dictionaries. The input for the DictionaryCreator
	is a text file that contains the dictionary entries, one entry per line. The output
	is the created dictionary as XML file.
	</para>
	<para>
	The usage below shows all possible command line parameters.
	<programlisting><![CDATA[java
	-cp uimaj-an-dictionary.jar
	org.apache.uima.annotator.dict_annot.dictionary.impl.DictionaryCreator
	-input <InputFile>
	-encoding <InputFileEncoding>
	-output <OutputFile>
	[-tokenizer <TokenizerPearFile> -tokenType <tokenType>]
	[-separator <separatorChar>]
	[-lang <dictionaryLanguage>]]]></programlisting>
	</para>
	<para>
	When just using the mandatory settings the input content for the dictionary
	is tokenized/separated by using the whitespace character. This means that if
	the line contains a whitespace character as in "Apache UIMA"
	the dictionary entry is treated as multi word entry where the mutli word
	consists of the two tokens "Apache" and "UIMA". If the line just contains "DictionaryAnnotator"
	the dictionary entry in treated as single word entry and has only one token
	called "DictionaryAnnotator".
	</para>
	<para>
	A sample XML dictionary file is shown below.
	<programlisting><![CDATA[<dictionary
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:noNamespaceSchemaLocation="dictionary.xsd">
	<typeCollection>
	<dictionaryMetaData
	caseNormalization="true"
	multiWordEntries="true"
	multiWordSeparator=" "/>
	<typeDescription>
	<typeName> ADD DICTIONARY OUTPUT TYPE HERE</typeName>
	</typeDescription>
	<entries>
	<entry>
	<key>DictionaryAnnotator</key>
	</entry>
	<entry>
	<key>Apache UIMA</key>
	</entry>
	</entries>
	</typeCollection>
	</dictionary>]]></programlisting>
	</para>
	<para>
	In addition to the default creation, the DictionaryCreator can be configured with
	additional parameters.
	</para>
	<para>
	These are:
	<itemizedlist>
	<listitem>
	<para>
	<code>tokenization <TokenizerPearFile></code> -
	To use an Apache UIMA tokenizer annotator PEAR that tokenize
	the input instead of the simple whitespace tokenization that is done
	by default. When using a special tokenizer
	the <code>tokenType <tokenType></code> parameter must also be set.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>tokenType <tokenType></code> -
	Specifies the token type to get the tokens created by the tokenizer.
	These tokens are used to create the single or multi word dictionary entries
	for each line of the input.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>lang <languageCode></code> -
	In some cases it is necessary to specify the language for the created dictionary
	and for the used tokenization.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>separator <separatorChar></code> -
	If no special tokenizer is used for the tokenization of the input dictionary content,
	by default the whitespace character is used to tokenizer the content. If another
	separator character should be used instead, it can be specified by using this parameter.
	</para>
	</listitem>
	</itemizedlist>
	</para>
	<para>
	After the dictionary is created, it is necessary to update the created dictionary
	with some additional meta data. The most important one that must be set is the
	<code>typeName</code> entry. The <code>typeName</code> entry after the creation looks like
	<code><typeName> ADD DICTIONARY OUTPUT TYPE HERE</typeName></code> and must
	be updated with the UIMA type that should be used if the DictionaryAnnotator creates
	an annotation for a word based on this dictionary. For more details about the other
	meta data entries of the dictionary, please refer to
	<xref linkend="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat"/>.
	</para>
	</section>

	<section id="sandbox.dictAnnotator.dictionaryCreation.DictionaryFormat">
	<title>Dictionary XML Format</title>
	<para>
	The Dictionary XML Format is shown with an example below:
	</para>
	<para>
	<programlisting><![CDATA[<dictionary
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:noNamespaceSchemaLocation="dictionary.xsd">
	<typeCollection>
	<dictionaryMetaData
	caseNormalization="true"
	multiWordEntries="true"
	multiWordSeparator=" "/>
	<languageId>en</languageId>
	<typeDescription>
	<typeName>org.apache.uima.DictionaryEntry</typeName>
	</typeDescription>
	<entries>
	<entry>
	<key>DictionaryAnnotator</key>
	</entry>
	<entry>
	<key>Apache UIMA</key>
	</entry>
	</entries>
	</typeCollection>
	</dictionary>]]></programlisting>
	</para>
	<para>
	The <code><dictionaryMetaData></code> element specifies how the dictionary is used
	inside the DictionaryAnnoator. The attributes for the element are:
	<itemizedlist>
	<listitem>
	<para>
	<code>caseNormalization</code> -
	If this parameter is set to <code>true</code> all dictionary entries are treated
	case normalized. This means that the dictionary matching is not case sensitive.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>multiWordSeparator</code> -
	Specifies the multi word separator character that is used in the XML document
	for multi words. If the DictionaryCreator creates the dictionary files this is by default
	the "\|" character.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>multiWordEntries</code> -
	If this parameter is <code>true</code> the dictionary is treated as
	multi word dictionary. This means that dictionary entries that are separated by
	the <code>multiWordSpearator</code> are treated as multi word entries. So for example
	"Apache\|UIMA" is treated as multi word entry and the document text must
	have after the tokenization two tokens "Apache" and "UIMA" to match the dictionary
	entry.
	</para>
	</listitem>
	</itemizedlist>
	</para>
	<para>
	The <code><languageId></code> element specifies the language of the current dictionary if all
	entries have the same language. This settings is not mandatory and can also be omitted.
	content.
	</para>
	<para>
	The <code><typeName></code> element specifies the output type that is used if
	an annotation is created for a dictionary entry.
	</para>
	<para>
	The <code><key></code> elements specifies the dictionary entries. For each entry
	an own <code><key></code> element is used.
	</para>
	</section>
	</chapter>

	<chapter id="sandbox.dictAnnotator.annotatorConfiguration">
	<title>Annotator Configuration</title>

	<para>
	To use the DictionaryAnnotator, it must be configured with at least one dictionary
	and with the input match type, that is, the tokens that the
	annotator uses for the lookup. In addition to these mandatory settings,
	input match type filters can be defined to restrict the annotations
	that are used for the lookup. The following sections
	explain in detail how the configuration is done.
	</para>
	<section id="sandbox.dictAnnotator.annotatorConfiguration.DictionaryFiles">
	<title>Dictionary Files</title>
	<para>
	To specify the annotator dictionary files there is a configuration parameter
	definition in the annotator descriptor that looks like:
	</para>
	<para>
	<programlisting><![CDATA[<configurationParameter>
	<name>DictionaryFiles</name>
	<description>
	list of dictionary files to configure the annotator
	</description>
	<type>String</type>
	<multiValued>true</multiValued>
	<mandatory>true</mandatory>
	</configurationParameter>]]></programlisting>
	</para>
	<para>
	This parameter is mandatory and multi valued. This means that the setting must
	be available and one or more dictionary files can be specified with the same parameter.
	A sample setting for two dictionary files can look like:
	</para>
	<para>
	<programlisting><![CDATA[<nameValuePair>
	<name>DictionaryFiles</name>
	<value>
	<array>
	<string>dictionary1.xml</string>
	<string>http://localhost/mydict/dictionary.xml</string>
	</array>
	</value>
	</nameValuePair>]]></programlisting>
	</para>
	<para>
	The specified dictionary file names must be available in the classpath or in the UIMA datapath.
	Additionally it is possible to specify an HTTP URL to load the dictionary file.
	</para>
	</section>

	<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType">
	<title>Input Match Type</title>
	<para>
	The <code>InputMatchType</code> parameter defines the annotation type that is used
	for the dictionary lookup. All annotations of type <code>InputMatchType</code> are used for the
	lookup in the dictionary. In most cases this type should be the output type of the tokenizer
	annotator component. If the dictionary was created by using the DictionaryCreator configured with
	a tokenizer, it is recommended that the same tokenizer is also used in the annotator flow. Beyond
	that the <code>InputMatchType</code> should be the same as the tokenType used for the
	dictionary creation.
	</para>
	<para>
	The parameter that defines the input match type is:
	<programlisting><![CDATA[<configurationParameter>
	<name>InputMatchType</name>
	<description></description>
	<type>String</type>
	<multiValued>false</multiValued>
	<mandatory>true</mandatory>
	</configurationParameter>]]></programlisting>
	</para>
	<para>
	The parameter setting is mandatory and single valued. A sample setting for
	the <code>InputMatchType</code> looks like:
	</para>
	<para>
	<programlisting><![CDATA[<nameValuePair>
	<name>InputMatchType</name>
	<value>
	<string>org.apache.uima.TokenAnnotation</string>
	</value>
	</nameValuePair>]]></programlisting>
	</para>
	<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenType.FeaturePath">
	<title>Input Match Type Feature Path</title>
	<para>
	In some special cases it may be necessary to use a feature value or a
	featurePath value of the <code>InputMatchType</code> for the dictionary lookup. In that case
	the lookup uses the specified feature or featurePath value instead of the covered text
	of the <code>InputMatchType</code> annotation.
	</para>
	<para>
	To define a feature or featurePath that is used for the lookup the following
	parameter must be used:
	<programlisting><![CDATA[<configurationParameter>
	<name>InputMatchFeaturePath</name>
	<description></description>
	<type>String</type>
	<multiValued>false</multiValued>
	<mandatory>false</mandatory>
	</configurationParameter>]]></programlisting>
	</para>
	<para>
	The parameter is optional. If it is used, the defined feature or featurePath
	must be valid for the <code>InputMatchType</code>.
	A sample configuration with a feature called <code>baseFormToken</code> is shown below:
	<programlisting><![CDATA[<nameValuePair>
	<name>InputMatchFeaturePath</name>
	<value>
	<string>baseFormToken</string>
	</value>
	</nameValuePair>]]></programlisting>
	</para>
	<para>
	If a featurePath is specified, the features in the path are separated by "/".
	</para>
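	<para>
	As an illustration only, a featurePath setting could look like the fragment below. The feature
	names <code>lemma</code> and <code>baseForm</code> are hypothetical and must be replaced by
	features that exist for the <code>InputMatchType</code> in the type system that is actually used.
	With such a path, the lookup value would be read from the <code>baseForm</code> feature of the
	feature structure referenced by the <code>lemma</code> feature:
	<programlisting><![CDATA[<!-- hypothetical feature names, adapt them to your type system -->
	<nameValuePair>
	<name>InputMatchFeaturePath</name>
	<value>
	<string>lemma/baseForm</string>
	</value>
	</nameValuePair>]]></programlisting>
	</para>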
	</section>
	</section>

	<section id="sandbox.dictAnnotator.annotatorConfiguration.InputTokenFilter">
	<title>Input Match Type Filters</title>
	<para>
	If not all <code>InputMatchType</code> annotations should be used for the dictionary lookup, it
	is possible to define a filter that restricts the annotations that are used. A filter requires
	three settings. The first one is the <code>InputMatchFilterFeaturePath</code>,
	which specifies the feature or featurePath that is used for the filtering. The
	second one is the <code>FilterConditionOperator</code>, which defines the filter condition
	operator. The last one is the <code>FilterConditionValue</code>, which defines the condition
	value for the comparison.
	</para>
	<para>
	The parameter definition for all three parameters looks like:
	<programlisting><![CDATA[<configurationParameter>
	<name>InputMatchFilterFeaturePath</name>
	<description></description>
	<type>String</type>
	<multiValued>false</multiValued>
	<mandatory>false</mandatory>
	</configurationParameter>

	<configurationParameter>
	<name>FilterConditionOperator</name>
	<description></description>
	<type>String</type>
	<multiValued>false</multiValued>
	<mandatory>false</mandatory>
	</configurationParameter>

	<configurationParameter>
	<name>FilterConditionValue</name>
	<description></description>
	<type>String</type>
	<multiValued>false</multiValued>
	<mandatory>false</mandatory>
	</configurationParameter>]]></programlisting>
	</para>
	<para>
	For the <code>InputMatchFilterFeaturePath</code> the same rules apply as for the
	<code>InputMatchFeaturePath</code>. The specified feature or featurePath must be valid
	for the <code>InputMatchType</code> definition. If a featurePath is specified, the features
	are separated by "/".
	</para>
	<para>
	The value for the <code>FilterConditionOperator</code> can be one of:
	<itemizedlist>
	<listitem>
	<para>
	<code>NULL</code> -
	<code>InputMatchFilterFeaturePath</code> value must be NULL. No <code>FilterConditionValue</code>
	needs to be specified.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>NOT_NULL</code> -
	<code>InputMatchFilterFeaturePath</code> value must be set, that is, it must not be NULL. No
	<code>FilterConditionValue</code> needs to be specified.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>EQUALS</code> -
	<code>InputMatchFilterFeaturePath</code> value must be equal to
	the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>NOT_EQUALS</code> -
	<code>InputMatchFilterFeaturePath</code> value must not be equal
	to the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>LESS</code> -
	<code>InputMatchFilterFeaturePath</code> value is less than
	the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>LESS_EQ</code> -
	<code>InputMatchFilterFeaturePath</code> value is less than or equal to
	the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>GREATER</code> -
	<code>InputMatchFilterFeaturePath</code> value is greater than
	the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	<listitem>
	<para>
	<code>GREATER_EQ</code> -
	<code>InputMatchFilterFeaturePath</code> value is greater than or equal to
	the <code>FilterConditionValue</code>.
	</para>
	</listitem>
	</itemizedlist>
	</para>
	<para>
	A sample configuration for a filter that uses only noun tokens for the dictionary lookup
	is shown below:
	<programlisting><![CDATA[<nameValuePair>
	<name>InputMatchFilterFeaturePath</name>
	<value>
	<string>partOfSpeach</string>
	</value>
	</nameValuePair>

	<nameValuePair>
	<name>FilterConditionOperator</name>
	<value>
	<string>EQUALS</string>
	</value>
	</nameValuePair>

	<nameValuePair>
	<name>FilterConditionValue</name>
	<value>
	<string>noun</string>
	</value>
	</nameValuePair>]]></programlisting>
	</para>
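	<para>
	For the <code>NULL</code> and <code>NOT_NULL</code> operators, which take no
	<code>FilterConditionValue</code>, a filter could be sketched with only two settings. The
	fragment below is an illustration that assumes a feature called <code>baseFormToken</code>
	on the <code>InputMatchType</code>; the feature name is only an example. With this sketch,
	only annotations that have the feature set would be used for the lookup:
	<programlisting><![CDATA[<!-- assumed feature name, adapt it to your type system -->
	<nameValuePair>
	<name>InputMatchFilterFeaturePath</name>
	<value>
	<string>baseFormToken</string>
	</value>
	</nameValuePair>

	<nameValuePair>
	<name>FilterConditionOperator</name>
	<value>
	<string>NOT_NULL</string>
	</value>
	</nameValuePair>]]></programlisting>
	</para>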
	</section>
	</chapter>
	</book>