<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
<!ENTITY imgroot "./images/" >
<!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod">
%xinclude;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<book lang="en">
<title>
Apache UIMA Regular Expression Annotator Documentation
</title>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="../../../SandboxDocs/src/docbook/book_info.xml" />
<preface>
<title>Introduction</title>
<para>
The Regular Expression Annotator (RegexAnnotator) is an
Apache UIMA analysis engine that detects entities such as
email addresses, URLs, phone numbers, zip codes or any other
entity that can be specified using a regular expression. For
each detected entity, a new annotation can be created or an
existing annotation can be updated with new features.
To detect more difficult and complex entities, the annotator
provides advanced filter capabilities and a rule definition
syntax that combines rules into a concept, with a confidence
value for each of the concept's rules.
</para>
</preface>
<chapter id="sandbox.regexAnnotator.processingOverview">
<title>Processing Overview</title>
<para>
To detect any kind of entity, the RegexAnnotator must be
configured using an external XML file. This file is called the
"concept file" since it contains the regular expressions and
concepts that the annotator uses during its processing to
detect entities. In addition to the rules, the concept file
also contains the "entity result processing" that is performed
when an entity is detected. The "entity result processing" can
either be the creation of new annotations or an update of an
existing annotation with additional features. The types and
features that are used to create new annotations have to be
available in the UIMA type system.
</para>
<para>
After the concept file is created, the annotator XML
descriptor has to be updated with the capabilities and,
if necessary, with the type system information from the concept
file. The capability update is necessary so that the UIMA
framework can also call the annotator in complex annotator
flows when the annotator is assembled with others into an
analysis bundle. The UIMA type system update is only
necessary if the types used are not yet available in the UIMA
type system definition.
</para>
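<para>
As a rough sketch of such a descriptor update, the fragment below assumes a concept
file that creates <code>org.apache.uima.EmailAddress</code> annotations. It shows only
the type system and capability parts of an analysis engine descriptor; the supertype
and the description text are assumptions, and all other descriptor content is omitted.
</para>
<para>
<programlisting><![CDATA[<analysisEngineMetaData>
  <!-- other metadata such as name and configuration parameters omitted -->
  <typeSystemDescription>
    <types>
      <typeDescription>
        <name>org.apache.uima.EmailAddress</name>
        <description>Email address detected by the RegexAnnotator</description>
        <!-- assumption: a plain annotation type without extra features -->
        <supertypeName>uima.tcas.Annotation</supertypeName>
      </typeDescription>
    </types>
  </typeSystemDescription>
  <capabilities>
    <capability>
      <inputs/>
      <outputs>
        <type allAnnotatorFeatures="true">org.apache.uima.EmailAddress</type>
      </outputs>
      <languagesSupported/>
    </capability>
  </capabilities>
</analysisEngineMetaData>
]]></programlisting>
</para>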
<para>
Once the descriptor updates are complete, the
RegexAnnotator is ready to use. During its initialization,
the annotator reads the concept
file and checks whether all rules and concepts are valid and whether
all annotation types are defined in the UIMA type system.
For each document that is processed, the rules and concepts
are executed in exactly the same order as defined in the
concept file. The results and annotations created by a
preceding rule can be used by the following one since they are
stored in the CAS.
</para>
</chapter>
<chapter id="sandbox.regexAnnotator.conceptsFile">
<title>Concepts Configuration File</title>
<para>
The RegexAnnotator can be configured using two levels of
complexity.
</para>
<para>
The RuleSet definition is the easier way to define rules.
Such a definition consists of a regular expression pattern
and of annotations that should be created if the rule matches
an entity.
</para>
<para>
The Concept definition is the more complex way to define
rules. Such a definition can consist of more than one
regular expression rule, which can be combined, and of
a set of annotations that should be created if one of the
rules has matched an entity.
</para>
<para>
The syntax for both definitions is the same, so you don't
need to learn two configuration syntaxes. The RuleSet
definition simply offers an easier and faster
way to configure the annotator for simple tasks. If you have
a RuleSet definition, it is also possible to extend it step by step with
more features so that it becomes a full Concept
definition.
</para>
<section id="sandbox.regexAnnotator.conceptsFile.rules">
<title>RuleSet definition</title>
<para>
The syntax of a simple RuleSet definition to detect email addresses
is shown in the listing below:
</para>
<para>
<programlisting><![CDATA[<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="concept.xsd">
<concept name="emailAddressDetection">
<rules>
<rule regEx="([a-zA-Z0-9!#$%*+'/=?^_-`{|}~.\x26]+)@
([a-zA-Z0-9._-]+[a-zA-Z]{2,4})"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"/>
</rules>
<createAnnotations>
<annotation id="emailAnnot" type="org.apache.uima.EmailAddress">
<begin group="0"/>
<end group="0"/>
</annotation>
</createAnnotations>
</concept>
</conceptSet>
]]></programlisting>
</para>
<para>
The definition above defines a simple concept
with the name <code>emailAddressDetection</code>. The
defined rule uses <code>([a-zA-Z0-9!#$%*+'/=?^_-`{|}~.\x26]+)@([a-zA-Z0-9._-]+[a-zA-Z]{2,4})</code> as
the regular expression pattern, which is matched against the
covered text of the match type <code>uima.tcas.DocumentAnnotation</code>.
The match strategy is <code>matchAll</code>, which means that all
matches for the pattern are used to create the
annotations defined in the
<code>&lt;createAnnotations></code>
element. So for each match, an
<code>org.apache.uima.EmailAddress</code> annotation is created that
covers the match in the document text.
</para>
<para>
For additional annotation creation possibilities such as adding
features to a created annotation, please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation"/>.
</para>
</section>
<section id="sandbox.regexAnnotator.conceptsFile.concepts">
<title>Concept definition</title>
<para>The syntax of a complex Concept definition to detect credit card numbers for the
RegexAnnotator is shown in the listing below:</para>
<para>
<programlisting><![CDATA[<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="concept.xsd">
<concept name="creditCardNumberDetection" processAllRules="true">
<rules>
<rule ruleId="AmericanExpress"
regEx="(((34|37)\d{2}[- ]?)(\d{6}[- ]?)\d{5})"
matchStrategy="matchAll"
matchType="uima.tcas.DocumentAnnotation"
confidence="1.0"/>
<rule ruleId="Visa"
regEx="((4\d{3}[- ]?)(\d{4}[- ]?){2}\d{4})"
matchStrategy="matchAll"
matchType="uima.tcas.DocumentAnnotation"
confidence="1.0"/>
<rule ruleId="MasterCard"
regEx="((5[1-5]\d{2}[- ]?)(\d{4}[- ]?){2}\d{4})"
matchStrategy="matchAll"
matchType="uima.tcas.DocumentAnnotation"
confidence="1.0"/>
<rule ruleId="unknownCardType"
regEx="(([1-6]\d{3}[- ])(\d{4}[- ]){2}\d{4})|
([1-6]\d{13,18})|([1-6]\d{3}[- ]\d{6}[- ]\d{5})"
matchStrategy="matchAll"
matchType="uima.tcas.DocumentAnnotation"
confidence="1.0"/>
</rules>
<createAnnotations>
<annotation id="creditCardNumber"
type="org.apache.uima.CreditCardNumber"
validate="org.apache.uima.annotator.regex.
extension.impl.CreditCardNumberValidator">
<begin group="0"/>
<end group="0"/>
<setFeature name="confidence" type="Confidence"/>
<setFeature name="cardType" type="RuleId"/>
</annotation>
</createAnnotations>
</concept>
</conceptSet>
]]></programlisting>
</para>
<para>
As you can see, the Concept definition is a more complex
RuleSet definition. The main differences are some additional
features defined at the rule and the combination of several rules
within one concept.
The new features for a rule are <code>ruleId</code>
and <code>confidence</code>. If these features
are specified, their values can
later be assigned to a feature of a created annotation.
In the listing above, this means that when the
<code>org.apache.uima.CreditCardNumber</code> annotation is created, the value of the
<code>confidence</code> feature of the rule that matched the document text
is assigned to the annotation feature called <code>confidence</code>.
In the same way, the value of the <code>ruleId</code> feature is assigned
to the annotation feature called <code>cardType</code>.
This allows you to check the confidence of an annotation later and to see
which rule was responsible for the annotation creation.
</para>
<note>
<para>
The annotation features that receive the <code>Confidence</code>
and <code>RuleId</code> values
have to be created manually in the UIMA type system.
Once they are defined, the <code>confidence</code> and <code>ruleId</code>
values can be assigned to any annotation feature you have defined
in the UIMA type system. Confidence features have to be of type
<code>uima.cas.Float</code> and RuleId features have to be of
type <code>uima.cas.String</code>.
</para>
</note>
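<para>
The following type system fragment is a minimal sketch of the two manually created
features for the credit card example. The feature names <code>confidence</code> and
<code>cardType</code> are taken from the listing above; the supertype and the
description texts are assumptions.
</para>
<para>
<programlisting><![CDATA[<typeDescription>
  <name>org.apache.uima.CreditCardNumber</name>
  <description>Credit card number detected by the RegexAnnotator</description>
  <!-- assumption: the type derives directly from the base annotation type -->
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <featureDescription>
      <!-- receives the rule's confidence value (type="Confidence") -->
      <name>confidence</name>
      <description>Confidence of the rule that created the annotation</description>
      <rangeTypeName>uima.cas.Float</rangeTypeName>
    </featureDescription>
    <featureDescription>
      <!-- receives the rule's ruleId value (type="RuleId") -->
      <name>cardType</name>
      <description>Id of the rule that created the annotation</description>
      <rangeTypeName>uima.cas.String</rangeTypeName>
    </featureDescription>
  </features>
</typeDescription>
]]></programlisting>
</para>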
<para>
How a concept definition is processed depends on its rule processing mode.
The feature that controls the rule processing is called
<code>processAllRules</code> and is specified at the <code>&lt;concept></code> element.
By default this optional feature is set to <code>false</code>.
This means that the concept processing
starts with the first rule and continues with the next one
only until a match is found. So in this processing mode, possibly only the first rule
of a concept is evaluated, namely if it already produces a match. The other rules
of this concept are ignored in that case.
This strategy should be used, for example, if your first concept
rule has a strict pattern with a confidence of 1.0 and your
second rule has a more lenient pattern with a confidence
of 0.5. If the <code>processAllRules</code> feature
is set to <code>true</code>, all rules of a concept are processed
independently of the matches for a previous rule.
</para>
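<para>
The strict-then-lenient strategy described above could be sketched as follows. The
patterns are borrowed from the credit card listing; the concept name, the rule ids and
the confidence of 0.5 for the lenient rule are illustrative assumptions. Because
<code>processAllRules</code> is <code>false</code>, the lenient rule is only expected
to be evaluated if the strict rule finds no match.
</para>
<para>
<programlisting><![CDATA[<concept name="creditCardNumberFallback" processAllRules="false">
  <rules>
    <!-- strict pattern, evaluated first -->
    <rule ruleId="strictVisa"
          regEx="((4\d{3}[- ]?)(\d{4}[- ]?){2}\d{4})"
          matchStrategy="matchAll"
          matchType="uima.tcas.DocumentAnnotation"
          confidence="1.0"/>
    <!-- more lenient pattern, only reached if the strict rule has no match -->
    <rule ruleId="lenientCardNumber"
          regEx="(([1-6]\d{3}[- ])(\d{4}[- ]){2}\d{4})"
          matchStrategy="matchAll"
          matchType="uima.tcas.DocumentAnnotation"
          confidence="0.5"/>
  </rules>
  <createAnnotations>
    <annotation id="creditCardNumber"
                type="org.apache.uima.CreditCardNumber">
      <begin group="0"/>
      <end group="0"/>
      <setFeature name="confidence" type="Confidence"/>
      <setFeature name="cardType" type="RuleId"/>
    </annotation>
  </createAnnotations>
</concept>
]]></programlisting>
</para>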
</section>
<section
id="sandbox.regexAnnotator.conceptsFile.regexVariables">
<title>Regex Variables</title>
<para>
Regex variables allow you to externalize parts of a regular expression
to shorten it and make it easier to read. The externalized part of the
expression is replaced with a regex variable. The variable syntax looks like
<code>\v{weekdays}</code>, where <code>weekdays</code> is the variable name.
Regex variables are mainly used to separate enumerations from a
regular expression to make them easier to understand and maintain.
The short example below shows how this works.
</para>
<para>
A simple regular expression for a date like <code>Wednesday, November 28, 2007</code>
can look like:
</para>
<para>
<programlisting><emphasis><![CDATA[<concept name="Date" processAllRules="true">
<rules>
<rule regEx="(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday),
(January|February|March|April|May|June|July|August|September|October|
November|December) (0[1-9]|[12][0-9]|3[01]), ((19|20)\d\d)"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"/>
</rules>
<createAnnotations>
<annotation type="org.apache.uima.Date">
<begin group="0" />
<end group="0" />
</annotation>
</createAnnotations>
</concept>
]]></emphasis></programlisting>
</para>
<para>
When using regex variables to externalize the weekdays and the months in this
regular expression, it looks like:
</para>
<para>
<programlisting><emphasis><![CDATA[<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://incubator.apache.org/uima/regex">
<variables>
<variable name="weekdays"
value="Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"/>
<variable name="months"
value="January|February|March|April|May|June|July|August|September|
October|November|December"/>
</variables>
<concept name="Date" processAllRules="true">
<rules>
<rule regEx="(\v{weekdays}), (\v{months}) (0[1-9]|[12][0-9]|3[01]),
((19|20)\d\d)"
matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"/>
</rules>
<createAnnotations>
<annotation type="org.apache.uima.Date">
<begin group="0" />
<end group="0" />
</annotation>
</createAnnotations>
</concept>
</conceptSet>
]]></emphasis></programlisting>
</para>
<para>
The regex variables must be defined at the beginning of the concept file,
directly inside the <code>&lt;conceptSet></code> element and before the concepts are
defined. The variables can be used in all concept definitions within the
same file.
</para>
<para>
The regex variable name can contain any of the following characters
<code>[a-zA-Z_0-9]</code>. Other characters are not allowed.
</para>
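<para>
To illustrate that one variable can be shared by several concepts in the same file,
the sketch below defines a single variable and uses it in two concepts. The variable
name <code>months_short</code>, the concept names and the
<code>org.apache.uima.MonthYear</code> type are hypothetical.
</para>
<para>
<programlisting><![CDATA[<conceptSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="http://incubator.apache.org/uima/regex">
  <variables>
    <!-- the name uses only characters from [a-zA-Z_0-9] -->
    <variable name="months_short"
        value="Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"/>
  </variables>
  <!-- first concept using the variable, e.g. "Nov 28, 2007" -->
  <concept name="ShortDate">
    <rules>
      <rule regEx="(\v{months_short}) (0[1-9]|[12][0-9]|3[01]), ((19|20)\d\d)"
            matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"/>
    </rules>
    <createAnnotations>
      <annotation type="org.apache.uima.Date">
        <begin group="0"/>
        <end group="0"/>
      </annotation>
    </createAnnotations>
  </concept>
  <!-- second concept reusing the same variable, e.g. "Nov 2007" -->
  <concept name="MonthYear">
    <rules>
      <rule regEx="(\v{months_short}) ((19|20)\d\d)"
            matchStrategy="matchAll" matchType="uima.tcas.DocumentAnnotation"/>
    </rules>
    <createAnnotations>
      <annotation type="org.apache.uima.MonthYear">
        <begin group="0"/>
        <end group="0"/>
      </annotation>
    </createAnnotations>
  </concept>
</conceptSet>
]]></programlisting>
</para>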
</section>
<section
id="sandbox.regexAnnotator.conceptsFile.rulesDefinition">
<title>Rule Definition</title>
<para>
This section shows in detail how to define a rule for a
RuleSet or Concept definition and describes some advanced
configuration possibilities for the rule processing.
</para>
<para>
The listing below shows an abstract rule definition with
all possible sub elements and attributes. Please refer to
the sub sections for details about the sub elements.
</para>
<para>
<programlisting><emphasis><![CDATA[<rule ruleId="ID1" regEx="TestRegex" matchStrategy="matchAll"
      matchType="uima.tcas.DocumentAnnotation" featurePath="my/feature/path"
      confidence="1.0">
  <matchTypeFilter>
    <feature featurePath="language">en</feature>
  </matchTypeFilter>
  <updateMatchTypeAnnotation>
    <setFeature name="language" type="String">$0</setFeature>
  </updateMatchTypeAnnotation>
  <ruleExceptions>
    <exception matchType="uima.tcas.DocumentAnnotation">
      ExceptionExpression
    </exception>
  </ruleExceptions>
</rule>
]]></emphasis></programlisting>
</para>
<para>
For each rule that should be added, a <code>&lt;rule></code> element
has to be created. The <code>&lt;rule></code> element definition has three
mandatory features:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>regEx</code>
- The regular expression pattern that
is used for this rule. As pattern, everything supported
by the Java regular expression syntax is allowed.
</para>
</listitem>
<listitem>
<para>
<code>matchStrategy</code>
- The match strategy that is used
for this rule. Possible values are
<code>matchAll</code>
to get all matches,
<code>matchFirst</code>
to get the first match only and
<code>matchComplete</code>
to get matches where the whole input
text matches the regular expression pattern.
</para>
</listitem>
<listitem>
<para>
<code>matchType</code>
- The annotation type that is used
to match the regular expression pattern.
As input text for the match, the annotation span
is used, but only if no additional <code>featurePath</code>
feature is specified.
</para>
</listitem>
</itemizedlist>
</para>
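<para>
The three match strategies can be compared with the sketch below, which applies the
same hypothetical five-digit pattern with each strategy. The comments describe the
behavior as it would be expected from the descriptions above; the
<code>uima.TokenAnnotation</code> match type is assumed to be available in the type system.
</para>
<para>
<programlisting><![CDATA[<!-- matchAll: every five-digit sequence in the input text is a match -->
<rule regEx="\d{5}" matchStrategy="matchAll"
      matchType="uima.tcas.DocumentAnnotation"/>

<!-- matchFirst: only the first five-digit sequence in the input text is a match -->
<rule regEx="\d{5}" matchStrategy="matchFirst"
      matchType="uima.tcas.DocumentAnnotation"/>

<!-- matchComplete: a token is a match only if it consists of exactly five digits -->
<rule regEx="\d{5}" matchStrategy="matchComplete"
      matchType="uima.TokenAnnotation"/>
]]></programlisting>
</para>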
<para>
In addition to the mandatory features the <code>&lt;rule></code>
element definition also has some optional features that can
be used, these are:
</para>
<itemizedlist>
<listitem>
<para>
<code>ruleId</code>
- Specifies the ID for this rule. The
ID can later be used to add it as
value to an annotation feature (see
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.features"/>).
</para>
</listitem>
<listitem>
<para>
<code>confidence</code>
- Specifies the confidence value of this
rule. If you have more than one rule that describes
the same complex entity you can classify the rules with
a confidence value. This confidence value
can later be used to add it as value to an
annotation feature (see
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.features"/>).
</para>
</listitem>
<listitem>
<para>
<code>featurePath</code>
- Specifies the feature path that should be used to match the regular expression pattern.
If a feature path is specified, the feature path value is used to match against the
regular expression instead of the match type annotation span.
The defined feature path must be valid for the specified match type annotation type.
The feature path elements are separated by "/".
</para>
<para>
The listing below shows how to match a regular expression on the <code>normalizedText</code>
feature of a <code>uima.TokenAnnotation</code>. So in this case, not the covered text of the
<code>uima.TokenAnnotation</code> is used to match the regular expression but the
<code>normalizedText</code> feature value of the annotation. The <code>normalizedText</code>
feature must be defined in the UIMA type system as feature of type <code>uima.TokenAnnotation</code>.
</para>
<para>
<programlisting><emphasis><![CDATA[<rule regEx="TestRegex" matchStrategy="matchAll"
matchType="uima.TokenAnnotation" featurePath="normalizedText">
</rule>
]]></emphasis></programlisting>
</para>
</listitem>
</itemizedlist>
<section
id="sandbox.regexAnnotator.conceptsFile.rulesDefinition.filter">
<title>Match Type Filter</title>
<para>
<programlisting><emphasis><![CDATA[<matchTypeFilter>
<feature featurePath="language">en</feature>
</matchTypeFilter>
]]></emphasis></programlisting>
</para>
<para>
Match type filters can be used to filter the match type
annotations that are used for matching the regular expression
pattern, for example to apply a rule only when the document language
is English, as shown in the example above.
Match type filters always relate to the <code>matchType</code>
that was specified for the rule.
</para>
<para>
The <code>&lt;matchTypeFilter></code>
element can contain any number of
<code>&lt;feature></code>
elements that contain the filter information. All specified <code>&lt;feature></code>
elements have to be valid for the <code>matchType</code> annotation
of the rule.
</para>
<para>
The feature path that should be used as
filter is specified using the <code>featurePath</code> feature of the
<code>&lt;feature></code> element. Feature path elements are separated by "/", e.g.
<code>my/feature/path</code>. The specified feature path must be valid for the <code>matchType</code> annotation.
The content of the
<code>&lt;feature></code> element contains the regular expression pattern
that is used as filter. To pass the filter, this pattern
has to match the feature path value that is resolved using the match type annotation.
In the example above, the match type annotation has a UIMA feature called
<code>language</code> that has to have the content <code>en</code>. If that
is the case, the annotation passes the filter condition.
</para>
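<para>
A filter with more than one <code>&lt;feature></code> element and a nested feature path
might look like the sketch below. The <code>metadata/format</code> path is hypothetical
and would have to exist for the match type in your type system. That an annotation has
to satisfy every listed pattern to pass the filter is an assumption of this sketch.
</para>
<para>
<programlisting><![CDATA[<rule regEx="TestRegex" matchStrategy="matchAll"
      matchType="uima.tcas.DocumentAnnotation">
  <matchTypeFilter>
    <!-- simple feature path: the language feature of the match type annotation -->
    <feature featurePath="language">en</feature>
    <!-- nested, hypothetical feature path; the content is a regular expression -->
    <feature featurePath="metadata/format">text|html</feature>
  </matchTypeFilter>
</rule>
]]></programlisting>
</para>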
</section>
<section id="sandbox.regexAnnotator.conceptsFile.rulesDefinition.update">
<title>Update Match Type Annotations With Additional Features</title>
<para>
<programlisting><emphasis><![CDATA[<updateMatchTypeAnnotation>
<setFeature name="language" type="String">$0</setFeature>
</updateMatchTypeAnnotation>
]]></emphasis></programlisting>
</para>
<para>
With the
<code>&lt;updateMatchTypeAnnotation></code>
construct it is possible to update or set a UIMA feature value
of the match type annotation when a rule match
is found. The
<code>&lt;updateMatchTypeAnnotation></code> element
can have any number of
<code>&lt;setFeature></code> elements that contain
the feature information that should be updated.
</para>
<para>
The <code>&lt;setFeature></code> element has two
mandatory features, these are:
</para>
<itemizedlist>
<listitem>
<para>
<code>name</code>
- Specifies the UIMA feature name that
should be set. The feature has to be defined
for the <code>matchType</code> annotation
of the rule.
</para>
</listitem>
<listitem>
<para>
<code>type</code>
- Specifies the UIMA feature type that is
defined in the UIMA type system for this feature.
Currently supported feature types are <code>String</code>,
<code>Integer</code> and <code>Float</code>.
</para>
</listitem>
</itemizedlist>
<para>
The optional features are:
</para>
<itemizedlist>
<listitem>
<para>
<code>normalization</code>
- Specifies the normalization that should be performed before the feature value
is assigned to the match type annotation. For a list of all built-in
normalization functions please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureNormalization"/>.
</para>
</listitem>
<listitem>
<para>
<code>class</code>
- Specifies the custom normalization class that should be used to normalize the
feature value before it is assigned to the match type annotation. Custom normalization
classes are used if the <code>normalization</code> feature has the value
<code>Custom</code>. The normalization class has to implement the
<code>org.apache.uima.annotator.regex.extension.Normalization</code> interface.
For details about the feature normalization please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureNormalization"/>.
</para>
</listitem>
</itemizedlist>
<para>
The content of the <code>&lt;setFeature></code>
element definition contains the feature value that should be set.
This can either be a literal value or a regular
expression capturing group as shown in the example
above. A combination of capturing groups and literals
is also possible.
</para>
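<para>
The different kinds of feature values can be sketched as follows. The pattern, the
<code>uima.TokenAnnotation</code> match type and the feature names
<code>countryCode</code>, <code>number</code> and <code>label</code> are hypothetical
and would have to be defined for the match type in your type system.
</para>
<para>
<programlisting><![CDATA[<rule regEx="([A-Z]{2})-(\d+)" matchStrategy="matchAll"
      matchType="uima.TokenAnnotation">
  <updateMatchTypeAnnotation>
    <!-- capturing group 1, lower-cased by a built-in normalization -->
    <setFeature name="countryCode" type="String"
                normalization="ToLowerCase">$1</setFeature>
    <!-- capturing group 2, stored as Integer feature -->
    <setFeature name="number" type="Integer">$2</setFeature>
    <!-- combination of literal text and capturing groups -->
    <setFeature name="label" type="String">code $1/$2</setFeature>
  </updateMatchTypeAnnotation>
</rule>
]]></programlisting>
</para>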
</section>
<section
id="sandbox.regexAnnotator.conceptsFile.rulesDefinition.exception">
<title>Rule exception</title>
<para>
<programlisting><emphasis><![CDATA[<ruleExceptions>
<exception matchType="uima.tcas.DocumentAnnotation">
ExceptionPattern
</exception>
</ruleExceptions>
]]></emphasis></programlisting>
</para>
<para>
With the
<code>&lt;ruleExceptions></code>
construct it is possible to configure exceptions that prevent matches for the rule.
An exception is similar to a filter, but works on a higher level. For
example, take the scenario where you have several token annotations that
are covered by a sentence annotation. You have written a rule that detects
car brands. The text you analyze contains the sentence "Henry Ford was born 1863".
When analyzing the text you will get a car brand annotation since "Ford" is
a car brand. But is this the correct behavior? To work around that issue
you can create an exception that looks like
<programlisting><emphasis><![CDATA[<ruleExceptions>
<exception matchType="uima.SentenceAnnotation">Henry</exception>
</ruleExceptions>
]]></emphasis></programlisting>
and add it to your car brand rule. After adding this, car brand annotations
are only created if the sentence annotation that covers the token annotation
does not contain the word "Henry".
</para>
<para>
The <code>&lt;ruleExceptions></code> element can have
any number of <code>&lt;exception></code>
elements to specify rule exceptions.
</para>
<para>
The <code>&lt;exception></code>
element has one mandatory feature called
<code>matchType</code>. The <code>matchType</code> feature
specifies the annotation type the exception is based on.
The concrete exception annotation that is used
at runtime is determined for each
match type annotation that is used to match a rule:
it is always the annotation of the exception match type that covers
the current match type annotation.
If no covering annotation instance of the exception match type
is found, the exception is not evaluated.
</para>
<para>
The content of the <code>&lt;exception></code>
element specifies the regular expression that is used to evaluate the exception.
</para>
<para>
If the exception pattern matches, the
current match type annotation is filtered out and is
not used to create any matches and annotations.
</para>
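<para>
Put together, the car brand scenario could be configured roughly as in the sketch
below. The brand list, the concept name and the <code>org.apache.uima.CarBrand</code>
type are hypothetical; <code>uima.TokenAnnotation</code> and
<code>uima.SentenceAnnotation</code> are assumed to be created by an annotator that
runs before the RegexAnnotator.
</para>
<para>
<programlisting><![CDATA[<concept name="carBrandDetection">
  <rules>
    <!-- the rule is matched against each token annotation -->
    <rule regEx="Ford|Opel|Toyota" matchStrategy="matchAll"
          matchType="uima.TokenAnnotation">
      <ruleExceptions>
        <!-- skip tokens whose covering sentence contains "Henry" -->
        <exception matchType="uima.SentenceAnnotation">Henry</exception>
      </ruleExceptions>
    </rule>
  </rules>
  <createAnnotations>
    <annotation id="carBrand" type="org.apache.uima.CarBrand">
      <begin group="0"/>
      <end group="0"/>
    </annotation>
  </createAnnotations>
</concept>
]]></programlisting>
</para>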
</section>
</section>
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation">
<title>Annotation Creation</title>
<para>
This section explains in detail how to create annotations when a rule has matched some input text.
An annotation creation example with all possible settings is shown in the listing below.
</para>
<para>
<programlisting><emphasis><![CDATA[<annotation id="testannot" type="org.apache.uima.TestAnnot"
validate="CustomValidatorClass">
<begin group="0" location="start"/>
<end group="0" location="end"/>
<setFeature name="testFeature1" type="String">$0</setFeature>
<setFeature name="testFeature2" type="String"
normalization="ToLowerCase">$0</setFeature>
<setFeature name="testFeature3" type="Integer">$1</setFeature>
<setFeature name="testFeature4" type="Float">$2</setFeature>
<setFeature name="testFeature5" type="Reference">testannot1</setFeature>
<setFeature name="confidenceValue" type="Confidence"/>
<setFeature name="ruleId" type="RuleId"/>
<setFeature name="normalizedText" type="String"
normalization="Custom"
class="org.apache.CustomNormalizer">$0</setFeature>
</annotation>]]></emphasis></programlisting>
</para>
<para>
The <code>&lt;annotation></code> element has two mandatory features, these are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>id</code>
- Specifies the annotation id for this annotation. If the annotation id is specified,
it must be unique within the same concept. An annotation id is required if the
annotation is referred to by another annotation or if the annotation itself refers
to other annotations using a <code>Reference</code> feature.
</para>
</listitem>
<listitem>
<para>
<code>type</code>
- Specifies the UIMA annotation type that is used when an annotation is created.
The type has to be defined in the UIMA type system.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The optional features are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>validate</code>
- Specifies the custom validator class that is used to validate matches before
they are added as annotation to the CAS. For more details about the custom
annotation validation, please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.validation"/>.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The mandatory sub elements of the <code>&lt;annotation></code> element are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>&lt;begin></code>
- Specifies the begin position of the annotation that is created.
For details about the <code>&lt;begin></code> element, please refer
to <xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.boundaries"/>.
</para>
</listitem>
<listitem>
<para>
<code>&lt;end></code>
- Specifies the end position of the annotation that is created.
For details about the <code>&lt;end></code> element, please refer
to <xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.boundaries"/>.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The optional sub elements of the <code>&lt;annotation></code> element are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>&lt;setFeature></code>
- Sets a UIMA feature for the created annotation.
For details about the <code>&lt;setFeature></code> element, please refer
to <xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.features"/>.
</para>
</listitem>
</itemizedlist>
</para>
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation.boundaries">
<title>Annotation Boundaries</title>
<para>
When creating an annotation with the <code>&lt;annotation></code> element, it is also
necessary to define the annotation's boundaries. The annotation boundaries are defined using the
sub elements <code>&lt;begin></code> and <code>&lt;end></code>. The start position of
the annotation is defined using the <code>&lt;begin></code> element, the end position using
the <code>&lt;end></code> element. Both elements have the same features, as shown below:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>group</code>
- Identifies the capturing group number within the regular expression pattern for the
current rule. The value is a non-negative number where 0 denotes
the whole match, 1 the first capturing group, 2 the second one, and so on.
</para>
</listitem>
<listitem>
<para>
<code>location</code>
- indicates a position inside the capturing group, which can either be the position
of the left parenthesis in case of a value <code>start</code>, or the right parenthesis in
case of a value <code>end</code>. The <code>location</code> feature is optional. By default
the <code>&lt;begin></code> element is set to <code>location="start"</code> and the
<code>&lt;end></code> element to <code>location="end"</code>.
</para>
</listitem>
</itemizedlist>
</para>
<note>
<para>
When the rule definition defines a <code>featurePath</code> for a <code>matchType</code>,
the annotation boundaries for the created annotation are automatically set to
the annotation boundaries of the match input annotation. This is necessary because
a match on a feature value of an annotation has no relation to the document text; the only
relation is the annotation on which the feature is defined.
</para>
</note>
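<para>
As an illustration of the <code>group</code> and <code>location</code> features, the
sketch below creates two annotations from one match of a hypothetical pattern with two
capturing groups. The pattern and both annotation types are assumptions; the comments
describe the spans that would be expected for the input <code>555-1234</code>.
</para>
<para>
<programlisting><![CDATA[<concept name="boundaryExample">
  <rules>
    <rule regEx="(\d{3})-(\d{4})" matchStrategy="matchAll"
          matchType="uima.tcas.DocumentAnnotation"/>
  </rules>
  <createAnnotations>
    <!-- covers only the second capturing group: "1234" -->
    <annotation id="lastPart" type="org.apache.uima.LastPart">
      <begin group="2" location="start"/>
      <end group="2" location="end"/>
    </annotation>
    <!-- covers the text between the two groups: "-" -->
    <annotation id="separator" type="org.apache.uima.Separator">
      <begin group="1" location="end"/>
      <end group="2" location="start"/>
    </annotation>
  </createAnnotations>
</concept>
]]></programlisting>
</para>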
</section>
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation.validation">
<title>Annotation Validation</title>
<para>
The custom annotation validation can be used to validate a regular expression match with
Java code before the match is added as annotation to the CAS. For example, if your regular expression
detects an ISBN, you can use the custom validation code to check whether it is really an ISBN
by calculating the check digit, or whether it is just a phone number.
</para>
<para>
To use the custom annotation validation you have to specify the validation class in the <code>validate</code>
feature of the <code>&lt;annotation></code> element. The validation class must implement the
<code>org.apache.uima.annotator.regex.extension.Validation</code> interface
(<xref linkend="sandbox.regexAnnotator.Validation"/>). The interface defines one
method called <code>validate(String coveredText, String ruleID)</code>. The validate method is called by the annotator
before the match is added as annotation to the CAS. Annotations are only added if the validate method
returns <code>true</code>; otherwise the match is skipped. The <code>coveredText</code> parameter contains
the text that matches the regular expression.
The <code>ruleID</code> parameter contains the ruleId of the rule that created the match. It can be null
if no ruleId was specified. The listing below shows a sample implementation of the validation interface.
</para>
<para>
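<programlisting><![CDATA[package org.apache.uima.annotator.regex;
public class SampleValidator implements
    org.apache.uima.annotator.regex.extension.Validation {
  /* (non-Javadoc)
   * @see org.apache.uima.annotator.regex.extension.Validation
   *      #validate(java.lang.String, java.lang.String)
   */
  public boolean validate(String coveredText, String ruleID)
      throws Exception {
    // implement your custom validation, e.g. to validate ISBN numbers
    return validateISBNNumbers(coveredText);
  }
  // Illustrative ISBN-10 helper (an assumption, not part of the annotator).
  private boolean validateISBNNumbers(String text) {
    String isbn = text.replaceAll("[- ]", ""); // drop separators
    if (!isbn.matches("\\d{9}[\\dXx]")) return false;
    int sum = 0;
    for (int i = 0; i < 10; i++) {
      char c = isbn.charAt(i);
      int digit = (c == 'X' || c == 'x') ? 10 : c - '0';
      sum += (10 - i) * digit; // weights 10 down to 1
    }
    return sum % 11 == 0; // valid if the weighted sum is divisible by 11
  }
}]]></programlisting>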
</para>
<para>
The configuration for this example looks like:
</para>
<para>
<programlisting><emphasis><![CDATA[<annotation id="isbnNumber" type="org.apache.uima.ISBNNumber"
validate="org.apache.uima.annotator.regex.SampleValidator">
<begin group="0"/>
<end group="0"/>
</annotation>]]></emphasis></programlisting>
</para>
</section>
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation.features">
<title>Annotation Features</title>
<para>
With the <code>&lt;setFeature></code> element of the <code>&lt;annotation></code> definition it is
possible to set UIMA features for the created annotation. The mandatory features
for the <code>&lt;setFeature></code> element are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>name</code>
- Specifies the UIMA feature name that should be set. The feature has to
be a valid UIMA feature for this annotation and has to be defined in the
UIMA type system.
</para>
</listitem>
<listitem>
<para>
<code>type</code>
- Specifies the type of the UIMA feature. For a list of all
possible feature types please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureTypes"/>.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The optional features are:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>normalization</code>
- Specifies the normalization that should be performed before the feature value
is assigned to the UIMA annotation. For a list of all built-in
normalization functions please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureNormalization"/>.
</para>
</listitem>
<listitem>
<para>
<code>class</code>
- Specifies the custom normalization class that should be used to normalize the
feature value before it is assigned to the UIMA annotation. Custom normalization
classes are used if the <code>normalization</code> feature has the value
<code>Custom</code>. The normalization class has to implement the
<code>org.apache.uima.annotator.regex.extension.Normalization</code> interface.
For details about the feature normalization please refer to
<xref linkend="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureNormalization"/>.
</para>
</listitem>
</itemizedlist>
</para>
<para>
The content of the <code>&lt;setFeature></code> element specifies the value of the
UIMA feature that is set. The value can be a literal, a capturing group or a combination of
both.
There are two notations to refer to the value of a capturing group.
The first notation is <code>$</code> followed by the capturing group number from 0 to 9,
e.g. <code>$0</code> for capturing group 0 or <code>$7</code> for capturing group 7.
The second notation uses capturing group names.
If the rule contains named capturing groups, these groups can be accessed
with <code>${matchGroupName}</code>. To access capturing
groups with a number greater than 9, capturing group names must be used. An example for capturing group names is
shown below:
</para>
<para>
To add a name to a capturing group, add the fragment <code>\m{groupname}</code>
directly in front of the opening parenthesis of the capturing group.
<programlisting><emphasis><![CDATA[<concept name="capturingGroupNames">
<rules>
<rule ruleId="ID1"
regEx="My \m{groupName}(named capturing group) example"
matchStrategy="matchAll"
matchType="uima.tcas.DocumentAnnotation"/>
</rules>
<createAnnotations>
<annotation type="org.apache.uima.TestAnnot">
<begin group="0"/>
<end group="0"/>
<setFeature name="testFeature0" type="String">
${groupName}
</setFeature>
</annotation>
</createAnnotations>
</concept>
]]></emphasis></programlisting>
</para>
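<para>
The numbered notation can be combined with literal text in the same way. The fragment below
is only a sketch: the concept name, the regular expression, the annotation type
<code>org.example.PhoneNumber</code> and the feature names <code>display</code> and
<code>raw</code> are assumptions made for illustration and have to be replaced by types and
features from your own type system.
</para>
<para>
<programlisting><emphasis><![CDATA[<concept name="numberedGroups">
  <rules>
    <rule ruleId="ID2"
        regEx="(\d{3})-(\d{4})"
        matchStrategy="matchAll"
        matchType="uima.tcas.DocumentAnnotation"/>
  </rules>
  <createAnnotations>
    <annotation type="org.example.PhoneNumber">
      <begin group="0"/>
      <end group="0"/>
      <!-- literal text combined with capturing groups 1 and 2 -->
      <setFeature name="display" type="String">tel: $1 $2</setFeature>
      <!-- capturing group 0 is the complete match -->
      <setFeature name="raw" type="String">$0</setFeature>
    </annotation>
  </createAnnotations>
</concept>
]]></emphasis></programlisting>
</para>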
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureTypes">
<title>Feature Types</title>
<para>
When setting a UIMA feature for an annotation using the <code>&lt;setFeature></code> element,
the feature type has to be specified according to the UIMA type system definition.
This is done with the <code>type</code> feature of the <code>&lt;setFeature></code> element.
The list below shows all currently supported feature types:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>String</code>
- for <code>uima.cas.String</code> based UIMA features.
</para>
</listitem>
<listitem>
<para>
<code>Integer</code>
- for <code>uima.cas.Integer</code> based UIMA features.
</para>
</listitem>
<listitem>
<para>
<code>Float</code>
- for <code>uima.cas.Float</code> based UIMA features.
</para>
</listitem>
<listitem>
<para>
<code>Reference</code>
- to link a UIMA feature to another annotation. In this case the
UIMA feature type has to be the same as the referred annotation type.
To reference another annotation instance, the content of the <code>&lt;setFeature></code>
element must be the <code>id</code> of the referred
annotation. The referred annotation instance is the annotation created for
the current match.
</para>
</listitem>
<listitem>
<para>
<code>Confidence</code>
- to add the value of the <code>confidence</code> feature defined
at the <code>&lt;rule></code> element to this feature. The UIMA feature has to
be of type <code>uima.cas.Float</code>.
</para>
</listitem>
<listitem>
<para>
<code>RuleId</code>
- to add the value of the <code>ruleId</code> feature defined
at the <code>&lt;rule></code> element to this feature. The UIMA feature has to
be of type <code>uima.cas.String</code>.
</para>
</listitem>
</itemizedlist>
</para>
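<para>
The <code>Reference</code>, <code>Confidence</code> and <code>RuleId</code> types are easiest
to understand in context. The following concept is a sketch under assumptions: the types
<code>org.example.EmailAddress</code> and <code>org.example.Domain</code>, all feature names
and the simplified regular expression are hypothetical. It assumes that
<code>confidenceValue</code> is declared as <code>uima.cas.Float</code>,
<code>createdBy</code> as <code>uima.cas.String</code> and <code>address</code> as
<code>org.example.EmailAddress</code> in the type system.
</para>
<para>
<programlisting><emphasis><![CDATA[<concept name="emailAddress">
  <rules>
    <rule ruleId="email1" confidence="0.8"
        regEx="([a-z0-9._-]+)@([a-z0-9.-]+\.[a-z]{2,})"
        matchStrategy="matchAll"
        matchType="uima.tcas.DocumentAnnotation"/>
  </rules>
  <createAnnotations>
    <annotation id="emailAnnot" type="org.example.EmailAddress">
      <begin group="0"/>
      <end group="0"/>
      <!-- receives the confidence value 0.8 of the matching rule -->
      <setFeature name="confidenceValue" type="Confidence"/>
      <!-- receives the rule ID "email1" of the matching rule -->
      <setFeature name="createdBy" type="RuleId"/>
    </annotation>
    <annotation id="domainAnnot" type="org.example.Domain">
      <begin group="2"/>
      <end group="2"/>
      <!-- links to the EmailAddress annotation created for the same match -->
      <setFeature name="address" type="Reference">emailAnnot</setFeature>
    </annotation>
  </createAnnotations>
</concept>
]]></emphasis></programlisting>
</para>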
<note>
<para>
Float and Integer based feature values are converted using the Java NumberFormat for the
current Java default locale. If the feature value cannot be converted, the feature value is not
set and a warning is written to the log. To prevent these warnings it may be useful
to apply a custom normalization to the numbers before they are assigned to the feature.
</para>
</note>
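<para>
The locale dependency mentioned in the note can be reproduced with plain Java. The listing
below is an illustration only and is not taken from the annotator source: the class name
<code>LocaleParsingDemo</code> is hypothetical, and the locales are passed explicitly
instead of relying on the default locale so that the effect is visible on any machine.
</para>
<para>
<programlisting><![CDATA[import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class; it only uses the standard java.text API.
class LocaleParsingDemo {

  // Parses the text with the number format of the given locale.
  static double parse(String text, Locale locale) {
    try {
      return NumberFormat.getInstance(locale).parse(text).doubleValue();
    } catch (ParseException e) {
      throw new IllegalArgumentException("not a number: " + text, e);
    }
  }

  public static void main(String[] args) {
    // German locale: '.' is the grouping separator, ',' the decimal separator
    System.out.println(parse("1.234,5", Locale.GERMANY)); // prints 1234.5
    // US locale: '.' is the decimal separator, parsing stops at the ','
    System.out.println(parse("1.234,5", Locale.US));      // prints 1.234
  }
}]]></programlisting>
</para>
<para>
The same matched text therefore yields different feature values depending on the locale of
the Java virtual machine, which is one reason to normalize numbers to a known format first.
</para>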
</section>
<section id="sandbox.regexAnnotator.conceptsFile.annotationCreation.featureNormalization">
<title>Feature Value Normalization</title>
<para>
Before assigning a feature value to an annotation it is possible to
normalize the feature value. This can be useful, for example, to convert
a detected email address to lower case before it is added to the annotation.
To normalize a feature value, the <code>normalization</code> feature of the
<code>&lt;setFeature></code> element is used. In addition to the built-in normalization
functions, the RegexAnnotator provides an extension point that can be
implemented to add a custom normalization.
</para>
<para>
The built-in functions that can be specified as value of
the <code>normalization</code> feature are listed below:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>ToLowerCase</code>
- normalize the feature value to lower case before it is assigned to the annotation.
</para>
</listitem>
<listitem>
<para>
<code>ToUpperCase</code>
- normalize the feature value to upper case before it is assigned to the annotation.
</para>
</listitem>
<listitem>
<para>
<code>Trim</code>
- remove all leading and trailing whitespace characters from the feature value before
it is assigned to the annotation.
</para>
</listitem>
</itemizedlist>
Built-in normalization configuration:
<programlisting><emphasis><![CDATA[<setFeature name="normalizedFeature" type="String"
normalization="ToLowerCase">$0</setFeature>]]></emphasis></programlisting>
</para>
<para>
In case of a custom normalization, the <code>normalization</code> feature must have the value
<code>Custom</code>, and an additional feature of the <code>&lt;setFeature></code> element
called <code>class</code> has to be specified, containing the fully qualified class name of the
custom normalization implementation. The custom normalization implementation has to implement
the interface <code>org.apache.uima.annotator.regex.extension.Normalization</code>
(<xref linkend="sandbox.regexAnnotator.Normalization"/>), which defines the
<code>normalize</code> method to normalize the feature values. A sample implementation with
the corresponding configuration is shown below.
</para>
<para>
Custom normalization implementation:
<programlisting><![CDATA[package org.apache.uima;

public class CustomNormalizer
    implements org.apache.uima.annotator.regex.extension.Normalization {

  /* (non-Javadoc)
   * @see org.apache.uima.annotator.regex.extension.Normalization
   * #normalize(java.lang.String, java.lang.String)
   */
  public String normalize(String input, String ruleId) {
    // implement your custom normalization here; trimming and
    // lower-casing the input is only a placeholder
    String result = input.trim().toLowerCase();
    return result;
  }
}]]></programlisting>
</para>
<para>
Custom normalization configuration:
<programlisting><emphasis><![CDATA[<setFeature name="normalizedFeature" type="String"
normalization="Custom" class="org.apache.uima.CustomNormalizer">
$0
</setFeature>]]></emphasis></programlisting>
</para>
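<para>
As a more concrete illustration, the following sketch normalizes phone numbers by removing
all separators. It is an assumption-based example: the class name
<code>DigitsOnlyNormalizer</code> is hypothetical, and the <code>Normalization</code>
interface declared in the listing is a stand-in that mirrors
<code>org.apache.uima.annotator.regex.extension.Normalization</code> so that the example
compiles on its own.
</para>
<para>
<programlisting><![CDATA[// Stand-in for org.apache.uima.annotator.regex.extension.Normalization
// (assumption: declared here only to keep the sketch self-contained).
interface Normalization {
  String normalize(String input, String ruleID) throws Exception;
}

// Hypothetical normalizer that reduces a phone number to its digits.
class DigitsOnlyNormalizer implements Normalization {

  public String normalize(String input, String ruleID) {
    String trimmed = input.trim();
    // keep a leading plus sign of international numbers
    String prefix = trimmed.startsWith("+") ? "+" : "";
    // remove every character that is not a digit
    return prefix + trimmed.replaceAll("\\D", "");
  }
}]]></programlisting>
</para>
<para>
With this class, the input <code>+1 (555) 010-9999</code> is stored as
<code>+15550109999</code>. In the concept file it would be referenced through the
<code>class</code> feature with its fully qualified class name, in the same way as
<code>org.apache.uima.CustomNormalizer</code> above.
</para>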
</section>
</section>
</section>
</chapter>
<chapter id="sandbox.regexAnnotator.annotatorDescriptor">
<title>Annotator Descriptor</title>
<para>The RegexAnnotator analysis engine descriptor contains processing information for
the annotator. The processing information is specified as configuration parameters.
This chapter explains the possible descriptor settings in detail.
</para>
<section id="sandbox.regexAnnotator.annotatorDescriptor.configParam">
<title>Configuration Parameters</title>
<para>
The RegexAnnotator has the following configuration parameters:
</para>
<para>
<itemizedlist>
<listitem>
<para>
<code>ConceptFiles</code>
- This parameter is modeled as an array of strings and contains
the concept files the annotator should use. The concept files
must be specified using a relative path that can be resolved against the
UIMA datapath or the classpath. When you use the UIMA datapath,
you can use wildcard expressions such as <code>rules/*.rule</code>.
Wildcard expressions do not work when rule files
are discovered via the classpath.
<programlisting><emphasis><![CDATA[<nameValuePair>
<name>ConceptFiles</name>
<value>
<array>
<string>subdir/myConcepts.xml</string>
<string>SampleConcept.xml</string>
</array>
</value>
</nameValuePair>]]></emphasis></programlisting>
</para>
</listitem>
</itemizedlist>
</para>
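<para>
When the concept files are resolved through the UIMA datapath, the wildcard form described
above can replace the explicit file list. The fragment below is a sketch that assumes a
directory named <code>rules</code> below one of the datapath entries; the directory name
and the <code>.rule</code> file extension are taken from the wildcard example above and are
not required by the annotator.
</para>
<para>
<programlisting><emphasis><![CDATA[<nameValuePair>
  <name>ConceptFiles</name>
  <value>
    <array>
      <!-- matches every file ending in .rule in the rules directory -->
      <string>rules/*.rule</string>
    </array>
  </value>
</nameValuePair>]]></emphasis></programlisting>
</para>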
</section>
<section id="sandbox.regexAnnotator.annotatorDescriptor.capabilities">
<title>Capabilities</title>
<para>
In the capabilities section of the RegexAnnotator descriptor the input and output
capabilities and the supported languages have to be defined.
</para>
<para>
The input capabilities defined
in the descriptor have to comply with the match types used in the concept file
that is used. For example, the <code>uima.SentenceAnnotation</code> type used in the rule
below has to be added to the input capability section of the RegexAnnotator descriptor.
</para>
<para>
<programlisting><emphasis><![CDATA[<rules>
<rule regEx="SampleRegex" matchStrategy="matchAll"
matchType="uima.SentenceAnnotation"/>
</rules>
]]></emphasis></programlisting>
</para>
<para>
In the output section, all of the annotation types and features created by
the RegexAnnotator have to be specified. These have to match the
output types and features declared in the <code>&lt;annotation></code> elements of the concept file.
For example the <code>org.apache.uima.TestAnnot</code> annotation and the
<code>org.apache.uima.TestAnnot:testFeature</code> feature used below have to
be added to the output capability section in the RegexAnnotator descriptor.
</para>
<para>
<programlisting><emphasis><![CDATA[<createAnnotations>
<annotation type="org.apache.uima.TestAnnot">
<begin group="0"/>
<end group="0"/>
<setFeature name="testFeature" type="String">$0</setFeature>
</annotation>
</createAnnotations>
]]></emphasis></programlisting>
</para>
<para>
If there are any language dependent rules in the concept file, the language abbreviations
have to be specified in the <code>&lt;languagesSupported></code> element. If there are no
language dependent rules, you can specify <code>x-unspecified</code> as language. That means
that the annotator can work on all languages.
</para>
<para>
For the short examples used above the capabilities section in the RegexAnnotator
descriptor looks like:
</para>
<para>
<programlisting><emphasis><![CDATA[<capabilities>
<capability>
<inputs>
<type>uima.SentenceAnnotation</type>
</inputs>
<outputs>
<type>org.apache.uima.TestAnnot</type>
<feature>org.apache.uima.TestAnnot:testFeature</feature>
</outputs>
<languagesSupported>
<language>x-unspecified</language>
</languagesSupported>
</capability>
</capabilities>
]]></emphasis></programlisting>
</para>
</section>
</chapter>
<appendix id="sandbox.regexAnnotator.xsd">
<title>Concept File Schema</title>
<para>The XML schema that defines the structure of a concept file is shown below:
</para>
<para>
<programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://incubator.apache.org/uima/regex"
xmlns="http://incubator.apache.org/uima/regex"
elementFormDefault="qualified">
<!--
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
-->
<xs:element name="conceptSet">
<xs:complexType>
<xs:sequence>
<xs:element ref="concept" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="concept">
<xs:complexType>
<xs:sequence>
<xs:element ref="rules" minOccurs="1" maxOccurs="1"/>
<xs:element ref="createAnnotations" minOccurs="1" maxOccurs="1"/>
</xs:sequence>
<xs:attribute name="name" type="xs:string" use="optional"/>
</xs:complexType>
</xs:element>
<xs:element name="createAnnotations">
<xs:complexType>
<xs:sequence>
<xs:element ref="annotation" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="rules">
<xs:complexType>
<xs:sequence>
<xs:element ref="rule" minOccurs="1" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="rule">
<xs:complexType>
<xs:all>
<xs:element ref="matchTypeFilter" minOccurs="0" maxOccurs="1"/>
<xs:element ref="updateMatchTypeAnnotation" minOccurs="0" maxOccurs="1"/>
<xs:element ref="ruleExceptions" minOccurs="0" maxOccurs="1"/>
</xs:all>
<xs:attribute name="regEx" type="xs:string" use="required"/>
<xs:attribute name="matchStrategy" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="matchFirst"/>
<xs:enumeration value="matchAll"/>
<xs:enumeration value="matchComplete"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="matchType" type="xs:string" use="required"/>
<xs:attribute name="featurePath" type="xs:string" use="optional" />
<xs:attribute name="ruleId" type="xs:string" use="optional"/>
<xs:attribute name="confidence" type="xs:decimal" use="optional"/>
</xs:complexType>
</xs:element>
<xs:element name="matchTypeFilter">
<xs:complexType>
<xs:sequence>
<xs:element ref="feature" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="ruleExceptions">
<xs:complexType>
<xs:sequence>
<xs:element ref="exception" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="exception">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="matchType" type="xs:string" use="required"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<xs:element name="feature">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="featurePath" type="xs:string" use="required"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<xs:element name="annotation">
<xs:complexType>
<xs:sequence>
<xs:element ref="begin" minOccurs="1" maxOccurs="1"/>
<xs:element ref="end" minOccurs="1" maxOccurs="1"/>
<xs:element ref="setFeature" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string" use="optional"/>
<xs:attribute name="type" type="xs:string" use="required"/>
<xs:attribute name="validate" type="xs:string" use="optional" />
</xs:complexType>
</xs:element>
<xs:element name="updateMatchTypeAnnotation">
<xs:complexType>
<xs:sequence>
<xs:element ref="setFeature" minOccurs="0" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:element name="begin">
<xs:complexType>
<xs:attribute name="group" use="required" type="xs:integer"/>
<xs:attribute name="location" use="optional" default="start">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="start"/>
<xs:enumeration value="end"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
<xs:element name="end">
<xs:complexType>
<xs:attribute name="group" use="required" type="xs:integer"/>
<xs:attribute name="location" use="optional" default="end">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="start"/>
<xs:enumeration value="end"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
</xs:complexType>
</xs:element>
<xs:element name="setFeature">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="name" type="xs:string" use="required"/>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="String"/>
<xs:enumeration value="Integer"/>
<xs:enumeration value="Float"/>
<xs:enumeration value="Reference"/>
<xs:enumeration value="Confidence"/>
<xs:enumeration value="RuleId"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="normalization" use="optional">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="Custom" />
<xs:enumeration value="ToLowerCase" />
<xs:enumeration value="ToUpperCase" />
<xs:enumeration value="Trim" />
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="class" type="xs:string" use="optional" />
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
</xs:schema>
]]></programlisting>
</para>
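<para>
To see how the schema elements fit together, the following minimal concept file uses only
the required elements and features plus a few optional ones. It is a sketch, not a file
shipped with the annotator: the concept name, the regular expression, the annotation type
<code>org.example.ZipCode</code> and the feature name <code>value</code> are assumptions.
</para>
<para>
<programlisting><emphasis><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<conceptSet xmlns="http://incubator.apache.org/uima/regex">
  <concept name="zipCode">
    <rules>
      <!-- regEx, matchStrategy and matchType are required -->
      <rule ruleId="zip1" confidence="1.0"
          regEx="\b\d{5}\b"
          matchStrategy="matchAll"
          matchType="uima.tcas.DocumentAnnotation"/>
    </rules>
    <createAnnotations>
      <!-- type, begin and end are required -->
      <annotation id="zip" type="org.example.ZipCode">
        <begin group="0"/>
        <end group="0"/>
        <setFeature name="value" type="String"
            normalization="Trim">$0</setFeature>
      </annotation>
    </createAnnotations>
  </concept>
</conceptSet>
]]></emphasis></programlisting>
</para>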
</appendix>
<appendix id="sandbox.regexAnnotator.Validation">
<title>Validation Interface</title>
<para>
<programlisting><![CDATA[/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.uima.annotator.regex.extension;
/**
* The Validation interface is provided to implement a custom validator
* that can be used to validate regular expression matches before
* they are added as annotations.
*/
public interface Validation {
/**
* The validate method validates the covered text of an annotation and
* returns true if the annotation is correct and false otherwise.
* The validate method is called between a rule match and the
* annotation creation. The annotation is only created if the method
* returns true.
*
* @param coveredText covered text of the annotation that should be
* validated
* @param ruleID ruleID of the rule which created the match
*
* @return true if the annotation is valid or false if the annotation
* is invalid
*
* @throws Exception throws an exception if a validation error occurred
*/
public boolean validate(String coveredText, String ruleID)
throws Exception;
}]]></programlisting>
</para>
</appendix>
<appendix id="sandbox.regexAnnotator.Normalization">
<title>Normalization Interface</title>
<para>
<programlisting><![CDATA[/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.uima.annotator.regex.extension;
/**
* The Normalization interface was added to implement a custom normalization
* for feature values before they are assigned to an annotation.
*/
public interface Normalization {
/**
* Custom feature value normalization. This interface must be implemented
* to perform a custom normalization on the given input string.
*
* @param input input string which should be normalized
*
* @param ruleID rule ID of the matching rule
*
* @return String - normalized input string
*/
public String normalize(String input, String ruleID) throws Exception;
}]]></programlisting>
</para>
</appendix>
</book>