<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [ | |
<!ENTITY imgroot "../images"> | |
<!ENTITY % xinclude SYSTEM "../../../uima-docbook-tool/xinclude.mod"> | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<book lang="en"> | |
<title>CFE User Guide</title> | |
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../../SandboxDocs/src/docbook/book_info.xml"/> | |
<chapter id="_Overview"> | |
<title> | |
Overview | |
</title> | |
<section id="_Motivation"> | |
<title> | |
Motivation | |
</title> | |
<para role="Normal">Feature extraction, the extraction of | |
information from data sources, is a common task frequently required | |
to be performed by many different types of applications, such as | |
machine learning, performance evaluation, and statistical analysis. | |
This guide describes a tool that can be used to facilitate this | |
extraction process, in conjunction with the Unstructured Information | |
Management Architecture (UIMA), particularly focusing on text | |
processing applications. UIMA provides a mechanism for executing | |
modules called Analysis Engines that analyze artifacts (text | |
documents in our case) and store the results of the analysis in a | |
data structure called the Common Analysis Structure (CAS). These | |
results are stored as Feature Structures, which are simply data | |
structures that have an associated type and a set of properties in | |
the form of attribute/value pairs. Feature Structures that are | |
attached to a particular span of a text document are called | |
Annotations. They usually represent a concept that the analysis | |
engine computes based on the text. The attributes are called | |
<code>Features</code> in UIMA terminology. This sense of feature will always be | |
referred to as <code>UIMA feature</code> in this document, so as not to be | |
confused with the general sense of <code>feature</code> when discussing | |
<code>feature extraction</code>, referring to the process of extracting values | |
from data sources (in our case, the CAS). Values that are extracted | |
are not required to be values of attributes (i.e., UIMA Features) of | |
Annotations, but can be computed by other methods, as will be shown | |
later. The terms <code>features</code> and <code>feature values</code> | |
in this document refer to any value extracted from the CAS, regardless of the particular | |
source. | |
</para> | |
<para role="Normal" /> | |
<para role="Normal">As an example, Figure 1 depicts annotation objects | |
of the type Token that are associated with individual words, each | |
having attributes <code>Index</code> and <code>POS</code> (part of speech). A feature | |
extraction task could be "extract token indexes for the words that | |
are nouns". Such a task is translated to the following execution | |
steps: | |
</para> | |
<orderedlist numeration="arabic" spacing="normal"> | |
<listitem> | |
<para role="Normal">find an annotation of type <code>Token</code></para>
</listitem>
<listitem>
<para role="Normal">examine the value of its <code>POS</code> attribute</para>
</listitem>
<listitem>
<para role="Normal">extract the value of the <code>Index</code> attribute only if
the value of the <code>POS</code> attribute is <code>NN</code>
</para>
</listitem> | |
</orderedlist> | |
<para role="Normal">The expression "word that is a noun" defines a
concept; the annotations that represent it have to be found in the
CAS. <code>Token index</code> is the information (i.e., the <code>feature</code>) to be
extracted. The resulting values for the task will be 3 and 9,
which are the values of the <code>Index</code> attribute for the words <code>car</code> and
<code>finish</code>.
</para>
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-1.jpg" /> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="LREC Caption"> | |
Figure 1: Annotated text sample | |
</para> | |
<para role="Normal">While Figure 1 shows a fairly simple example of
annotation types associated with some text, real world applications
could have quite sophisticated annotation types, storing various
kinds of computed information. Consider an annotation type Car that
has, for illustration purposes, just two attributes: <code>Color</code> and
<code>Engine</code>. While the attribute <code>Color</code> is of type string, the <code>Engine</code>
attribute is a complex annotation type with attributes <code>Cylinders</code> and
<code>Size</code>. This is represented by a UML diagram in Figure 2, illustrating
a class hierarchy on the left and a sample instance of this class
structure on the right.
</para>
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-3.jpg" /> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="LREC Caption"> | |
Figure 2: Composite object sample | |
</para> | |
<para role="Normal"> | |
If a requirement is to extract the number of cylinders of the car's | |
engine, then the application needs to find any object(s) that represent | |
the concept of a car (<code>CarAnnotation</code> in this case) and traverse the | |
object's structure to access the <code>Cylinders</code> attribute of <code>EngineAnnotation</code>. | |
Once the attribute's value is accessed, the application outputs it to the | |
desired destination, such as a text file or a database. | |
</para> | |
</section> | |
<section id="_Approaches_to_feature_extraction"> | |
<title> | |
Approaches to feature extraction | |
</title> | |
<section id="_Custom_CAS_Consumers"> | |
<title> | |
Custom CAS Consumers | |
</title> | |
<para role="Normal"> | |
When working with UIMA, feature extraction is usually implemented by
writing a special UIMA component called a CAS Consumer. It contains
custom code that traverses the structure of the annotations, examines the
values of specific attributes, and outputs the accessed values to a file,
memory, or a database, as required by the application. Writing CAS
consumers can be labor intensive and requires Java programming. While this
approach allows powerful control and customization to an application's
needs, supporting the code can become problematic, especially as
application requirements change: maintenance, evolution, bug fixing, and
reuse all become harder.
</para> | |
</section> | |
<section id="_CFE_approach"> | |
<title> | |
CFE approach | |
</title> | |
<para role="Normal" /> | |
<para role="Normal"> | |
CFE is a multipurpose tool that enables feature extraction from a UIMA | |
CAS in a very generalized and application independent way. The extraction | |
process is performed according to rules expressed using the Feature | |
Extraction Specification Language (FESL) that are stored in configuration | |
files. Using CFE eliminates the need for creating customized CAS | |
consumers and writing Java code for every application. Instead, by using | |
FESL rules in XML format, users can customize the information extraction | |
process to suit their application. FESL's rule semantics allow
multi-parameter criteria that precisely identify the information
to be extracted. The FESL syntax
and semantics are defined further in this guide.</para>
</section> | |
</section> | |
<section id="_CFE_Basics"> | |
<title> | |
CFE Basics | |
</title> | |
<para role="Normal">The feature extraction process involves three | |
major steps:</para> | |
<orderedlist numeration="arabic" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
locating a concept of interest that is represented by a UIMA annotation
object; examples of such concepts could be "word that is a noun" or "a
car that has a six cylinder engine". The annotation object that
represents such a concept is referred to as the Target Annotation (TA).
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
locating concepts, relative to the TAs, that specify the information to
extract. These are also represented by UIMA annotations that are within
some context of the TAs. Some examples of context could be "to the left
of the TA" or "within the TA". The annotation object that corresponds
to such a concept is referred to as the Feature Annotation (FA).
In relation to Figure 1, an example FA could be the expression "two words
to the left of the word finish that is a noun", assuming that "word finish
that is a noun" describes the TA. The result of such a specification
will be the tokens <code>at</code> and <code>the</code>.
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">extraction of the specified information | |
from FAs | |
</para> | |
</listitem> | |
</orderedlist> | |
<para role="Normal"> | |
<anchor id="FA" /> | |
To illustrate the process, suppose the requirement is "to
extract indexes of two words to the left of the word finish that is
a noun". In such a scenario, in the first step, CFE locates a TA
that is represented by an annotation object that corresponds to the word
<code>finish</code> and has its <code>POS</code> attribute equal to <code>NN</code>. In the
second step, the FAs that correspond to the two words to the left of the TA
are located. In the third step, the values of the <code>Index</code> attribute of
each FA that was found are extracted. It is possible, however,
that the requirement is to extract the value of the <code>Index</code> attribute
from the annotation for the word <code>finish</code> itself. In such a case,
the TA and FA are represented by the same UIMA annotation object.
This is usually the case when extracting features for evaluation or
testing. A TA or FA can be specified by
complex multi-parameter conditions that are also expressed using
FESL, as will be shown later.
</para> | |
</section> | |
</chapter> | |
<chapter id="_Components"> | |
<title> | |
Components | |
</title> | |
<section id="_FESL_XSD"> | |
<title> | |
FESL XSD | |
</title> | |
<para role="Normal"> | |
The specification for FESL is written in XSD format and stored in the
file &lt;CFE_HOME&gt;/src/main/xsdForEmf/CFEConfigModel.xsd (to be used
by the EMF-based parser generator) and in &lt;CFE_HOME&gt;/src/main/xsdForXMLBeans
(for the XMLBeans parser generator). Using this XSD in conjunction with an
XML editor that provides syntax validation can
make editing FESL configuration files more efficient.
</para> | |
</section> | |
<section id="_Source_Code"> | |
<title> | |
Source Code | |
</title> | |
<para role="Normal">CFE is implemented in Java 5.0 for Apache UIMA and
resides in the org.apache.uima.tools.cfe package. CFE depends on
Eclipse EMF, Apache UIMA, and the Apache XMLBeans and JXPath
libraries. The source code contains the complete implementation of
CFE, including auxiliary utility classes that wrap some UIMA
functionality (located in the org.apache.uima.tools.cfe.support package).
</para>
</section> | |
<section id="_Descriptors"> | |
<title> | |
Descriptors | |
</title> | |
<para role="Normal"> | |
A sample descriptor file that defines a type system for machine learning | |
processing is located in
&lt;CFE_HOME&gt;/src/main/resources/descriptors/type_system/AppliedSenseAnnotation.xml
</para> | |
<para role="Normal"> | |
A sample descriptor that uses CFE in a CAS Consumer is located in
&lt;CFE_HOME&gt;/src/main/resources/descriptors/cas_consumers/UIMAFeatureConsumer.xml
</para> | |
</section> | |
<section id="_Type_Dependencies"> | |
<title> | |
Type Dependencies | |
</title> | |
<para role="Normal"> | |
CFE code uses the UIMA example annotation type
<code>org.apache.uima.examples.SourceDocumentInformation</code>
to retrieve the name of the document that is being processed.
Typically, annotations of this type are produced by the file collection reader
provided with the UIMA examples. If a UIMA application uses a different type
of reader, an annotation of this type should be created and initialized
for each document prior to execution of the TAE. Please see
&lt;CFE_HOME&gt;/src/test/java/org/apache/uima/tools/cfe/test/CFEtest.java
for an example.
</para> | |
</section> | |
</chapter> | |
<chapter id="_Configuration_Files"> | |
<title> | |
Configuration Files | |
</title> | |
<section id="_Common_notations_and_tags"> | |
<title> | |
Common notations and tags | |
</title> | |
<para role="Normal"> | |
CFE configuration files are written using FESL semantic rules, as defined | |
in CFEConfig.xsd. These rules describe the information extraction process | |
and are independent of the application from which the information is to | |
be extracted. There are several common notations and tags that are used
in different elements of FESL.
</para> | |
<section id="_Feature_path"> | |
<title> | |
Feature path | |
</title> | |
<para role="Normal"> | |
A "feature path" is a mechanism used by FESL to identify a particular
feature (not necessarily a UIMA feature) of an annotation. The value
associated with the feature indicated by the feature path can be
evaluated against certain criteria, extracted to the final output, or
both. The syntax of a feature path is an indexed sequence of
attribute/method names separated by the colon character. Such a sequence
mimics the sequence of Java method calls required to extract the feature
value. For example, the value of the <code>EngineAnnotation</code> attribute <code>Cylinders</code>
from Figure 2 can be written as <code>CarAnnotation:Engine:Cylinders</code>, where
<code>Engine</code> is an attribute of <code>CarAnnotation</code>. The intermediate results of each
step of the call sequence can be referred to from different FESL structural
elements by their zero-based index. For instance, the Parent Tag notation
(see below) uses the index to access intermediate values. The feature
path can be used to identify feature values that are either primitives or
complex object types.
</para> | |
</section> | |
<section id="_Full_path_and_partial_path"> | |
<title> | |
Full path and partial path | |
</title> | |
<para role="Normal"> | |
There are two different ways of using feature path notation to identify | |
an object: full path and partial path. The object can be one of the | |
following: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">an annotation</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">value of an annotation's attribute</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
value of a result of an annotation's method; only get-style methods | |
(methods that return a value and take no parameters) are supported. | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal"> | |
A full path specifies a path to an object starting from its type. For | |
instance, if <code>EngineAnnotation</code> is specified as a full path, it would refer | |
to all instances of annotations of that type. If <code>CarAnnotation:Engine</code> is | |
specified, it would refer only to instances of the <code>EngineAnnotation</code> type that are | |
attributes of instances of the <code>CarAnnotation</code> type. Full path notation is usually | |
used for TA or FA identification. | |
</para> | |
<para role="Normal"> | |
A partial path specifies a path to an object starting from a previously
located annotation object (whether TA or FA). For example, if an instance
of <code>CarAnnotation</code> is located as a TA, then the size of its engine can be
specified as <code>Engine:Size</code>. Partial path notation is usually used to
specify the feature values that are being examined or extracted.
The distinction between "full path" and "partial path" is very similar to
the distinction between "absolute path" and "relative path" in a
computer's file system.
</para> | |
</section> | |
<section id="_TAM_and_FAM"> | |
<title> | |
TAM and FAM | |
</title> | |
<para role="Normal"> | |
Each FESL rule is represented by an XML element with the tag
<code>targetAnnotation</code> | |
, as specified in the XSD by the | |
<link linkend="_TargetAnnotationXML"> | |
<phrase role="Hyperlink2">TargetAnnotationXML</phrase> | |
</link> | |
type. Each element of this type is a composition of: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
a single target annotation matcher ( | |
<code>TAM</code> | |
) that is denoted by an XML element with the tag | |
<code>targetAnnotationMatcher</code> | |
, of the type | |
<link linkend="_PartialObjectMatcherXML"> | |
<code>PartialObjectMatcherXML</code> | |
</link> | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
optional feature annotation matchers ( | |
<code>FAM</code> | |
) denoted by XML elements with the tag <code>featureAnnotationMatchers</code>, | |
of the type | |
<link linkend="_FeatureObjectMatcherXML"> | |
<code>FeatureObjectMatcherXML</code> | |
</link> | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal"> | |
The | |
<code>TAM</code> | |
specifies search criteria for locating Target Annotations ( | |
<code>TA</code> | |
s), while | |
<code>FAM</code> | |
s contain criteria for locating Feature Annotations ( | |
<code>FA</code> | |
s) and the specification of features for extraction from the | |
<code>FA</code> | |
s. The criteria for the search and the features to be extracted are | |
specified using the | |
<link linkend="_Feature_path"> | |
<phrase role="Hyperlink1">feature path</phrase> | |
</link> | |
notation, as explained earlier. The XML tags representing the | |
matchers are detailed below. | |
</para> | |
</section> | |
<section id="_Arrays"> | |
<title> | |
Arrays | |
</title> | |
<para role="Normal"> | |
Since UIMA annotations may have arrays as attributes, FESL provides the | |
ability to perform feature extraction from array objects. In particular, | |
going back to Figure 2, if the implementation for the <code>Wheels</code> attribute is | |
a UIMA <code>FSArray</code> type, then using feature path notation: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
the feature value for the | |
<code>Wheels</code> | |
attribute of | |
<code>FSArray</code> | |
type can be specified as <code>CarAnnotation:Wheels</code>. | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
the feature value for the number of elements in the | |
<code>FSArray</code> | |
can be specified as <code>CarAnnotation:Wheels:size</code>, where size is a | |
method of | |
<code>FSArray</code> | |
; such value corresponds to a concept of how many wheels the car | |
has. | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">the feature values for individual elements of | |
<code>Wheels</code> attribute of type <code>WheelAnnotation</code> can be accessed as | |
<code>CarAnnotation:Wheels:toArray</code>. It should be noted that <code>toArray</code> is a | |
name of a method of the <code>FSArray</code> type rather than a name of an | |
attribute.</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">the feature values for <code>Diameter</code> attribute of each | |
<code>WheelAnnotation</code> can be specified as | |
<code>CarAnnotation:Wheels:toArray:Diameter</code> | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal"> | |
The result of using toArray as an accessor is an array of values. FESL | |
also provides syntax for accessing individual elements of arrays by index. | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
the feature for the diameter of the first wheel can be specified as | |
<code>CarAnnotation:Wheels:toArray[0]:Diameter</code> | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
the feature for the diameter of the first and second wheels can be | |
specified as <code>CarAnnotation:Wheels:toArray[0][1]:Diameter</code> | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
the feature for the diameter of the first three wheels can be specified
as <code>CarAnnotation:Wheels:toArray[0-2]:Diameter</code>
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal"> | |
The specifications of individual elements and ranges can be mixed. For example,
<code>CarAnnotation:Wheels:toArray[0][2-3]:Diameter</code> refers to the first, third, and
fourth elements of the <code>Wheels</code> attribute (for a car with four wheels, all elements
except the second). If a specified index falls outside
the range of the matched data, a null value will be assigned.
</para> | |
<para role="Normal"> | |
If required, FESL allows sorting the extracted features by the offset in the
text of the annotations that these features are extracted from. For
instance, <code>CarAnnotation:Wheels:toArray[sort]:Diameter</code> would ensure such
an order.
</para> | |
</section> | |
<section id="_Parent_tag"> | |
<title> | |
Parent tag | |
</title> | |
<para role="Normal"> | |
The parent tag is used to access a specific element of a feature path of
a TA or FA by index. If a parent tag is used within a TAM specification,
it is applied to the full path of the corresponding TA. Likewise, parent
tags contained in FAMs are applied to the full path of the
corresponding FA. The tag consists of the <code>__p</code> prefix followed by the index
of the element that is being accessed. For instance, <code>__p0</code> addresses the
first element of a feature path. The tag can be a part of a feature path.
For example, if a TA is specified as <code>CarAnnotation:Wheels:toArray</code>,
corresponding to the concept of "wheels of a car", then the value of the
<code>Color</code> attribute of the <code>CarAnnotation</code> object can be accessed by specifying
<code>__p0:Color</code>. Such a specification can be used when it is required to
examine or extract features of a containing annotation along with features
of contained annotations. Samples of using parent tags are provided in
the sections that detail FESL syntax, below.
</para> | |
</section> | |
<section id="_Null_values"> | |
<title> | |
Null values | |
</title> | |
<para role="Normal"> | |
CFE allows comparing feature values for equality to null. The root XML
element CFEConfig has a string attribute <code>nullValueImage</code> that sets the
literal representation of a null value. If an extracted feature value is
null, it will be converted to the string that is assigned to the
<code>nullValueImage</code> attribute. The example below illustrates the usage of this
attribute.
</para> | |
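<para role="Normal">
As a minimal sketch, the fragment below sets the literal <code>NULL</code> as the
null value image on the root element. The literal is an arbitrary choice made
for illustration, and the namespace declarations and child elements that a
complete configuration file requires are omitted:
</para>
<programlisting>&lt;CFEConfig nullValueImage="NULL"&gt;
  &lt;!-- FESL rules omitted --&gt;
&lt;/CFEConfig&gt;</programlisting>
<para role="Normal">
With such a setting, an extracted feature value that is null would appear in
the output as the string <code>NULL</code>.
</para>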
</section> | |
<section id="_Implicit_TA_exclusion"> | |
<title> | |
Implicit TA exclusion | |
</title> | |
<para role="Normal"> | |
While all FAM specifications for a single TAM are independent from | |
each other, there is an implicit dependency between TAMs. In | |
particular, they are dependent on the order in which they are | |
specified in a configuration file. Annotations corresponding to
certain concepts that were identified by a TAM that appears earlier in
the configuration file will be excluded from further processing by
FESL. This rule only applies to TAMs that use the
<code>fullPath</code> | |
attribute in their specification (see | |
<link linkend="_PartialObjectMatcherXML"> | |
<phrase role="Hyperlink1"> | |
<code>PartialObjectMatcherXML</code> | |
</phrase> | |
</link> | |
). Having the implicit exclusion helps to separate the processing of
annotations of the same type in the case when these annotations have
different semantic meanings. For instance, the set of features that is
required to be extracted from annotations of type | |
<code>EngineAnnotation</code> | |
that are attributes of | |
<code>CarAnnotation</code> | |
objects can be different than a set of features that is required to | |
be extracted from annotations of the same | |
<code>EngineAnnotation</code> | |
type that are attributes of some other type or are not attached to | |
any annotations of other types. To implement such a behavior in FESL, | |
the first
<code>TAM</code> | |
would contain criteria for locating | |
<code>EngineAnnotation</code> | |
objects that are attached to objects of the | |
<code>CarAnnotation</code> | |
type, while the second | |
<code>TAM</code> | |
would not specify any restriction on containment of objects of the | |
<code>EngineAnnotation</code> | |
type. If such a specification is given, all | |
<code>EngineAnnotation</code> | |
objects located according to the rule in the first | |
<code>TAM</code> | |
will be excluded from further processing and, hence, will not be | |
available for processing by rules given in the second | |
<code>TAM</code>.
</para> | |
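<para role="Normal">
The ordering described above might be sketched as follows. Only the element
tags and the <code>fullPath</code> attribute named in this guide are shown. The type
names are abbreviated as in Figure 2, and any other attributes or child elements
that the XSD requires are omitted, so the fragment is an illustration under
these assumptions rather than a complete configuration:
</para>
<programlisting>&lt;!-- first rule: engines that are attributes of cars --&gt;
&lt;targetAnnotation&gt;
  &lt;targetAnnotationMatcher fullPath="CarAnnotation:Engine" /&gt;
  &lt;!-- featureAnnotationMatchers for car engines omitted --&gt;
&lt;/targetAnnotation&gt;

&lt;!-- second rule: EngineAnnotation objects not matched above --&gt;
&lt;targetAnnotation&gt;
  &lt;targetAnnotationMatcher fullPath="EngineAnnotation" /&gt;
  &lt;!-- featureAnnotationMatchers for other engines omitted --&gt;
&lt;/targetAnnotation&gt;</programlisting>
<para role="Normal">
Because the first rule appears earlier in the file, the engines it locates are
excluded before the second rule is applied. Reversing the order would let the
unrestricted rule claim every <code>EngineAnnotation</code> object first.
</para>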
</section> | |
</section> | |
<section id="_FESL_Elements"> | |
<title> | |
FESL Elements | |
</title> | |
<para role="Normal"> | |
FESL's XSD defines several elements that allow specifying rules for feature
extraction. These elements may contain attributes and other elements in
their definitions.
</para> | |
<section id="_BitsetFeatureValuesXML"> | |
<title> | |
BitsetFeatureValuesXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: bitmask[1]: Integer</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: exact_match[0..1]: boolean: default false</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-7.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
The specification enables comparing a feature value to an integer
bitmask. The feature value is considered to be matched if it is of an
Integer type and either of the following holds:
</para>
<itemizedlist mark="disc" spacing="normal">
<listitem>
<para role="Normal">
the <code>exact_match</code> attribute is set to true and all "1" bits specified in
the bitmask are also set in the feature value
</para>
</listitem>
<listitem>
<para role="Normal">
the <code>exact_match</code> attribute is set to false and at least one of the "1" bits
specified in the bitmask is also set in the feature value
</para>
</listitem>
</itemizedlist>
<para role="Normal">Example:</para> | |
<para role="Normal">&lt;bitsetFeatureValues bitmask="3" exact_match="false" /&gt;</para>
<para role="Normal">&lt;bitsetFeatureValues bitmask="3" exact_match="true" /&gt;</para>
<para role="Normal"> | |
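The first line of the example specifies a test of whether either of the two
least significant bits of a feature value is set. To be successful, the
test specified by the second line requires both of these bits to be set.
For instance, a feature value of 5 (binary 101) passes the first test but
fails the second, while a value of 7 (binary 111) passes both.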
</para> | |
</section> | |
<section id="_EnumFeatureValuesXML"> | |
<title> | |
EnumFeatureValuesXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: caseSensitive[0..1]: boolean: default false</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Element: values[0..*]: String</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-8.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
The EnumFeatureValuesXML element allows testing whether a feature value belongs to
a finite set of values. According to the EnumFeatureValuesXML specification,
if a feature value is equal to any one of the <code>values</code> elements, then
the feature is considered to be successfully evaluated. The <code>caseSensitive</code>
attribute indicates whether the comparison between the feature value and
the <code>values</code> elements is case sensitive. The FESL fragment below
shows how to specify such a comparison:
</para> | |
<para role="Normal">&lt;enumFeatureValues caseSensitive="true"&gt;</para>
<para role="Normal">&lt;values&gt;red&lt;/values&gt;</para>
<para role="Normal">&lt;values&gt;green&lt;/values&gt;</para>
<para role="Normal">&lt;values&gt;blue&lt;/values&gt;</para>
<para role="Normal">&lt;/enumFeatureValues&gt;</para>
<para role="Normal"> | |
This fragment specifies a case sensitive comparison of a feature value to | |
a set of strings: <code>red</code>, <code>green</code> and <code>blue</code>. | |
</para> | |
<para role="Normal"> | |
Special processing occurs when there is only a single <code>values</code> element and its text
starts with <code>file://</code>, enabling the use of external dictionaries for
comparison. In this case, the text within the
<code>values</code>
element is treated as a URI. The contents of the file referenced by the
URI are loaded and used as the set of values against which the feature
value is tested. The file should contain one dictionary entry
per line; lines starting with the <code>#</code> character are treated as
comments and are not loaded. The dictionary handling is
implemented in org.apache.uima.tools.cfe.EnumeratedEntryDictionary. The default
implementation supports single token (whitespace separated) dictionary
entries. If a more sophisticated dictionary format is desired, then
either the constructor's parameters can be changed or the methods for
initializing and loading the dictionary from a file can be overridden.
</para> | |
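<para role="Normal">
The dictionary form might look like the sketch below. The file location
<code>/data/dictionaries/colors.txt</code> and the file contents are hypothetical
and chosen only for illustration:
</para>
<programlisting>&lt;enumFeatureValues caseSensitive="true"&gt;
  &lt;values&gt;file:///data/dictionaries/colors.txt&lt;/values&gt;
&lt;/enumFeatureValues&gt;</programlisting>
<para role="Normal">
The referenced file would then hold one single-token entry per line, with
comment lines marked by <code>#</code>:
</para>
<programlisting># color names accepted by the matcher
red
green
blue</programlisting>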
</section> | |
<section id="_ObjectPathFeatureValue"> | |
<title> | |
ObjectPathFeatureValuesXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: objectPath[1]: String</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-9.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
According to the ObjectPathFeatureValuesXML specification, the
<link linkend="_CFE_Basics">TA</link> | |
or | |
<link linkend="_CFE_Basics"> | |
<phrase role="Hyperlink1">FA</phrase> | |
</link> | |
itself (depending on whether this element is in | |
<link linkend="_TAM_and_FAM"> | |
<phrase role="Hyperlink1">TAM</phrase> | |
</link> | |
or in | |
<link linkend="_TAM_and_FAM"> | |
<phrase role="Hyperlink1">FAM</phrase> | |
</link>) | |
is tested for whether it is at the location defined by the <code>objectPath</code> attribute. This
ability to evaluate whether a feature belongs to some CAS object is
useful specifically in cases where a particular feature value is the
property of several different objects. For instance, this element can be
used when features from annotations should be extracted only if they are
attributes of other annotations. The FESL fragment below specifies a test
that checks if an object's full path is
<code>org.apache.uima.tools.cfe.sample.CarAnnotation:Wheels:toArray</code>. Such a test, for
instance, can be used to check if an instance of <code>WheelAnnotation</code>
belongs to an instance of <code>CarAnnotation</code>:
</para> | |
<para role="Normal"> | |
&lt;objectPathFeatureValues objectPath="org.apache.uima.tools.cfe.sample.CarAnnotation:Wheels:toArray"/&gt;
</para> | |
</section> | |
<section id="_PatternFeatureValuesXM"> | |
<title> | |
PatternFeatureValuesXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: pattern[1]: String</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-10.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
The PatternFeatureValuesXML element compares a feature value against the
regular expression given in the <code>pattern</code> attribute, which uses
Java regular expression syntax. The evaluation is successful if the value
matches the pattern.
</para> | |
<para role="Normal"> | |
The FESL fragment below defines a test that checks if a feature value | |
conforms to the hex number format: | |
</para> | |
<para role="Normal">&lt;patternFeatureValues pattern="(0[Xx][0-9A-Fa-f]+)" /&gt;</para>
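<para role="Normal">As an illustration, the same pattern can be tried out directly with <code>java.util.regex</code>. This is only a sketch: the class <code>HexPatternSketch</code> is hypothetical, and it assumes that the whole feature value has to match the pattern rather than just a part of it.</para>
<programlisting><![CDATA[
```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of CFE.
class HexPatternSketch {

    // The pattern used in the FESL fragment.
    static final Pattern HEX = Pattern.compile("(0[Xx][0-9A-Fa-f]+)");

    // Assumption: the entire value must match, not just a substring.
    static boolean isHex(String value) {
        return HEX.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHex("0x1F"));
        System.out.println(isHex("1F"));
        System.out.println(isHex("0xZZ"));
    }
}
```
]]></programlisting>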
</section> | |
<section id="_RangeFeatureValuesXML"> | |
<title> | |
RangeFeatureValuesXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: lowerBoundary[0..1]: Comparable: default 0</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: lowerBoundaryInclusive[0..1]: boolean: default false</para>
</listitem>
<listitem>
<para role="Normal">Attribute: upperBoundary[0..1]: Comparable: default 0</para>
</listitem>
<listitem>
<para role="Normal">Attribute: upperBoundaryInclusive[0..1]: boolean: default false</para>
</listitem> | |
</itemizedlist> | |
<mediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-11.jpg" align="center"/> | |
</imageobject> | |
</mediaobject> | |
<para role="Normal"> | |
According to the RangeFeatureValuesXML specification, the feature value is
tested to determine whether it is of a Comparable type and belongs to the
interval specified by the attributes <code>lowerBoundary</code> and
<code>upperBoundary</code>. The attributes <code>lowerBoundaryInclusive</code> and
<code>upperBoundaryInclusive</code> indicate whether the corresponding boundaries
are included in the range. The FESL fragment below specifies a test that
checks if a feature value is in the numeric range between 1 and 5, including 1
and excluding 5:
</para> | |
<para role="Normal"> | |
&lt;rangeFeatureValues lowerBoundary="1" lowerBoundaryInclusive="true" upperBoundary="5" /&gt;</para>
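<para role="Normal">The effect of the two inclusive flags can be sketched in Java as follows. The class <code>RangeSketch</code> and its <code>inRange</code> method are hypothetical names chosen for this example; the sketch only mirrors the interval semantics described above and is not taken from the CFE sources.</para>
<programlisting><![CDATA[
```java
// Hypothetical helper, not part of CFE.
class RangeSketch {

    // Tests whether value lies between lower and upper; each boundary
    // belongs to the range only if its "inclusive" flag is true.
    static <T extends Comparable<T>> boolean inRange(
            T value, T lower, boolean lowerInclusive, T upper, boolean upperInclusive) {
        int toLower = value.compareTo(lower);
        int toUpper = value.compareTo(upper);
        boolean aboveLower = lowerInclusive ? toLower >= 0 : toLower > 0;
        boolean belowUpper = upperInclusive ? toUpper <= 0 : toUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // Range between 1 and 5, including 1 and excluding 5.
        System.out.println(inRange(1.0, 1.0, true, 5.0, false));
        System.out.println(inRange(5.0, 1.0, true, 5.0, false));
    }
}
```
]]></programlisting>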
</section> | |
<section id="_SingleFeatureMatcherXML"> | |
<title> | |
SingleFeatureMatcherXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: featurePath[1]: String</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: featureTypeName[0..1]: String: no default value</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: exclude[0..1]: boolean: default false</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: quiet[0..1]: boolean: default false</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Element: featureValues one of: </para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">bitsetFeatureValues: BitsetFeatureValuesXML</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">enumFeatureValues: EnumFeatureValuesXML</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">objectPathFeatureValues: ObjectPathFeatureValuesXML</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">patternFeatureValues: PatternFeatureValuesXML</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">rangeFeatureValues: RangeFeatureValuesXML</para> | |
</listitem> | |
</itemizedlist> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-12.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
The <code>SingleFeatureMatcherXML</code> element defines rules for matching a feature value
against the <code>featureValues</code> element, which can be any one of the
elements in the bullet list above. The previous sections detailed the rules
for matching a feature value against each of these elements. To match a
single feature, the value of the feature denoted by the required
<code>featurePath</code> attribute is located first. For features that have
arrays in their <code>featurePath</code>, multiple values can be found. If one
or more values are found and the optional <code>featureTypeName</code> attribute
specifies a type name, every value found is tested to be of that type. If the
test is successful, the feature values are evaluated according to the
specification given in <code>featureValues</code>. After the evaluation, a
single feature is considered to be successfully evaluated if:
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
the <code>exclude</code> attribute is set to false and at least one
feature value matches the <code>featureValues</code> specification.
</para>
</listitem>
<listitem>
<para role="Normal">
the <code>exclude</code> attribute is set to true and none of the
feature values matches the <code>featureValues</code> specification.
</para> | |
</listitem> | |
</itemizedlist> | |
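<para role="Normal">The two conditions above can be restated as a short Java sketch. The class <code>SingleMatcherSketch</code> is hypothetical and the <code>Predicate</code> merely stands in for whatever <code>featureValues</code> specification is configured. How an empty list of values is treated is an assumption of this example.</para>
<programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not part of CFE.
class SingleMatcherSketch {

    // values:        feature values located via featurePath (several for arrays)
    // featureValues: stand-in for the configured featureValues specification
    // exclude:       value of the exclude attribute
    static <T> boolean evaluate(List<T> values, Predicate<T> featureValues, boolean exclude) {
        boolean anyMatched = values.stream().anyMatch(featureValues);
        // exclude=false: at least one value must match.
        // exclude=true:  no value may match (assumed true for an empty list).
        return exclude ? !anyMatched : anyMatched;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "black");
        Predicate<String> isRed = c -> c.equals("red");
        System.out.println(evaluate(colors, isRed, false));
        System.out.println(evaluate(colors, isRed, true));
    }
}
```
]]></programlisting>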
<para role="Normal"> | |
For <code>SingleFeatureMatcherXML</code> elements that are part of a TAM element, only
the evaluation of feature values is performed. If a <code>SingleFeatureMatcherXML</code>
element is part of a FAM, the feature value is output only if the
<code>quiet</code> attribute is set to false. If the <code>quiet</code> attribute is
set to true, then, even if the feature is matched, only the evaluation is
performed and no value is written to the final output. The <code>featurePath</code>
attribute uses the feature path notation explained earlier.
</para> | |
<para role="Normal"> | |
The FESL fragment below defines a test that checks if the value of the <code>Size</code>
attribute is in the range defined by the <code>rangeFeatureValues</code> element:
</para> | |
<para role="Normal">&lt;featureMatchers featurePath="Size" featureTypeName="java.lang.Float"&gt;</para>
<para role="Normal">&lt;rangeFeatureValues lowerBoundary="1.8" upperBoundaryInclusive="true" upperBoundary="3.0"/&gt;</para>
<para role="Normal">&lt;/featureMatchers&gt;</para>
<para role="Normal"> | |
In addition it is allowed to use the parent tag (see | |
<link linkend="_Parent_tag"> | |
<phrase role="Hyperlink1">Parent tag</phrase> | |
</link>) | |
in the <code>featurePath</code> attribute. A sample in the <code>PartialObjectMatcherXML</code>
section shows how to use the parent tag notation.
</para> | |
</section> | |
<section id="_GroupFeatureMatcherXML"> | |
<title> | |
GroupFeatureMatcherXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: exclude[0..1]: boolean: default false</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Element: featureMatchers[1..*]: SingleFeatureMatcherXML</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-13.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
This is a specification for matching a group of features. It can be applied | |
to both types of annotations, TAs and FAs. Each element in featureMatchers is | |
evaluated against either a TA or a FA annotation. The group is considered to | |
be matched if: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
the <code>exclude</code> attribute value is set to false and all elements in
<code>featureMatchers</code> have been successfully evaluated.
</para>
</listitem>
<listitem>
<para role="Normal">
the <code>exclude</code> attribute value is set to true and the evaluation of at
least one of the elements in <code>featureMatchers</code> is unsuccessful.
</para> | |
</listitem> | |
</itemizedlist> | |
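<para role="Normal">A minimal Java sketch of the group rule follows. The class <code>GroupMatcherSketch</code> is hypothetical; it takes the outcomes of the individual <code>featureMatchers</code> elements as given and only combines them as described above.</para>
<programlisting><![CDATA[
```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CFE.
class GroupMatcherSketch {

    // results: outcome of each featureMatchers element in the group
    // exclude: value of the group's exclude attribute
    static boolean evaluate(List<Boolean> results, boolean exclude) {
        boolean allSuccessful = !results.contains(Boolean.FALSE);
        // exclude=false: every matcher must succeed.
        // exclude=true:  at least one matcher must fail.
        return exclude ? !allSuccessful : allSuccessful;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(Arrays.asList(true, true), false));
        System.out.println(evaluate(Arrays.asList(true, false), false));
        System.out.println(evaluate(Arrays.asList(true, false), true));
    }
}
```
]]></programlisting>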
<para role="Normal"> | |
The FESL fragment below defines a group with the two features <code>Color</code> and
<code>Wheels:Size</code> to be matched. The entire group is successfully evaluated
if both features are matched. The first feature is successfully evaluated if
its value is one of the values listed by its <code>enumFeatureValues</code> element, and
the second feature is matched if its value is not in the set contained in its
<code>enumFeatureValues</code> element, as specified by its <code>exclude</code> attribute. Note
that if the optional attribute <code>featureTypeName</code> is omitted, a
feature value is assumed to be of a string type. Otherwise, the feature value's type
is checked to be the same as, or derived from, the type specified by the
<code>featureTypeName</code> attribute. Assuming the <code>groupFeatureMatchers</code> element is specified for
the <code>CarAnnotation</code> type, the test defined by the FESL fragment below is
successful if a car is either red, green or blue and it does not have 1 or 3
wheels:
</para> | |
<para role="Normal">&lt;groupFeatureMatchers&gt;</para>
<para role="Normal">    &lt;featureMatchers featurePath="Color" featureTypeName="java.lang.String"&gt;</para>
<para role="Normal">        &lt;enumFeatureValues caseSensitive="true"&gt;</para>
<para role="Normal">            &lt;values&gt;red&lt;/values&gt;</para>
<para role="Normal">            &lt;values&gt;green&lt;/values&gt;</para>
<para role="Normal">            &lt;values&gt;blue&lt;/values&gt;</para>
<para role="Normal">        &lt;/enumFeatureValues&gt;</para>
<para role="Normal">    &lt;/featureMatchers&gt;</para>
<para role="Normal">    &lt;featureMatchers featurePath="Wheels:Size" exclude="true"&gt;</para>
<para role="Normal">        &lt;enumFeatureValues caseSensitive="true"&gt;</para>
<para role="Normal">            &lt;values&gt;1&lt;/values&gt;</para>
<para role="Normal">            &lt;values&gt;3&lt;/values&gt;</para>
<para role="Normal">        &lt;/enumFeatureValues&gt;</para>
<para role="Normal">    &lt;/featureMatchers&gt;</para>
<para role="Normal">&lt;/groupFeatureMatchers&gt;</para>
</section> | |
<section id="_PartialObjectMatcherXML"> | |
<title> | |
PartialObjectMatcherXML | |
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: annotationTypeName[1]: String</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: fullPath[0..1]: String: no default value</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
Element: groupFeatureMatchers[0..*]: GroupFeatureMatcherXML | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-14.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
This is a base specification for an annotation matcher that searches for
annotations of the type specified by <code>annotationTypeName</code> that are located on the path
specified by <code>fullPath</code>. If <code>fullPath</code> is omitted or contains just the type
name of an annotation (the same as the <code>annotationTypeName</code> attribute), then all
instances of that type are considered for further feature value
evaluation. If <code>fullPath</code> contains a path to an object from an attribute of
a different object, then only instances of <code>annotationTypeName</code> that
are located on that path are considered for further evaluation. Once an
annotation is successfully evaluated to match a type/path, its features
are evaluated according to the specifications given in all elements of
<code>groupFeatureMatchers</code>. If the evaluation of any <code>groupFeatureMatchers</code> element is
successful, or if no <code>groupFeatureMatchers</code> element is given, then the annotation is
considered to be successfully evaluated. The <code>fullPath</code> attribute should be
specified using the syntax described in the
<link linkend="_Feature_path"> | |
<phrase role="Hyperlink2">feature path</phrase> | |
</link> | |
section above, with the exception that it cannot contain any parent tags.
For instance, a specification where the value of the <code>fullPath</code> attribute is
<code>CarAnnotation:Engine</code> and the value of <code>annotationTypeName</code> is
<code>EngineAnnotation</code> would address only engines that are car engines.
<code>PartialObjectMatcherXML</code> is used to specify search rules in TAM
specifications. To illustrate the use of the parent tag notation, consider
an example where it is required to identify engines of red, green or blue
cars that have a size of more than 1.8 l but not greater than 3.0 l.
Based on the class diagram in Figure 2, the FESL fragment below defines
the rules for the task. Note that the second feature matcher
uses the
<link linkend="_Parent_tag"> | |
<phrase role="Hyperlink2">parent tag</phrase> | |
</link> notation to access a value of the <code>CarAnnotation</code>'s attribute <code>Color</code>: | |
</para> | |
<para role="Normal">&lt;targetAnnotationMatcher annotationTypeName="EngineAnnotation" fullPath="CarAnnotation:EngineAnnotation" &gt;</para>
<para role="Normal">    &lt;groupFeatureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="Size" featureTypeName="java.lang.Float"&gt;</para>
<para role="Normal">            &lt;rangeFeatureValues lowerBoundary="1.8" upperBoundaryInclusive="true" upperBoundary="3.0"/&gt;</para>
<para role="Normal">        &lt;/featureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="__p0:Color" featureTypeName="java.lang.String"&gt;</para>
<para role="Normal">            &lt;enumFeatureValues caseSensitive="true"&gt;</para>
<para role="Normal">                &lt;values&gt;red&lt;/values&gt;</para>
<para role="Normal">                &lt;values&gt;green&lt;/values&gt;</para>
<para role="Normal">                &lt;values&gt;blue&lt;/values&gt;</para>
<para role="Normal">            &lt;/enumFeatureValues&gt;</para>
<para role="Normal">        &lt;/featureMatchers&gt;</para>
<para role="Normal">    &lt;/groupFeatureMatchers&gt;</para>
<para role="Normal">&lt;/targetAnnotationMatcher&gt;</para>
</section> | |
<section id="_FeatureObjectMatcherXML"> | |
<title> | |
FeatureObjectMatcherXML | |
</title> | |
<para role="Normal">extends <code>PartialObjectMatcherXML</code></para>
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: windowsizeLeft[0..1]: Integer: default 0</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: windowsizeInside[0..1]: Integer: default 0</para>
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: windowsizeRight[0..1]: Integer: default 0</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: windowsizeEnclosed[0..1]: Integer: default 0</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: windowFlags[0..1]: Integer: default 0</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: orientation[0..1]: boolean: default false</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: distance[0..1]: boolean: default false</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-15.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
The <code>FeatureObjectMatcherXML</code> element contains rules that specify how | |
<code>FeatureAnnotations</code> (FA) should be located and which features should be | |
extracted from them. It inherits its properties from | |
<code>PartialObjectMatcherXML</code>. In addition it has semantics for specifying: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">a size of a search window</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
a direction for the search relative to a corresponding Target Annotation (TA). | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal"> | |
This is done by using the integer attributes <code>windowsizeLeft</code>, <code>windowsizeInside</code>,
<code>windowsizeRight</code>, <code>windowsizeEnclosed</code> and the bitmask <code>windowFlags</code> attribute,
which define the FA search rules:
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">windowsizeLeft - a size of the search window to the left from TA</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">windowsizeRight - a size of the search window to the right from TA</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">windowsizeInside - a size of the search window within TA boundaries; if the value of this attribute is 1, then the TA is considered to be an FA at the same time</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">windowFlags - more precise criteria for the search window; the value of this attribute is a bitmask with a combination of the following values:</para>
<orderedlist numeration="loweralpha" spacing="normal"> | |
<listitem> | |
<para role="Normal">1 - FA starts to the left from the TA and ends to the left from the TA</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">2 - FA starts to the left from the TA and ends inside of TA boundaries</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">4 - FA starts to the left from the TA and ends to the right from the TA</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">8 - FA starts inside of the TA and ends inside of the TA boundaries</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">16 - FA starts inside of the TA boundaries and ends to the right from the TA</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">32 - FA starts to the right from the TA and ends to the right from the TA</para> | |
</listitem> | |
</orderedlist> | |
</listitem> | |
</itemizedlist> | |
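<para role="Normal">The six <code>windowFlags</code> values listed above can be illustrated with a small Java sketch that classifies an FA by where it starts and ends relative to a TA. The class <code>WindowFlagsSketch</code>, its constant names and the exact boundary comparisons (character offsets with exclusive end positions) are assumptions made for this example and may differ from what CFE does internally.</para>
<programlisting><![CDATA[
```java
// Hypothetical helper, not part of CFE.
class WindowFlagsSketch {

    static final int LEFT_LEFT = 1;      // starts left, ends left
    static final int LEFT_INSIDE = 2;    // starts left, ends inside
    static final int LEFT_RIGHT = 4;     // starts left, ends right
    static final int INSIDE_INSIDE = 8;  // starts inside, ends inside
    static final int INSIDE_RIGHT = 16;  // starts inside, ends right
    static final int RIGHT_RIGHT = 32;   // starts right, ends right

    // Returns the flag describing where the FA span [faBegin, faEnd) lies
    // relative to the TA span [taBegin, taEnd), or 0 if no listed case applies.
    static int flagFor(int faBegin, int faEnd, int taBegin, int taEnd) {
        boolean startsLeft = faBegin < taBegin;
        boolean startsRight = faBegin >= taEnd;
        boolean endsLeft = faEnd <= taBegin;
        boolean endsRight = faEnd > taEnd;
        if (startsLeft) {
            return endsLeft ? LEFT_LEFT : (endsRight ? LEFT_RIGHT : LEFT_INSIDE);
        }
        if (startsRight) {
            return endsRight ? RIGHT_RIGHT : 0;
        }
        return endsRight ? INSIDE_RIGHT : INSIDE_INSIDE;
    }

    // An FA is accepted if the bit for its position is set in windowFlags.
    static boolean accepts(int windowFlags, int faBegin, int faEnd, int taBegin, int taEnd) {
        return (windowFlags & flagFor(faBegin, faEnd, taBegin, taEnd)) != 0;
    }

    public static void main(String[] args) {
        int windowFlags = LEFT_LEFT | RIGHT_RIGHT; // 1 + 32 = 33
        System.out.println(accepts(windowFlags, 0, 3, 5, 10));
        System.out.println(accepts(windowFlags, 6, 8, 5, 10));
    }
}
```
]]></programlisting>
<para role="Normal">In this sketch a <code>windowFlags</code> value of 33 accepts only FAs that lie entirely to the left or entirely to the right of the TA.</para>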
<para role="Normal"> | |
The location of a FA is included in the generated output according to the
optional <code>orientation</code> and <code>distance</code> attributes. For example, if the values of
both of these attributes are set to true and the FA is the first annotation
of the required type to the left of the TA, then the generated feature value
starts with the prefix <code>L1</code>. If the values are set to false, then the
feature value's prefix is <code>X0</code>. This allows generating unique
feature names for model building and evaluation in machine learning.
</para> | |
<para role="Normal"> | |
<code>FeatureObjectMatcherXML</code> is used to specify search rules in FAM | |
specifications. | |
</para> | |
<para role="Normal"> | |
The FESL fragment below adds rules to the previous sample to extract the
number of cylinders from engines of cars whose wheel diameter is at
least 20.0":
</para> | |
<para role="Normal">&lt;targetAnnotationMatcher annotationTypeName="EngineAnnotation" fullPath="CarAnnotation:EngineAnnotation" &gt;</para>
<para role="Normal">    &lt;groupFeatureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="Size" featureTypeName="java.lang.Float"&gt;</para>
<para role="Normal">            &lt;rangeFeatureValues lowerBoundary="1.8" upperBoundaryInclusive="true" upperBoundary="3.0"/&gt;</para>
<para role="Normal">        &lt;/featureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="__p0:Color" featureTypeName="java.lang.String"&gt;</para>
<para role="Normal">            &lt;enumFeatureValues caseSensitive="true"&gt;</para>
<para role="Normal">                &lt;values&gt;red&lt;/values&gt;</para>
<para role="Normal">                &lt;values&gt;green&lt;/values&gt;</para>
<para role="Normal">                &lt;values&gt;blue&lt;/values&gt;</para>
<para role="Normal">            &lt;/enumFeatureValues&gt;</para>
<para role="Normal">        &lt;/featureMatchers&gt;</para>
<para role="Normal">    &lt;/groupFeatureMatchers&gt;</para>
<para role="Normal">&lt;/targetAnnotationMatcher&gt;</para>
<para role="Normal">&lt;featureAnnotationMatcher annotationTypeName="EngineAnnotation" fullPath="CarAnnotation:EngineAnnotation" windowsizeInside="1" &gt;</para>
<para role="Normal">    &lt;groupFeatureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="__p0:Wheels:toArray:Diameter" featureTypeName="java.lang.Float" quiet="true" &gt;</para>
<para role="Normal">            &lt;rangeFeatureValues lowerBoundary="20.0" lowerBoundaryInclusive="true"/&gt;</para>
<para role="Normal">        &lt;/featureMatchers&gt;</para>
<para role="Normal">        &lt;featureMatchers featurePath="Cylinders" featureTypeName="java.lang.Float" /&gt;</para>
<para role="Normal">    &lt;/groupFeatureMatchers&gt;</para>
<para role="Normal">&lt;/featureAnnotationMatcher&gt;</para>
</section> | |
<section id="_TargetAnnotationXML"> | |
<title> | |
TargetAnnotationXML
</title> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">Attribute: className[1]: String</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Attribute: enclosingAnnotation[1]: String</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">Element targetAnnotationMatcher[1..1]: PartialObjectMatcherXML</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
Element featureAnnotationMatchers[0..*]: FeatureObjectMatcherXML | |
</para> | |
</listitem> | |
</itemizedlist> | |
<para> | |
<inlinemediaobject> | |
<imageobject> | |
<imagedata fileref="../images/CFE_UG/CFE_UG-16.jpg" align="center"/> | |
</imageobject> | |
</inlinemediaobject> | |
</para> | |
<para role="Normal"> | |
This is a root specification for a class (group) of annotations whose
extracted instances are all assigned the same label (<code>className</code>) in the
final output. The label can be a literal string, a feature path in
curly brackets, or a combination of the two (e.g.
<code>SomeText_{__p0:SomeProperty}</code>). A feature path used in a class name
label must use the parent tag notation. In such a case the
parent tag refers to the TA specified by the <code>targetAnnotationMatcher</code>
element. Annotations that belong to the group are searched for within the span
of <code>enclosingAnnotation</code> according to the specification given in the
<code>targetAnnotationMatcher</code> (TAM), and features from matched annotations are
extracted according to the specifications given in <code>featureAnnotationMatchers</code>
(FAM). In general, the annotations that features are extracted from can
be different from the annotations that are matched during the search. This is
useful when extracting features for machine learning model building and
evaluation, where features are selected from annotations that are
located at a specific position relative to the annotation that satisfies
the search criteria, for instance, the POS tags of the 5 words to the left and
right of a specific word. Feature extraction proceeds, and the rules
specified by the corresponding FAMs are executed, only if an annotation is
successfully evaluated (matched) by a TAM.
</para> | |
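<para role="Normal">How a class name label that mixes literal text with feature paths in curly brackets could be expanded is sketched below in Java. The class <code>ClassNameSketch</code> and the <code>resolver</code> function, which stands in for looking up a feature path on the TA, are hypothetical; the sketch does not reproduce CFE's own implementation.</para>
<programlisting><![CDATA[
```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of CFE.
class ClassNameSketch {

    // Matches one feature path enclosed in curly brackets.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^}]*)\\}");

    // Replaces every {featurePath} in className with the value that the
    // resolver returns for that path; literal text is copied unchanged.
    static String expand(String className, Function<String, String> resolver) {
        Matcher m = PLACEHOLDER.matcher(className);
        StringBuffer label = new StringBuffer();
        while (m.find()) {
            String value = resolver.apply(m.group(1));
            m.appendReplacement(label, Matcher.quoteReplacement(value));
        }
        m.appendTail(label);
        return label.toString();
    }

    public static void main(String[] args) {
        // The resolver here returns a fixed, made-up value for any path.
        System.out.println(expand("SomeText_{__p0:SomeProperty}", path -> "someValue"));
    }
}
```
]]></programlisting>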
</section> | |
</section> | |
<section id="_Configuration_file_sample"> | |
<title> | |
Configuration file sample | |
</title> | |
<section id="_Task_definition"> | |
<title> | |
Task definition | |
</title> | |
<para role="Normal"> | |
The sample configuration file below has been created for extracting | |
features in order to build models for a machine learning application. The | |
type system for this sample defines several UIMA annotation types: | |
</para> | |
<itemizedlist mark="disc" spacing="normal">
<listitem>
<para role="Normal">org.apache.uima.tools.cfe.sample.Sentence - type that marks a sentence</para>
</listitem>
<listitem>
<para role="Normal">org.apache.uima.tools.cfe.sample.Token - type that marks a token with features:</para>
<itemizedlist mark="disc" spacing="normal">
<listitem>
<para role="Normal">pennTag: String - POS tag of a token</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para role="Normal">org.apache.uima.tools.cfe.sample.NamedEntity - named entity type with features:</para>
<itemizedlist mark="disc" spacing="normal">
<listitem>
<para role="Normal">Code: String - specific code assigned to a named entity</para>
</listitem>
<listitem>
<para role="Normal">SemanticClass: String - semantic class of a named entity</para>
</listitem>
<listitem>
<para role="Normal">Tokens: FSArray - array of org.apache.uima.tools.cfe.sample.Token annotations, ordered by their offset, that are included in the named entity</para>
</listitem>
</itemizedlist>
</listitem>
</itemizedlist>
<para role="Normal">The classification task is defined as follows:</para> | |
<orderedlist numeration="loweralpha" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
classify the first token of each named entity that has the semantic
class <code>Car Maker</code> with a class label that is a composite of
the string <code>CMBegin</code> and the value of the <code>Code</code> attribute of that
named entity
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
classify all other tokens of named entities of a semantic class | |
<code>Car Maker</code> with a class label that is a composite of the string | |
<code>CMInside</code> and a value of the <code>Code</code> property of that named entity | |
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal">classify all other tokens with a class label <code>Other_Token</code></para> | |
</listitem> | |
</orderedlist> | |
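<para role="Normal">The three labeling rules can be summarized in a short Java sketch. The class <code>LabelSketch</code> is hypothetical, and joining the prefix and the code with an underscore for the <code>CMInside</code> label is an assumption made by analogy with the <code>CMBegin_</code> prefix used in the sample configuration.</para>
<programlisting><![CDATA[
```java
// Hypothetical helper, not part of CFE.
class LabelSketch {

    // semanticClass: SemanticClass of the named entity containing the token,
    //                or null if the token is not part of a named entity
    // code:          Code attribute of that named entity
    // indexInEntity: position of the token in the entity's Tokens array
    static String label(String semanticClass, String code, int indexInEntity) {
        if (!"Car Maker".equals(semanticClass)) {
            return "Other_Token";
        }
        String prefix = (indexInEntity == 0) ? "CMBegin_" : "CMInside_";
        return prefix + code;
    }

    public static void main(String[] args) {
        // "C1" is a made-up code used only for this example.
        System.out.println(label("Car Maker", "C1", 0));
        System.out.println(label("Car Maker", "C1", 1));
        System.out.println(label(null, null, 0));
    }
}
```
]]></programlisting>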
<para role="Normal"> | |
To build a model for machine learning it is required to extract | |
features from surrounding tokens for all classes listed above. | |
In particular the following features are required to be extracted: | |
</para> | |
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">a string literal of the token to which the class label is assigned (<code>class token</code>)</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
a string literal of each token that is located within a window of 5
tokens from the <code>class token</code>, with the exception of prepositions (POS tag
is IN), conjunctions (CC), determiners (DT), punctuation (POS tag is not
defined - null) and numbers (CD)
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
all extracted features have to be made unique by their position information
relative to the location of the <code>class token</code>.
</para> | |
</listitem> | |
</itemizedlist> | |
</section> | |
<section id="_Implementation"> | |
<title> | |
Implementation | |
</title> | |
<para role="Normal">Line 1 - a standard XML declaration that defines the XML version of the document and its encoding</para> | |
<para role="Normal">Line 2, 87 - FESL root element that references the schema and defines global variables, such as nullValueImage (see | |
<link linkend="_Null_values"> | |
<phrase role="Hyperlink1">Null values</phrase> | |
</link>) | |
</para> | |
<para role="Normal">Line 3-32 - rules for extracting features for first tokens of named entities.</para> | |
<para role="Normal">Line 3 - extracted features for those tokens are assigned a composite label that includes the prefix <code>CMBegin_</code> plus the value of the <code>Code</code> attribute of the first element of the TA's path. The search for FAs is performed within the boundaries of the enclosing org.apache.uima.tools.cfe.sample.Sentence annotation</para>
<para role="Normal">Line 4-12 - TAM that defines rules for identifying the first TA</para>
<para role="Normal">Line 4 - defines TA's type (org.apache.uima.tools.cfe.sample.Token) and a full path to it (org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray[0]). According to this path notation, the CFE will:</para>
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal">search for annotations of type org.apache.uima.tools.cfe.sample.NamedEntity</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
for each annotation found, access the value of its attribute
Tokens and, if the value is not null, call the method toArray to
convert the value to an array
</para>
</listitem>
<listitem>
<para role="Normal">if the resulting array is not empty, consider its first element to be a TA</para>
</listitem> | |
</itemizedlist> | |
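<para role="Normal">The three steps above can be sketched in Java as follows. The nested class <code>NamedEntity</code> is a simplified stand-in for org.apache.uima.tools.cfe.sample.NamedEntity that keeps its tokens in a plain String array instead of an FSArray; it and <code>FullPathSketch</code> are hypothetical and only illustrate the order of the checks.</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CFE.
class FullPathSketch {

    // Simplified stand-in for the NamedEntity annotation type.
    static class NamedEntity {
        final String[] tokens; // models the Tokens attribute after toArray

        NamedEntity(String... tokens) {
            this.tokens = tokens;
        }
    }

    // Follows the path NamedEntity:Tokens:toArray[0] for every entity.
    static List<String> targetAnnotations(List<NamedEntity> entities) {
        List<String> targets = new ArrayList<String>();
        for (NamedEntity entity : entities) {
            String[] tokens = entity.tokens;
            // Entities without tokens contribute no TA.
            if (tokens == null || tokens.length == 0) {
                continue;
            }
            targets.add(tokens[0]);
        }
        return targets;
    }

    public static void main(String[] args) {
        List<NamedEntity> entities = Arrays.asList(
                new NamedEntity("Ford", "Motor"),
                new NamedEntity(),
                new NamedEntity((String[]) null));
        System.out.println(targetAnnotations(entities));
    }
}
```
]]></programlisting>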
<para role="Normal">Line 5-11 - defines rules for matching a group of features for TA</para> | |
<para role="Normal">Line 6-10 - defines rules for matching a feature for this group</para> | |
<para role="Normal">Line 6 - defines that the feature value is of the type
java.lang.String and has the feature path __p0:SemanticClass, which
translates to the value of the attribute SemanticClass of the first element of
the TA's path (org.apache.uima.tools.cfe.sample.NamedEntity)
</para> | |
<para role="Normal">Line 7-9 - defines an explicit list of values that the feature value should be in</para> | |
<para role="Normal">Line 8 - defines the value <code>Car Maker</code> as the only possible value for the feature </para> | |
<para role="Normal">Line 13-17 - FAM that defines rules for identifying first FA and its feature extraction</para> | |
<para role="Normal">Line 13 - defines FA's type to be org.apache.uima.tools.cfe.sample.Token; | |
the attribute windowsizeInside with the value 1 tells CFE to extract features from TA | |
itself (TA=FA) and setting orientation and distance attributes to true tells CFE to | |
include position information into the generated feature value | |
</para> | |
<para role="Normal">Line 14-16 - defines rules for matching a group of features for the first FA.</para> | |
<para role="Normal">Line 15 - defines rules for matching the only feature for | |
this group of the type java.lang.String and with feature path coveredText that | |
eventually will be translated by CFE to a method call of a org.apache.uima.tools.cfe.sample.Token | |
annotation object; according to this specification the feature value will be | |
unconditionally extracted | |
</para> | |
<para role="Normal">Line 18-31 - FAM that defines rules for identifying second type of FA and its feature extraction</para> | |
<para role="Normal">Line 18 - defines FA's type to be org.apache.uima.tools.cfe.sample.Token; | |
the attributes windowsizeLeft and windowsizeRight with the values 5 tell CFE | |
to extract features from 5 nearest annotations of this type to the left and | |
to the right from TA and having orientation and distance attributes set to | |
true tells CFE to include position information into the generated feature | |
value. | |
</para> | |
<para role="Normal">Lines 19-30 - define rules for matching a group of features for the second FA.</para>
<para role="Normal">Line 20 - defines rules for matching the first feature of
the group: it has the type java.lang.String and the feature path
coveredText, which CFE will eventually translate to a method call on an
org.apache.uima.tools.cfe.sample.Token annotation object; according to this
specification the feature value will be extracted unconditionally.
</para>
<para role="Normal">Lines 21-29 - define rules for matching the second feature of the group.</para>
<para role="Normal">Line 21 - defines rules for matching the second feature
of the group: it has the type java.lang.String and the feature path
pennTag, which CFE will eventually translate to a <code>getPennTag</code> method call
on an org.apache.uima.tools.cfe.sample.Token annotation object; according to this
specification the feature value will be evaluated against
<phrase role="Hyperlink1">enumFeatureValues</phrase>
and, because the exclude attribute is set to true:
</para>
<itemizedlist mark="disc" spacing="normal"> | |
<listitem> | |
<para role="Normal"> | |
if the evaluation is successful, the feature matcher will cause the
parent group to be unmatched, and since it is the only group in the
FAM, no output will be produced for this FA
</para> | |
</listitem> | |
<listitem> | |
<para role="Normal"> | |
if the evaluation is unsuccessful, this feature matcher will not affect
the matching status of the group, so the output for the FA will be generated, because
the first matcher of the group unconditionally produces output
</para> | |
</listitem> | |
</itemizedlist> | |
<para role="Normal">Because the
<phrase role="Hyperlink1">quiet</phrase>
attribute is set to true, the feature value extracted by the second
matcher will not be added to the output generated for this FA.</para>
<para role="Normal">Lines 22-28 - define an explicit list of values that the
value of the second feature is checked against.
</para>
<para role="Normal">Lines 23-27 - define <code>IN</code>, <code>CC</code>, <code>DT</code>, <code>CD</code> and <code>null</code>
as the possible values for the second feature; if the feature value is equal
to one of these values, evaluation of the enclosing feature matcher is
successful. If the feature value is null, it is converted to the
string defined by
<link linkend="_Null_values">
<phrase role="Hyperlink1">nullValueImage</phrase>
</link>
(<code>null</code>, as set in line 2 of this sample), and because <code>null</code> is one of the
list's elements, the evaluation succeeds.
</para>
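<para role="Normal">The interplay of exclude, quiet and the value list described above can be seen in one place in the fragment below, which repeats lines 21-29 of the sample. The XML comments are editorial annotations that summarize the description; they are not part of the sample configuration.</para>
<programlisting><![CDATA[<!-- exclude="true": if the pennTag value is in the list below, the parent
     group becomes unmatched and no output is produced for this FA.
     quiet="true": the pennTag value itself is never added to the output. -->
<tns:featureMatchers featurePath="pennTag"
    featureTypeName="java.lang.String"
    exclude="true" quiet="true">
  <tns:enumFeatureValues caseSensitive="true">
    <tns:values>IN</tns:values>
    <tns:values>CC</tns:values>
    <tns:values>DT</tns:values>
    <tns:values>CD</tns:values>
    <!-- matches a missing tag, because nullValueImage="null" (line 2) -->
    <tns:values>null</tns:values>
  </tns:enumFeatureValues>
</tns:featureMatchers>]]></programlisting>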
<para role="Normal">Lines 34-63 - rules for extracting features for all tokens
of named entities except the first. These rules are the same as the rules
defined for the first tokens of named entities (lines 3-32), with the following
exceptions:
</para>
<para role="Normal">Line 34 - specifies that TAs matched by these rules will
be assigned a composite label that consists of the prefix <code>CMInside_</code> followed by the
value of the <code>Code</code> attribute of the first element of the TA's path.
</para>
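<para role="Normal">As a small illustration of the composite label, the fragment below repeats the opening tag from line 34 of the sample. The XML comment is an editorial annotation, and the <code>Code</code> value <code>ABC</code> mentioned in it is purely hypothetical.</para>
<programlisting><![CDATA[<!-- {__p0:Code} stands for the value of the Code attribute of the first
     element of the TA's path. With a hypothetical Code value of ABC, the
     TA would be labeled CMInside_ABC. -->
<tns:targetAnnotations className="CMInside_{__p0:Code}"
    enclosingAnnotation="org.apache.uima.tools.cfe.sample.Sentence">
  <!-- TAM and FAMs (lines 35-62) -->
</tns:targetAnnotations>]]></programlisting>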
<para role="Normal">Line 35 - sets the fullPath attribute to
org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray, which can be
read as <code>any token of a named entity</code>; however, because of
<link linkend="_Implicit_TA_exclusion">
<phrase role="Hyperlink1">implicit TA exclusion</phrase>
</link>,
the TAs that were already matched as first tokens of named entities by the
rules of the previous TAM are not included in the set of TAs that will be
evaluated by the rules of this TAM.
</para>
<para role="Normal">Lines 65-86 - rules for extracting features for all tokens
other than tokens of named entities. These rules are the same as the rules
defined for the previous categories, with the following exceptions:
</para>
<para role="Normal">Line 65 - specifies that TAs matched by the enclosed
rules will be assigned the string label <code>Other_token</code>.
</para>
<para role="Normal">Line 66 - defines only the type of the TAs that should be
processed by the corresponding TAM, without a fullPath attribute. Such a
notation can be read as <code>all tokens</code>; however, because of
<link linkend="_Implicit_TA_exclusion">
<phrase role="Hyperlink1">implicit TA exclusion</phrase>
</link>,
the TAs that were matched as tokens of named entities by the rules
of the previous TAMs are not included in the set of TAs that
will be evaluated by the rules of this TAM. The effective meaning is therefore
<code>all tokens other than tokens of named entities</code>.
</para>
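<para role="Normal">To make the effect of implicit TA exclusion easier to follow, the fragment below places the three TAMs of the sample (lines 4, 35 and 66) next to each other. It is a condensed sketch: the enclosing tns:targetAnnotations elements and the nested feature matchers are omitted, and the XML comments are editorial annotations rather than part of the sample configuration.</para>
<programlisting><![CDATA[<!-- Line 4: only the first token of each named entity -->
<tns:targetAnnotationMatcher
    annotationTypeName="org.apache.uima.tools.cfe.sample.Token"
    fullPath="org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray[0]">
  <!-- group feature matchers (lines 5-11) -->
</tns:targetAnnotationMatcher>

<!-- Line 35: any token of a named entity; first tokens were already
     matched by the TAM above, so only the remaining tokens are evaluated -->
<tns:targetAnnotationMatcher
    annotationTypeName="org.apache.uima.tools.cfe.sample.Token"
    fullPath="org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray">
  <!-- group feature matchers (lines 36-42) -->
</tns:targetAnnotationMatcher>

<!-- Line 66: no fullPath, so all tokens; tokens of named entities were
     already matched above, so only the other tokens are evaluated -->
<tns:targetAnnotationMatcher
    annotationTypeName="org.apache.uima.tools.cfe.sample.Token"/>]]></programlisting>
<para role="Normal">The order of the TAMs matters in this arrangement: each TAM only sees the TAs that the preceding TAMs did not match.</para>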
<orderedlist numeration="arabic" spacing="compact"> | |
<?dbfo label-width="1.5em" ?> | |
<listitem> | |
<simpara role="Normal"><?xml version="1.0" encoding="UTF-8"?></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"><tns:CFEConfig nullValueImage="null" | |
xmlns:tns="http://www.apache.org/uima/cfe/config" | |
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | |
xsi:schemaLocation="http://www.apache.org/uima/cfe/config CFEConfig.xsd "> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotations className="CMBegin_{__p0:Code}" | |
enclosingAnnotation="org.apache.uima.tools.cfe.sample.Sentence"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotationMatcher | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" | |
fullPath="org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray[0]"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="__p0:SemanticClass" | |
featureTypeName="java.lang.String"></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>Car Maker</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:targetAnnotationMatcher></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers annotationTypeName= | |
"org.apache.uima.tools.cfe.sample.Token" windowsizeInside="1" | |
orientation="true" distance="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="coveredText" | |
featureTypeName="java.lang.String"/></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers annotationTypeName= | |
"org.apache.uima.tools.cfe.sample.Token" windowsizeLeft="5" | |
windowsizeRight="5" orientation="true" distance="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers | |
featurePath="coveredText" featureTypeName="java.lang.String"/> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="pennTag" | |
featureTypeName="java.lang.String" exclude="true" quiet="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:enumFeatureValues caseSensitive="true"></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>IN</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CC</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>DT</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CD</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>null</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara>
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:targetAnnotations></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"/> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotations className="CMInside_{__p0:Code}" | |
enclosingAnnotation="org.apache.uima.tools.cfe.sample.Sentence"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotationMatcher | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" | |
fullPath="org.apache.uima.tools.cfe.sample.NamedEntity:Tokens:toArray"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="__p0:SemanticClass" | |
featureTypeName="java.lang.String"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>Car Maker</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:targetAnnotationMatcher></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" | |
windowsizeInside="1" orientation="true" distance="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers | |
featurePath="coveredText" featureTypeName="java.lang.String"/> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" windowsizeLeft="5" | |
windowsizeRight="5" orientation="true" distance="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers | |
featurePath="coveredText" featureTypeName="java.lang.String"/> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers | |
featurePath="pennTag" featureTypeName="java.lang.String" exclude="true" quiet="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:enumFeatureValues caseSensitive="true"></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>IN</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CC</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>DT</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CD</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>null</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:targetAnnotations></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"/> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotations className="Other_token" | |
enclosingAnnotation="org.apache.uima.tools.cfe.sample.Sentence"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:targetAnnotationMatcher | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token"/> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" | |
windowsizeInside="1" orientation="true" distance="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="coveredText" | |
featureTypeName="java.lang.String"/></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureAnnotationMatchers | |
annotationTypeName="org.apache.uima.tools.cfe.sample.Token" | |
windowsizeLeft="5" windowsizeRight="5" orientation="true" distance="true">
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="coveredText" | |
featureTypeName="java.lang.String"/> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:featureMatchers featurePath="pennTag" | |
featureTypeName="java.lang.String" exclude="true" quiet="true"> | |
</simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:enumFeatureValues caseSensitive="true"></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>IN</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CC</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>DT</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>CD</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> <tns:values>null</tns:values></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:enumFeatureValues></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:groupFeatureMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:featureAnnotationMatchers></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"> </tns:targetAnnotations></simpara> | |
</listitem> | |
<listitem> | |
<simpara role="Normal"></tns:CFEConfig></simpara> | |
</listitem> | |
</orderedlist> | |
</section> | |
</section> | |
</chapter> | |
<chapter id="_Using_CFE_for_evaluation"> | |
<title> | |
Using CFE for evaluation | |
</title> | |
<para role="Normal">
Comparing the results produced by a pipeline of UIMA annotators to a
<code>gold standard</code>, or comparing the results of two different NLP systems, is a frequent
task. With CFE this task can be automated.
</para>
<para role="Normal">
The paper "CFE a system for testing, evaluation and machine learning of
UIMA based applications" by Sominsky, Coden and Tanenblatt describes the
evaluation process in detail.
</para>
</chapter> | |
</book> |