<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tools.textmarker/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.tm.overview">
<title>TextMarker Overview</title>
<para>
</para>
<section id="ugr.tools.tm.overview.intro">
<title>What is TextMarker?</title>
<para>
Apache UIMA&#8482; TextMarker is a rule-based script language supported by Eclipse-based tooling.
The language is designed to enable rapid development of text processing applications within UIMA.
A special focus lies on the intuitive and flexible domain-specific language for defining
patterns of annotations. Writing rules for information extraction or other text processing
applications is a tedious process. The Eclipse-based tooling for TextMarker, called the TextMarker Workbench,
was created to support the user and to facilitate every step when writing TextMarker rules.
Both the TextMarker rule language and the TextMarker Workbench integrate smoothly with Apache UIMA.
</para>
</section>
<section id="ugr.tools.tm.overview.gettingstarted">
<title>Getting started</title>
<para>
This section gives a short roadmap for reading the documentation and some recommendations on how to
start developing TextMarker-based applications. This documentation assumes that the user knows
the core concepts of Apache UIMA. Knowledge of the meaning and usage of at least the terms <quote>CAS</quote>,
<quote>Feature Structure</quote>, <quote>Annotation</quote>, <quote>Type</quote>, <quote>Type System</quote>
and <quote>Analysis Engine</quote> is required. Please refer to the documentation of Apache UIMA for an introduction.
</para>
<para>
Inexperienced users who want to learn about TextMarker can start with the next two sections:
<xref linkend="ugr.tools.tm.overview.coreconcepts"/>
gives a short overview of the core ideas and features of the TextMarker language and Workbench.
It introduces the main concepts of the TextMarker language, explains how TextMarker rules
are composed and applied, and discusses the advantages of the TextMarker system.
The following <xref linkend="ugr.tools.tm.overview.examples"/> approaches the TextMarker language from a different
perspective. Here, the language is introduced only with examples. The first example explains what a simple rule
looks like, and each following example extends the syntax or semantics of the TextMarker language.
After reading these two sections, the reader should know enough to start writing a first TextMarker-based application.
</para>
<para>
The TextMarker Workbench was created to support the user and to facilitate the development process. It is strongly recommended to
use this Eclipse-based IDE since it, for example, automatically configures the component descriptors and provides editing support like
syntax checking. <xref linkend="section.ugr.tools.tm.workbench.install"/> describes how the TextMarker Workbench is installed.
TextMarker rules can of course also be applied to a CAS without using the TextMarker Workbench.
<xref linkend="ugr.tools.tm.ae.basic.apply"/> contains examples of how to execute TextMarker rules in plain Java.
A good way to get started with TextMarker is to play around with an example TextMarker project, e.g.,
<uri>https://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/example-projects/ExampleProject</uri>. This TextMarker project
contains some simple rules for processing citation metadata.
</para>
<para>
<xref linkend="ugr.tools.tm.language.language"/> and <xref linkend="ugr.tools.tm.workbench"/> provide
more detailed descriptions and can be referred to in order to gain knowledge about specific parts
of the TextMarker language or the TextMarker Workbench.
</para>
</section>
<section id="ugr.tools.tm.overview.coreconcepts">
<title>Core Concepts</title>
<para>
The TextMarker language is an imperative rule language extended with scripting elements. A TextMarker rule defines a
pattern of annotations with additional conditions. If this pattern applies, then the actions of the rule are performed
on the matched annotations. A rule is composed of a sequence of rule elements, and a rule element essentially consists of four parts:
a matching condition, an optional quantifier, a list of conditions and a list of actions.
The matching condition is typically an annotation type; the rule element then matches on the covered text of an annotation of that type.
The quantifier specifies whether the rule element has to match at all and how often it may match.
The list of conditions specifies additional constraints that the matched text or annotations need to fulfill. The list of actions defines
the consequences of the rule; actions often create new annotations or modify existing ones.
Actions are only applied if all rule elements of the rule have successfully matched. Examples of TextMarker rules can be found in
<xref linkend="ugr.tools.tm.overview.examples"/>.
</para>
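<para>
To make the four parts concrete, the following rule is a minimal sketch with three rule elements; the syntax itself is explained step by step in
<xref linkend="ugr.tools.tm.overview.examples"/>. The type name <quote>PersonName</quote> is only assumed for illustration, and the types
<quote>CW</quote> and <quote>PERIOD</quote> are assumed to be available as initial annotations.
</para>
<programlisting><![CDATA[DECLARE PersonName;
// first rule element: matching condition CW, condition REGEXP("Dr"),
// action MARK(PersonName, 1, 3)
// second rule element: matching condition PERIOD with the quantifier "?"
// third rule element: matching condition CW only
CW{REGEXP("Dr") -> MARK(PersonName, 1, 3)} PERIOD? CW;]]></programlisting>
<para>
Under these assumptions, the rule is intended to match on text like <quote>Dr. Smith</quote> or <quote>Dr Smith</quote>. The action is only
performed once all three rule elements have matched, and the created annotation spans from the first to the third rule element.
</para>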
<para>
When TextMarker rules are applied to a document, or more precisely to a CAS, they are always grouped in a script file. However, a TextMarker
script file contains not only rules, but also other statements. First of all, each script file starts with a package declaration followed by
a list of optional imports. Then, common statements like rules, type declarations or blocks build the body and functionality of a script.
<xref linkend="ugr.tools.tm.ae.basic.apply"/> gives an example of how TextMarker scripts can be applied in plain Java.
TextMarker script files are naturally organized in TextMarker projects, which is a concept of the TextMarker Workbench.
The structure of a TextMarker project is described in <xref linkend="section.ugr.tools.tm.workbench.projects"/>.
</para>
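<para>
The layout of a script file described above can be sketched as follows. All names are placeholders borrowed from the examples in
<xref linkend="ugr.tools.tm.overview.examples"/>; the imported type system and script are assumed to exist and are not part of this sketch.
</para>
<programlisting><![CDATA[// package declaration: always the first statement
PACKAGE uima.textmarker.example;

// optional imports (assumed to be resolvable in the project)
TYPESYSTEM my.package.NamedEntityTypeSystem;
SCRIPT uima.textmarker.example.SecondaryScript;

// body: type declarations, rules and blocks
DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};
Document{-> CALL(SecondaryScript)};]]></programlisting>
<para>
The order shown here, package declaration, then imports, then the remaining statements, follows the description above; the concrete statements
in the body are only one possible combination.
</para>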
<para>
The inference of TextMarker rules, that is, the way the rules are applied, can be described as imperative, depth-first matching.
In contrast to similar rule-based systems, TextMarker rules are applied in the order they are defined in the script.
The imperative execution of the matching rules may have disadvantages, but it also has many advantages, such as faster development and
behavior that is easier to explain. The second main property of the TextMarker inference is the depth-first matching. When a rule matches on a pattern of annotations, then
an alternative is always tracked until it has matched or failed before the next alternative is considered. Therefore, the behavior of a rule may change if
it has already matched on an earlier alternative and thus performed an action that influences some constraints of the rule.
Examples of how TextMarker rules are applied are given in <xref linkend="ugr.tools.tm.overview.examples"/>.
</para>
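<para>
The effect of the imperative rule order can be illustrated with a small sketch. It reuses the types <quote>Animal</quote> and <quote>AnimalEnum</quote>
from the examples in <xref linkend="ugr.tools.tm.overview.examples"/> and assumes that no Animal annotations exist before the script is applied.
</para>
<programlisting><![CDATA[DECLARE Animal, AnimalEnum;
// applied first: creates the Animal annotations
W{REGEXP("dog|cat|bird") -> MARK(Animal)};
// applied second: relies on the Animal annotations created above
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
Since the rules are applied in the order in which they are defined, the second rule can only match because the first rule has already been applied.
If the two rules were swapped, the enumeration rule would be applied before any Animal annotation exists and, under the stated assumption,
would not create any AnimalEnum annotation.
</para>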
<para>
The TextMarker language makes it possible to approach an annotation problem in different ways. Let us distinguish
some approaches as an example.
It is common in the TextMarker language to create many annotations of different types. These annotations are probably not the targeted annotations of the domain,
but can be helpful to incrementally approximate the annotation of interest. This enables the user to work <quote>bottom-up</quote> and <quote>top-down</quote>.
In the former approach, the rules incrementally add more complex annotations using simple ones until the target annotation can be created.
In the latter approach, the rules get more and more specific while partitioning the document into smaller segments, which results in the targeted annotation in the end.
By using many <quote>helper</quote> annotations, the engineering task becomes easier and more comprehensible.
The TextMarker language provides distinctive language elements for different tasks. There are, for example, actions
that are able to create new annotations, actions that are able to remove annotations and actions that are able to modify the
offsets of an annotation. This enables, amongst other things, a transformation-based approach. The user starts by creating general rules that are able to
annotate most of the interesting text fragments. Then, instead of making these rules more complex by adding more conditions for situations where they fail,
additional rules are defined that correct the mistakes of the general rules, e.g., by deleting false positive annotations.
<xref linkend="ugr.tools.tm.overview.examples"/> provides some examples of how TextMarker rules can be engineered.
</para>
<para>
Manually writing rules is a tedious and error-prone process. The <link linkend="ugr.tools.tm.workbench">TextMarker Workbench</link>
was developed to facilitate writing rules by providing as much tooling support as possible. This includes, for example, syntax checking and auto completion, which
make the development less error-prone. The user can annotate documents and use these documents as unit tests for test-driven development or
quality maintenance. Sometimes, it is necessary to debug the rules because they did not match as expected. For this use case, the explanation perspective provides views
that explain every detail of the matching process. Finally, the TextMarker language can also be used by the tooling, for example, by the <quote>Query</quote> view.
Here, the user can use TextMarker rules as query statements in order to investigate annotated documents.
</para>
<para>
TextMarker integrates smoothly with Apache UIMA. First of all, the TextMarker rules are applied using a generic Analysis Engine, and thus TextMarker scripts can
easily be added to Apache UIMA pipelines. TextMarker also provides the functionality to import and use other UIMA components like Analysis Engines and Type Systems.
TextMarker rules can refer to every type defined in an imported type system, and the TextMarker Workbench generates a type system descriptor file containing all
types that were defined in a script file. Any Analysis Engine can be executed by rules as long as its implementation is available in the classpath. Therefore,
functionality outsourced to an arbitrary Analysis Engine can be added and used within TextMarker.
</para>
</section>
<section id="ugr.tools.tm.overview.examples">
<title>Learning by Example</title>
<para>
This section gives a short introduction to the TextMarker language by explaining the rule syntax
and inference with some simplified examples. It is recommended to use the TextMarker Workbench to write TextMarker rules
in order to gain advantages like syntax checking. A short description of how to install the TextMarker Workbench
is given <link linkend="section.ugr.tools.tm.workbench.install">here</link>. The following examples make use of the
annotations added by the default seeding of the TextMarker Analysis Engine. Their meaning is explained along with the examples.
</para>
<note><para>
The examples in this section are not valid script files as they are missing at least a package declaration.
In order to obtain a valid script file, please ensure that all used types are imported or declared and
that a package declaration like <quote>PACKAGE uima.textmarker.example;</quote> is added in the first line of the script.
</para></note>
<para>
The first example consists of a declaration of a type followed by a simple rule. A type declaration always starts with the keyword
<quote>DECLARE</quote> followed by the short name of the new type. The namespace of the type is equal to the package declaration of the script file.
There is also the possibility to create more complex types with features or specific parent types, but this will be neglected for now.
In the example, a simple annotation type with the short name <quote>Animal</quote> is defined.
After the declaration of the type, a rule with one rule element is given.
TextMarker rules in general can consist of a sequence of rule elements. Simple rule elements themselves consist of four parts: A matching condition,
an optional quantifier, an optional list of conditions and an optional list of actions. The rule element in the
following example has a matching condition <quote>W</quote>, an annotation type standing for normal words.
Statements like declarations and rules always end with a semicolon.
</para>
<programlisting><![CDATA[DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};]]></programlisting>
<para>
The rule element also contains one condition and one action, both surrounded by curly brackets. In order to distinguish conditions from actions,
they are separated by <quote>-></quote>. The condition <quote>REGEXP("dog")</quote> indicates that the matched
word must match the regular expression <quote>dog</quote>. If the matching condition and the additional regular expression are fulfilled, then the action
is executed, which creates a new annotation of the type <quote>Animal</quote> with the same offsets as the matched token.
The default seeder does not actually add annotations of the type <quote>W</quote>, but annotations of the types <quote>SW</quote> and
<quote>CW</quote> for lowercase words and capitalized words, which both have the parent type <quote>W</quote>.
</para>
<para>
Since it is tedious to create Animal annotations by matching on different regular expressions, we apply an external dictionary in the next example.
The first line defines a word list named <quote>AnimalsList</quote>, which is located in the resource folder (the file <quote>Animals.txt</quote>
contains one animal name in each line). After the declaration of the type, a rule uses this word list to find all occurrences of animals
in the complete document.
</para>
<programlisting><![CDATA[WORDLIST AnimalsList = 'Animals.txt';
DECLARE Animal;
Document{-> MARKFAST(Animal, AnimalsList)};
]]></programlisting>
<para>
The matching condition of the rule element refers to the complete document, or more specifically to the annotation of the type
<quote>DocumentAnnotation</quote> which covers the whole document.
The action <quote>MARKFAST</quote> of this rule element then creates an annotation of the type <quote>Animal</quote> for each found
entry of the dictionary <quote>AnimalsList</quote>.
</para>
<para>
The next example introduces rules with more than one rule element, one of which is a composed rule element. The following rule tries to
annotate occurrences of animals separated by commas, e.g., <quote>dog, cat, bird</quote>.
</para>
<programlisting><![CDATA[DECLARE AnimalEnum;
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
The rule consists of two rule elements, <quote>(Animal COMMA)+{-> MARK(AnimalEnum,1,2)}</quote> being the first rule element and
<quote>Animal</quote> the second one. Let's take a closer look at the first rule element. This rule element is actually composed of two normal rule elements,
namely <quote>Animal</quote> and <quote>COMMA</quote>, and contains a greedy quantifier and one action. Therefore, this rule element matches on
one Animal annotation and a following comma. This is repeated until one of the inner rule elements does not match anymore. Then, there has to be
another Animal annotation afterwards, specified by the second rule element of the rule. In this case, the rule matches and its action is executed:
The MARK action creates a new annotation of the type <quote>AnimalEnum</quote>. However, in contrast to the previous examples, this action also
contains two numbers. Those numbers refer to the rule elements that should be used to calculate the span of the created annotation. The numbers
<quote>1, 2</quote> state that the new annotation should start with the first rule element, the composed one, and should end with the second rule element.
</para>
<para>
Let's make the composed rule element a bit more complex. The following rule also matches on lists of animals that are
separated by semicolons. Therefore, a disjunctive rule element is added, indicated by the symbol <quote>|</quote>, which matches on
annotations of the type <quote>COMMA</quote> or <quote>SEMICOLON</quote>.
</para>
<programlisting><![CDATA[(Animal (COMMA | SEMICOLON))+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
Rule elements can of course contain more than one condition. The rule in the next example tries to identify headlines, which are bold,
underlined and end with a colon.
</para>
<programlisting><![CDATA[DECLARE Headline;
Paragraph{CONTAINS(Bold, 90, 100, true),
CONTAINS(Underlined, 90, 100, true), ENDSWITH(COLON)
-> MARK(Headline)};]]></programlisting>
<para>
The matching condition of this rule element is given with the type <quote>Paragraph</quote>, thus the rule takes a look at all Paragraph annotations.
The rule matches only if the three conditions, separated by commas, are fulfilled. The first condition <quote>CONTAINS(Bold, 90, 100, true)</quote> states that
90%-100% of the matched paragraph annotation should also be annotated with annotations of the type <quote>Bold</quote>. The boolean parameter <quote>true</quote>
indicates that the amount of Bold annotations should be calculated relative to the matched annotation. Therefore, the two numbers <quote>90,100</quote> are interpreted as
percentages. The exact calculation of the coverage is dependent on the tokenization of the document and is neglected for now. The second condition
<quote>CONTAINS(Underlined, 90, 100, true)</quote> consequently states that at least 90% of the paragraph should also be covered by annotations of the type <quote>Underlined</quote>.
The third condition <quote>ENDSWITH(COLON)</quote> finally forces the Paragraph annotation to end with a colon. It is only fulfilled, if there is an annotation of the type
<quote>COLON</quote>, whose end offset is equal to the end offset of the matched Paragraph annotation.
</para>
<para>
The readability and maintainability of rules do not improve as more and more conditions are added.
One of the strengths of the TextMarker language is that it always provides different approaches to solve an annotation task. The next two examples
introduce actions for transformation-based rules.
</para>
<programlisting><![CDATA[Headline{-CONTAINS(W) -> UNMARK(Headline)};]]></programlisting>
<para>
This rule consists of one condition and one action. The condition <quote>-CONTAINS(W)</quote> is negated (indicated by the character <quote>-</quote>),
and is therefore only fulfilled if there are no annotations of the type <quote>W</quote> within the bounds of the matched Headline annotation.
The action <quote>UNMARK(Headline)</quote> removes the matched Headline annotation. Put simply, headlines that contain no words at all are not headlines.
</para>
<para>
The next rule does not remove an annotation, but changes its offsets dependent on the context.
</para>
<programlisting><![CDATA[Headline{-> SHIFT(Headline, 1, 2)} COLON;]]></programlisting>
<para>
Here, the action <quote>SHIFT(Headline, 1, 2)</quote> expands the matched Headline annotation to the next colon if that Headline annotation
is followed by a COLON annotation.
</para>
<para>
TextMarker rules can of course contain arbitrary conditions and actions, which is illustrated by the next example.
</para>
<programlisting><![CDATA[DECLARE Month, Year, Date;
ANY{INLIST(MonthsList) -> MARK(Month), MARK(Date,1,3)}
PERIOD? NUM{REGEXP(".{2,4}") -> MARK(Year)};]]></programlisting>
<para>
This rule consists of three rule elements. The first one matches on any token whose covered text occurs in a word list named <quote>MonthsList</quote>.
The second rule element is optional and does not need to be fulfilled, which is indicated by the quantifier <quote>?</quote>. The last rule element matches
on numbers that fulfill the condition <quote>REGEXP(".{2,4}")</quote> and are therefore between two and four characters long.
If this rule successfully matches on a text passage, then its three actions are executed: An annotation of the type <quote>Month</quote> is created for the first rule element,
an annotation of the type <quote>Year</quote> is created for the last rule element and an annotation of the type <quote>Date</quote>
is created for the span of all three rule elements. If the word list contains the correct entries, then this rule matches on strings like
<quote>Dec. 2004</quote>, <quote>July 85</quote> or <quote>11.2008</quote> and creates the corresponding annotations.
</para>
<para>
After introducing the composition of rule elements, the default matching strategy is examined. The two rules in the next example just create an annotation
for a sequence of arbitrary tokens with the only difference of one condition.
</para>
<programlisting><![CDATA[DECLARE Text1, Text2;
ANY+{ -> MARK(Text1)};
ANY+{-PARTOF(Text2) -> MARK(Text2)};]]></programlisting>
<para>
The first rule matches on each occurrence of an arbitrary token and continues this until the end of the document is reached.
This is of course caused by the greedy quantifier <quote>+</quote>. Note that this rule considers each occurrence of a token and is therefore
executed for each token, resulting in many overlapping annotations. Let this behavior be illustrated with an example:
When applied to the document <quote>Peter works for Frank</quote>, the rule creates four annotations with the covered texts
<quote>Peter works for Frank</quote>, <quote>works for Frank</quote>, <quote>for Frank</quote> and <quote>Frank</quote>.
The rule first tries to match on the token <quote>Peter</quote> and continues its matching. Then, it tries to match on the token <quote>works</quote> and
continues its matching, and so on.
</para>
<para>
In this example, the second rule only returns one annotation, which covers the complete document. This is caused by the additional
condition <quote>-PARTOF(Text2)</quote>. The PARTOF condition is fulfilled if the matched annotation is located within an annotation of the given type, or
put in simple words, if the matched annotation is part of an annotation of the type <quote>Text2</quote>. When applied to the
document <quote>Peter works for Frank</quote>, the rule matches on the first token <quote>Peter</quote>, continues its match and
creates an annotation of the type <quote>Text2</quote> for the complete document. Then it tries to match on the second token <quote>works</quote>, but fails,
because this token is already part of a Text2 annotation.
</para>
<para>
TextMarker rules can not only be used to create or modify annotations, but also to create features for annotations. The next example defines
and assigns a relation about employment, by storing the given annotations as feature values.
</para>
<programlisting><![CDATA[DECLARE Annotation EmplRelation
(Employee employeeRef, Employer employerRef);
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation,
"employeeRef" = Employee, "employerRef" = Employer)};]]></programlisting>
<para>
The first statement of this example is a declaration that defines a new type of annotation named <quote>EmplRelation</quote>.
This annotation has two features:
One feature with the name <quote>employeeRef</quote> of the type <quote>Employee</quote> and
one feature with the name <quote>employerRef</quote> of the type <quote>Employer</quote>.
The second statement of this example, which is a simple rule, creates one annotation of the type <quote>EmplRelation</quote> for
each Sentence annotation that contains at least one annotation of the type EmploymentIndicator. In addition to creating an annotation,
the CREATE action also assigns an annotation of the type <quote>Employee</quote>, which needs to be located within the span of the matched sentence,
to the feature <quote>employeeRef</quote> and an Employer annotation to the feature <quote>employerRef</quote>. The annotations mentioned in this
example need of course to be present in advance.
</para>
<para>
In the last example, the values of features were defined as annotation types. However, primitive
types can also be used, as shown in the next example together with a short introduction to variables.
</para>
<programlisting><![CDATA[DECLARE Annotation MoneyAmount(STRING currency, INT amount);
INT moneyAmount;
STRING moneyCurrency;
NUM{PARSE(moneyAmount)} SPECIAL{REGEXP("€") -> MATCHEDTEXT(moneyCurrency),
CREATE(MoneyAmount, 1, 2, "amount" = moneyAmount,
"currency" = moneyCurrency)};]]></programlisting>
<para>
First, a new annotation with the name <quote>MoneyAmount</quote> and two features is defined, one string feature and one integer feature.
Then two TextMarker variables are declared, one integer variable and one string variable. The rule matches on a number, whose value is stored
in the variable <quote>moneyAmount</quote>, followed by a special token that needs to be equal to the string <quote>€</quote>. Then,
the covered text of the special annotation is stored in the string variable <quote>moneyCurrency</quote> and an annotation of the
type <quote>MoneyAmount</quote> spanning over both rule elements is created. Additionally, the variables are assigned as feature values.
</para>
<para>
TextMarker script files with many rules can quickly get confusing. Therefore, TextMarker allows importing other script files in order to increase
the modularity of a project or to create rule libraries. The next example imports the rules together with all known types of another script file
and calls that script file.
</para>
<programlisting><![CDATA[SCRIPT uima.textmarker.example.SecondaryScript;
Document{-> CALL(SecondaryScript)};]]></programlisting>
<para>
The script file with the name <quote>SecondaryScript.tm</quote> located in the package <quote>uima/textmarker/example</quote> is imported and executed
by the CALL action on the complete document. The script needs to be located in the folder specified by the parameter
<link linkend="ugr.tools.tm.ae.basic.parameter.scriptPaths">scriptPaths</link>. It is also possible to import script files of other TextMarker projects, e.g.,
by adapting the configuration parameters of the TextMarker Analysis Engine or
by setting a project reference in the project properties of a TextMarker project.
</para>
<para>
The types of important annotations of the application are often defined in a separate type system. The next example shows how to import those types.
</para>
<programlisting><![CDATA[TYPESYSTEM my.package.NamedEntityTypeSystem;
Person{PARTOF(Organization) -> UNMARK(Person)};
]]></programlisting>
<para>
The type system descriptor file with the name <quote>NamedEntityTypeSystem.xml</quote> located in the package <quote>my/package</quote> is imported.
The descriptor needs to be located in a folder specified by the parameter
<link linkend="ugr.tools.tm.ae.basic.parameter.descriptorPaths">descriptorPaths</link>.
</para>
<para>
It is sometimes easier to express some functionality with control structures known from programming languages rather than to engineer all functionality
only with matching rules. The TextMarker language provides the BLOCK element for some of those use cases.
The TextMarker BLOCK element starts with the keyword <quote>BLOCK</quote> followed by its name in parentheses. The name of a block has two purposes:
On the one hand, it is easier to distinguish blocks if they have different names, e.g., in the
<link linkend="section.ugr.tools.tm.workbench.explain_perspective">explain perspective</link> of the TextMarker Workbench. On the other hand,
the name can be used to execute this block using the CALL action. Hereby, it is possible to access only specific sets of rules of other script files,
or to implement a recursive call of rules. After the name of the block, a single rule element is given, which has curly brackets
even if no conditions or actions are specified. Then, the body of the block is also framed by curly brackets.
</para>
<programlisting><![CDATA[BLOCK(English) Document{FEATURE("language", "en")} {
// rules for English documents
}
BLOCK(German) Document{FEATURE("language", "de")} {
// rules for German documents
}]]></programlisting>
<para>
This example contains two simple BLOCK statements. The rules defined within a block are only executed if the condition in the head of the block is fulfilled.
Therefore, the rules of the first block are only considered if the feature <quote>language</quote> of the document annotation has the value <quote>en</quote>.
Accordingly, the rules of the second block are only considered for German documents.
</para>
<para>
The rule element of the block definition can also refer to other annotation types than <quote>Document</quote>. While the last example implemented something similar
to an if-statement, the next example provides a showcase for something similar to a for-each-statement.
</para>
<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
BLOCK(ForEach) Sentence{} {
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
}
]]></programlisting>
<para>
Here, the rule in the block statement is performed for each occurrence of an annotation of the type <quote>Sentence</quote>.
The rule within the block matches on the complete document, which is the current sentence in the context of the block statement.
As a consequence, this example creates an annotation of the type <quote>SentenceWithNoLeadingNP</quote> for each sentence
that does not start with an NP annotation.
</para>
<para>
Let's take a closer look at what exactly the TextMarker rules match. The following rule matches on a word followed by another word:
</para>
<programlisting><![CDATA[W W;]]></programlisting>
<para>
To be more precise, this rule matches on all documents like <quote>Apache UIMA</quote>, <quote> Apache UIMA </quote>, <quote>ApacheUIMA</quote>,
<quote><![CDATA[Apache <b>UIMA</b>]]></quote>. There are two main reasons for this: First of all, this depends on how the available annotations are defined. The default seeder
for the initial annotations creates an annotation for all characters until an upper-case character occurs. Thus, the string <quote>ApacheUIMA</quote> consists of
two tokens.
More importantly, however, the TextMarker language provides a concept of visibility of the annotations. By default, all annotations of the types
<quote>SPACE</quote>, <quote>NBSP</quote>, <quote>BREAK</quote> and <quote>MARKUP</quote> (whitespace and XML elements) are filtered and not visible. This holds of course for
their covered text too. The rule elements skip all positions of the
document where those annotations occur. Therefore, the rule in the last example matches on all examples. Without the default filtering settings,
with all annotations set to visible, the rule matches only on the document <quote>ApacheUIMA</quote> since it is the only one that contains two word annotations without
any whitespace between them.
</para>
<para>
The filtering setting can also be modified by the TextMarker rules themselves. The next example provides some rules that extend and limit
the amount of visible text of the document.
</para>
<programlisting><![CDATA[Sentence;
Document{-> RETAINTYPE(SPACE)};
Sentence;
Document{-> FILTERTYPE(CW)};
Sentence;
Document{-> RETAINTYPE, FILTERTYPE};]]></programlisting>
<para>
The first rule simply matches on sentences that do not start with any filtered type. Sentences that start with whitespace or markup,
for example, are not considered.
The next rule retains all text that is covered by annotations of the type <quote>SPACE</quote>, meaning
that the rule elements are now sensitive to whitespace. Therefore, the following rule will match on sentences that start with whitespace.
The third rule now filters the type <quote>CW</quote> with the consequence that all capitalized words are invisible.
If the following rule now wants to match on sentences, then this is only possible for Sentence annotations that do not start with a capitalized word.
The last rule finally resets the filtering setting to the default configuration in the TextMarker Analysis Engine.
</para>
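<para>
As a minimal sketch of the visibility concept described above, and assuming that <quote>RETAINTYPE</quote> accepts a list of types, the following rules make all types that are filtered by default visible before matching two consecutive words.
With these settings, the second rule should only match on a document like <quote>ApacheUIMA</quote>, since no whitespace or markup may occur between the two words.
</para>
<programlisting><![CDATA[// make whitespace and markup visible to the following rules
Document{-> RETAINTYPE(SPACE, NBSP, BREAK, MARKUP)};
// two word annotations without anything between them
W W;
// reset the filtering setting to the default configuration
Document{-> RETAINTYPE, FILTERTYPE};]]></programlisting>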
<para>
The next example showcases importing external Analysis Engines and modifying the documents by creating a new view named <quote>modified</quote>.
Additional Analysis Engines can be imported with the keyword <quote>ENGINE</quote> followed by the name of the descriptor. These imported Analysis Engines can be
executed with the actions <quote>CALL</quote> or <quote>EXEC</quote>. If the executed Analysis Engine adds, removes or modifies annotations, then their types need
to be mentioned when calling the descriptor, or else those annotations will not be correctly processed by the following TextMarker rules.
</para>
<programlisting><![CDATA[ENGINE utils.Modifier;
Date{-> DEL};
MoneyAmount{-> REPLACE("<MoneyAmount/>")};
Document{-> COLOR(Headline, "green")};
Document{-> EXEC(Modifier)};
]]></programlisting>
<para>
In this example, we first import an Analysis Engine defined by the descriptor <quote>Modifier.xml</quote> located in the folder <quote>utils</quote>.
The descriptor must, of course, be located in the folder specified by the parameter <link linkend="ugr.tools.tm.ae.basic.parameter.descriptorPaths">descriptorPaths</link>.
The first rule deletes all text covered by annotations of the type <quote>Date</quote>. The second rule replaces the text of all annotations of the type <quote>MoneyAmount</quote>
with the string <quote><![CDATA[<MoneyAmount/>]]></quote>. The third rule records that the background color of text in Headline annotations should be set to green. The last rule
finally performs all of these changes in an additional view called <quote>modified</quote>.
</para>
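<para>
If an imported Analysis Engine adds annotations that later rules should match on, their types are passed along when the engine is executed.
The following sketch assumes a hypothetical descriptor <quote>ne.PersonAnnotator</quote> and a hypothetical type system <quote>ne.PersonTypeSystem</quote> that defines the type <quote>Person</quote>.
The type list syntax is also an assumption and should be checked against the description of the EXEC action.
</para>
<programlisting><![CDATA[// hypothetical type system and descriptor, both resolved via descriptorPaths
TYPESYSTEM ne.PersonTypeSystem;
ENGINE ne.PersonAnnotator;
// execute the engine and mention the type of the annotations it adds
Document{-> EXEC(PersonAnnotator, {Person})};
// the new annotations are now available to the following rules
Person{-> REPLACE("Person")};]]></programlisting>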
</section>
<section id="ugr.tools.tm.ae">
<title>UIMA Analysis Engines</title>
<para>This section gives an overview of the UIMA Analysis Engines shipped with TextMarker. The most
important one is <quote>TextMarkerEngine</quote>, a generic analysis engine, which is able to interpret
and execute script files. The other analysis engines provide support for some additional functionality or
add certain types of annotations.
</para>
<section id="ugr.tools.tm.ae.basic">
<title>TextMarker Engine</title>
<para>
This generic Analysis Engine is the most important one for the TextMarker language since it is
responsible for applying the TextMarker rules on a CAS. Its functionality is configured by the configuration parameters,
which, for example, specify the rule file that should be executed. In the TextMarker IDE, a basic template named <quote>BasicEngine.xml</quote>
is given in the descriptor folder of a TextMarker project and correctly configured descriptors typically named <quote>MyScriptEngine.xml</quote>
are generated in the descriptor folder corresponding to the package namespace of the script file.
The available configuration parameters of the TextMarker Analysis Engine are described in the following.
</para>
<section id="ugr.tools.tm.ae.basic.apply">
<title>Apply TextMarker Analysis Engine in plain Java</title>
<para>
Let's assume that you wrote the TextMarker rules using the TextMarker Workbench, which already creates correctly configured descriptors.
In this case, you can simply use the following Java code to apply the TextMarker script.
</para>
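<programlisting><![CDATA[import java.io.File;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

// load the descriptor generated by the TextMarker Workbench
File specFile = new File("pathToMyWorkspace/MyProject/descriptor/"+
"my/package/MyScriptEngine.xml");
XMLInputSource in = new XMLInputSource(specFile);
ResourceSpecifier specifier = UIMAFramework.getXMLParser().
parseResourceSpecifier(in);
// for import by name... set the datapath in the ResourceManager
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
// create a CAS from the engine's metadata and apply the script
CAS cas = ae.newCAS();
cas.setDocumentText("This is my document.");
ae.process(cas);]]></programlisting>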
<note><para>
The TextMarker Analysis Engine utilizes type priorities. If the CAS object is
created by other means than the TextMarker Analysis Engine descriptor, then please
provide the necessary type priorities for a valid execution of the TextMarker rules.
</para></note>
<para>
If the TextMarker script was written, for example, with a common text editor and no configured descriptors are available yet,
then the following Java code can be used. As given, it is only applicable for executing single script files that do not import
additional components or scripts. Otherwise, the other parameters, e.g., <quote>additionalScripts</quote>, need to be configured correctly.
</para>
<programlisting><![CDATA[URL aedesc = TextMarkerEngine.class.getResource("BasicEngine.xml");
XMLInputSource inae = new XMLInputSource(aedesc);
ResourceSpecifier specifier = UIMAFramework.getXMLParser().
parseResourceSpecifier(inae);
ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
AnalysisEngineDescription aed = (AnalysisEngineDescription) specifier;
TypeSystemDescription basicTypeSystem = aed.getAnalysisEngineMetaData().
getTypeSystem();
Collection<TypeSystemDescription> tsds =
new ArrayList<TypeSystemDescription>();
tsds.add(basicTypeSystem);
// add some other type system descriptors
// that are needed by your script file
TypeSystemDescription mergeTypeSystems = CasCreationUtils.
mergeTypeSystems(tsds);
aed.getAnalysisEngineMetaData().setTypeSystem(mergeTypeSystems);
aed.resolveImports(resMgr);
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed,
resMgr, null);
File scriptFile = new File("path/to/file/MyScript.tm");
ae.setConfigParameterValue(TextMarkerEngine.SCRIPT_PATHS,
new String[] { scriptFile.getParentFile().getAbsolutePath() });
String name = scriptFile.getName().substring(0,
scriptFile.getName().length() - 3);
ae.setConfigParameterValue(TextMarkerEngine.MAIN_SCRIPT, name);
ae.reconfigure();
CAS cas = ae.newCAS();
cas.setDocumentText("This is my document.");
ae.process(cas);]]></programlisting>
</section>
<section id="ugr.tools.tm.ae.basic.parameter">
<title>Configuration Parameters</title>
<para>
The configuration parameters of the TextMarker Analysis Engine can be separated into three
different groups: parameters for the setup of the environment (<link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link>
to <link linkend='ugr.tools.tm.ae.basic.parameter.additionalExtensions'>additionalExtensions</link>),
parameters that change the behavior of the analysis engine (<link linkend='ugr.tools.tm.ae.basic.parameter.reloadScript'>reloadScript</link>
to <link linkend='ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>)
and parameters for creating additional information about how the rules were executed
(<link linkend='ugr.tools.tm.ae.basic.parameter.debug'>debug</link>
to <link linkend='ugr.tools.tm.ae.basic.parameter.createdBy'>createdBy</link>). First, a short overview of the configuration parameters is given in
<xref linkend='table.ugr.tools.tm.ae.parameter' />. Then all parameters are described in detail with examples.
</para>
<para>
To change the value of any configuration parameter within a TextMarker script, the CONFIGURE action (see <xref linkend='ugr.tools.tm.language.actions.configure' />)
can be used. For changing behaviour of <link linkend='ugr.tools.tm.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link> the DYNAMICANCHORING action
(see <xref linkend='ugr.tools.tm.language.actions.dynamicanchoring' />) is recommended.
</para>
<para>
<table id="table.ugr.tools.tm.ae.parameter" frame="all">
<title>Configuration parameters of the TextMarker Analysis Engine </title>
<tgroup cols="3" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1.2*" />
<colspec colname="c2" colwidth="2*" />
<colspec colname="c3" colwidth="0.8*" />
<thead>
<row>
<entry align="center">Name</entry>
<entry align="center">Short description</entry>
<entry align="center">Type</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link>
</entry>
<entry>Name with complete namespace of the script which will be interpreted and
executed by the analysis engine.
</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptEncoding'>scriptEncoding</link>
</entry>
<entry>Encoding of all TextMarker script files.</entry>
<entry>Single String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary script files like
the main script.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.descriptorPaths'>descriptorPaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary descriptor files
like type systems.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.resourcePaths'>resourcePaths</link>
</entry>
<entry>List of absolute locations, which contain the necessary resource files like
word lists.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalScripts'>additionalScripts</link>
</entry>
<entry>List of names with complete namespace of additional scripts, which can be
referred to.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngines'>additionalEngines</link>
</entry>
<entry>List of names with complete namespace of additional analysis engines, which
can be called by TextMarker rules.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngineLoaders'>additionalEngineLoaders</link>
</entry>
<entry>List of class names of implementations that are able to perform additional
tasks when loading external analysis engines.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalExtensions'>additionalExtensions</link>
</entry>
<entry>List of factory classes for additional extensions of the TextMarker language
like proprietary conditions.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.reloadScript'>reloadScript</link>
</entry>
<entry>Option to initialize the rule script each time the analysis engine processes
a CAS.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.seeders'>seeders</link>
</entry>
<entry>List of class names that provide additional annotations before the rules are
executed.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.defaultFilteredTypes'>defaultFilteredTypes</link>
</entry>
<entry>List of complete type names of annotations that are invisible by default.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.removeBasics'>removeBasics</link>
</entry>
<entry>Option to remove all inference annotations after execution of the rule script.
</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link>
</entry>
<entry>Option to allow rule matches to start at any rule element.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.lowMemoryProfile'>lowMemoryProfile</link>
</entry>
<entry>Option to decrease the memory consumption when processing a large CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>
</entry>
<entry>Option to activate a different inferencer for composed rule elements.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.debug'>debug</link>
</entry>
<entry>Option to add debug information to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.debugWithMatches'>debugWithMatches</link>
</entry>
<entry>Option to add information about the rule matches to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.debugOnlyFor'>debugOnlyFor</link>
</entry>
<entry>List of rule ids. If provided, then debug information is only created for
those rules.
</entry>
<entry>Multi String</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.profile'>profile</link>
</entry>
<entry>Option to add profile information to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.statistics'>statistics</link>
</entry>
<entry>Option to add statistics of conditions and actions to the CAS.</entry>
<entry>Single Boolean</entry>
</row>
<row>
<entry>
<link linkend='ugr.tools.tm.ae.basic.parameter.createdBy'>createdBy</link>
</entry>
<entry>Option to add additional information about which rule created an annotation.
</entry>
<entry>Single Boolean</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
<section id="ugr.tools.tm.ae.basic.parameter.mainScript">
<title>mainScript</title>
<para>
This parameter specifies the rule file that will be executed by the analysis engine and is
therefore one of the most important ones. The exact name of the script is given by the complete namespace of the file, which corresponds to its location
relative to the given parameter <link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link>.
The single names of packages (or folders) are separated by periods. An exemplary value for this parameter could be "org.apache.uima.Main",
where "Main" specifies the file containing the rules and "org.apache.uima" its package.
In this case, the analysis engine loads the script file "Main.tm", which is located in the folder structure "org/apache/uima/".
This parameter has no default value and has to be provided, although it is not specified as mandatory.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.scriptEncoding">
<title>scriptEncoding</title>
<para>
This parameter specifies the encoding of the rule files. Its default value is "UTF-8".
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.scriptPaths">
<title>scriptPaths</title>
<para>
The parameter scriptPaths refers to a list of String values, which specify the possible locations of script files.
The given locations are absolute paths. A typical value for this parameter is for example "C:/TextMarker/MyProject/script/".
If the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link> is set to org.apache.uima.Main,
then the absolute path of the script file has to be "C:/TextMarker/MyProject/script/org/apache/uima/Main.tm".
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
</para>
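<para>
The interplay of the two parameters can be sketched with the following fragment of an analysis engine descriptor. It uses the standard UIMA syntax for configuration parameter settings and the exemplary values from above.
The surrounding descriptor is omitted, and the path is only an illustration.
</para>
<programlisting><![CDATA[<configurationParameterSettings>
  <!-- script "Main.tm" in the package "org.apache.uima" -->
  <nameValuePair>
    <name>mainScript</name>
    <value>
      <string>org.apache.uima.Main</string>
    </value>
  </nameValuePair>
  <!-- absolute root folder(s) in which the package folders are searched -->
  <nameValuePair>
    <name>scriptPaths</name>
    <value>
      <array>
        <string>C:/TextMarker/MyProject/script/</string>
      </array>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>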
</section>
<section id="ugr.tools.tm.ae.basic.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter specifies the possible locations for descriptors like analysis engines or type systems, similar to the parameter
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link> for the script files. A typical value for this parameter
is for example "C:/TextMarker/MyProject/descriptor/".
The relative values of the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngines'>additionalEngines</link> are
resolved to these absolute locations.
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.resourcePaths">
<title>resourcePaths</title>
<para>
This parameter specifies the possible locations of additional resources like word lists or CSV tables. The string values have to contain absolute
locations, for example, "C:/TextMarker/MyProject/resources/".
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.additionalScripts">
<title>additionalScripts</title>
<para>
The parameter additionalScripts is defined as a list of string values and contains script files, which are additionally loaded by the analysis engine. These script files are specified by their
complete namespace, exactly like the value of the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link>
and can be referred to by language elements, e.g., by executing the contained rules. An exemplary value of this parameter is "org.apache.uima.SecondaryScript". In this example, the main script could import
this script file by the declaration "SCRIPT org.apache.uima.SecondaryScript;" and then could execute it with the rule
"Document{-> CALL(SecondaryScript)};".
</para>
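<para>
For the exemplary value above, the corresponding entry in the descriptor might look like the following fragment. It is a sketch in the standard UIMA syntax for multi-valued string parameters, and the remaining settings of the descriptor are omitted.
</para>
<programlisting><![CDATA[<nameValuePair>
  <name>additionalScripts</name>
  <value>
    <array>
      <!-- complete namespace, resolved relative to scriptPaths -->
      <string>org.apache.uima.SecondaryScript</string>
    </array>
  </value>
</nameValuePair>]]></programlisting>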
</section>
<section id="ugr.tools.tm.ae.basic.parameter.additionalEngines">
<title>additionalEngines</title>
<para>
This parameter contains a list of additional analysis engines, which can be executed by the TextMarker rules. The single values
are given by the name of the analysis engine with their complete namespace and have to be located relative to one value of the parameter
<link linkend='ugr.tools.tm.ae.basic.parameter.descriptorPaths'>descriptorPaths</link>, the location, where the analysis engine searches for the descriptor file.
An example for one value of the parameter is "utils.HtmlAnnotator", which points to the descriptor "HtmlAnnotator.xml" in the folder "utils".
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.additionalEngineLoaders">
<title>additionalEngineLoaders</title>
<para>
The parameter "additionalEngineLoaders" specifies a list of optional implementations of the interface
"org.apache.uima.textmarker.extensions.IEngineLoader", which can be used for application-specific configurations of
additional analysis engines.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.additionalExtensions">
<title>additionalExtensions</title>
<para>
This parameter specifies optional extensions of the TextMarker language. The elements of the string list must implement the interface
"org.apache.uima.textmarker.extensions.ITextMarkerExtension". With those extensions, application-specific conditions and actions can be
added to the set of provided ones.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.reloadScript">
<title>reloadScript</title>
<para>
This boolean parameter indicates whether the script or resource files should be reloaded when processing a CAS. The default value is set to false.
In this case, the script files are loaded when the analysis engine is initialized. If script files or resource files are extended while a collection of documents is processed, e.g., a dictionary is still being filled,
then the parameter needs to be set to true in order to include the changes.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.seeders">
<title>seeders</title>
<para>
This list of string values refers to implementations of the interface "org.apache.uima.textmarker.seed.TextMarkerAnnotationSeeder",
which can be used to automatically add annotations to the CAS. The default value of the parameter is a single seeder, namely "org.apache.uima.textmarker.seed.DefaultSeeder"
that adds annotations for token classes like CW, MARKUP or SEMICOLON. Remember that additional annotations can also be added with
an additional engine that is executed by a TextMarker rule.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.defaultFilteredTypes">
<title>defaultFilteredTypes</title>
<para>
This parameter specifies a list of types, which are filtered by default when executing a script file. Using the default values of this parameter,
whitespaces, line breaks and markup elements are not visible to TextMarker rules. The visibility of annotations and therefore the covered text can be changed
using the actions <link linkend='ugr.tools.tm.language.actions.filtertype'>FILTERTYPE</link> and
<link linkend='ugr.tools.tm.language.actions.retaintype'>RETAINTYPE</link>.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.removeBasics">
<title>removeBasics</title>
<para>
This parameter specifies whether the inference annotations created by the analysis engine should be removed after processing the CAS.
The default value is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.dynamicAnchoring">
<title>dynamicAnchoring</title>
<para>
If this parameter is set to true, then the TextMarker rules are not forced to start to match with the first rule element.
Rather, the rule element referring to the rarest type is chosen. Therefore, this option can be utilized to optimize the performance.
Please note that the matching result can vary in some cases when greedy rule elements are applied.
The default value is set to false.
</para>
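<para>
Within a script, the behavior can also be switched for a group of rules with the DYNAMICANCHORING action. The following lines are only a sketch:
the rule in the middle is an arbitrary example, and whether the matching actually starts with its second rule element depends on how rare the types are in the document.
</para>
<programlisting><![CDATA[// allow the following rules to start matching at any rule element
Document{-> DYNAMICANCHORING(true)};
// may now be anchored at PERIOD if it is rarer than W
W PERIOD;
// restore the default behavior
Document{-> DYNAMICANCHORING(false)};]]></programlisting>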
</section>
<section id="ugr.tools.tm.ae.basic.parameter.lowMemoryProfile">
<title>lowMemoryProfile</title>
<para>
This parameter specifies whether the memory consumption should be reduced. This parameter should be set to true for
very large CAS documents (e.g., > 500k tokens), but it also reduces the performance. The default value is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed">
<title>simpleGreedyForComposed</title>
<para>
This parameter specifies whether a different inference strategy for composed rule elements should be applied. This option is only necessary
if the composed rule element is expected to match very often, e.g., a rule element like (ANY ANY).
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.debug">
<title>debug</title>
<para>
If this parameter is set to true, then additional information about the execution of a rule script is added to the CAS.
The actual information is specified by the following parameters.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.debugWithMatches">
<title>debugWithMatches</title>
<para>
This parameter specifies whether the match information (covered text) of the rules should be stored in the CAS.
The default value of this parameter is set to false.
</para>
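<para>
A descriptor fragment that activates the debug information together with the match information could look as follows. This is a sketch in the standard UIMA syntax for boolean parameter settings.
The surrounding <quote>configurationParameterSettings</quote> element is omitted.
</para>
<programlisting><![CDATA[<!-- add information about the execution of the script to the CAS -->
<nameValuePair>
  <name>debug</name>
  <value>
    <boolean>true</boolean>
  </value>
</nameValuePair>
<!-- additionally store the covered text of the rule matches -->
<nameValuePair>
  <name>debugWithMatches</name>
  <value>
    <boolean>true</boolean>
  </value>
</nameValuePair>]]></programlisting>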
</section>
<section id="ugr.tools.tm.ae.basic.parameter.debugOnlyFor">
<title>debugOnlyFor</title>
<para>
This parameter specifies a list of rule ids that enumerate the rules for which debug information should be created.
No specific ids are given by default.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.profile">
<title>profile</title>
<para>
If this parameter is set to true, then additional information about the runtime of applied rules is added to the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.statistics">
<title>statistics</title>
<para>
If this parameter is set to true, then additional information about the runtime of TextMarker language elements like conditions and actions
is added to the CAS.
The default value of this parameter is set to false.
</para>
</section>
<section id="ugr.tools.tm.ae.basic.parameter.createdBy">
<title>createdBy</title>
<para>
If this parameter is set to true, then additional information is added to the CAS about what annotation was created by which rule.
The default value of this parameter is set to false.
</para>
</section>
</section>
</section>
<section id="ugr.tools.tm.ae.annotationwriter">
<title>Annotation Writer</title>
<para>
This Analysis Engine can be utilized to write the covered text of annotations to a text file, where each covered text is put on a new line.
If the Analysis Engine, for example, is configured for the type uima.example.Person, then the covered texts of all person annotations are stored
in a text file, one person in each line.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project.
</para>
<section id="ugr.tools.tm.ae.annotationwriter.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.tm.ae.annotationwriter.parameter.output">
<title>Output</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.txt</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds
the SourceDocumentInformation annotation when the user launches a script file. The default value of this parameter is <quote>/../output/</quote>.
</para>
</section>
<section id="ugr.tools.tm.ae.annotationwriter.parameter.encoding">
<title>Encoding</title>
<para>
This string parameter specifies the encoding of the resulting file. The default value of this parameter is <quote>UTF-8</quote>.
</para>
</section>
<section id="ugr.tools.tm.ae.annotationwriter.parameter.type">
<title>Type</title>
<para>
Only the covered texts of annotations of the type specified with this parameter are stored in the resulting file.
The default value of this parameter is <quote>uima.tcas.DocumentAnnotation</quote>, which will store the complete document in a new file.
</para>
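<para>
Taken together, the three parameters could be set as in the following descriptor fragment. It is only a sketch: the parameter names are assumed to be spelled exactly like the section titles above,
and the type <quote>uima.example.Person</quote> is the exemplary type from the introduction of this Analysis Engine.
</para>
<programlisting><![CDATA[<configurationParameterSettings>
  <!-- interpreted relative to the source document, if known -->
  <nameValuePair>
    <name>Output</name>
    <value>
      <string>/../output/</string>
    </value>
  </nameValuePair>
  <nameValuePair>
    <name>Encoding</name>
    <value>
      <string>UTF-8</string>
    </value>
  </nameValuePair>
  <!-- only covered texts of this type are written, one per line -->
  <nameValuePair>
    <name>Type</name>
    <value>
      <string>uima.example.Person</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>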
</section>
</section>
</section>
<section id="ugr.tools.tm.ae.plaintext">
<title>Plain Text Annotator</title>
<para>
This Analysis Engine adds annotations for lines and paragraphs.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. There are no configuration parameters.
</para>
</section>
<section id="ugr.tools.tm.ae.modifier">
<title>Modifier</title>
<para>
The Modifier Analysis Engine can be used to create an additional view <quote>modified</quote>, which contains all textual modifications and HTML highlightings that
were specified by the executed rules. Therefore, this Analysis Engine can be applied, e.g.,
for anonymization where all annotations of persons are replaced by the string <quote>Person</quote>.
Furthermore, the content of the new view can optionally be stored in a new HTML file.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project.
</para>
<section id="ugr.tools.tm.ae.modifier.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.tm.ae.modifier.parameter.styleMap">
<title>styleMap</title>
<para>
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for
additional highlightings in the modified view.
</para>
</section>
<section id="ugr.tools.tm.ae.modifier.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter can contain multiple string values and specifies the absolute paths where the style map file can be found.
</para>
</section>
<section id="ugr.tools.tm.ae.modifier.parameter.outputLocation">
<title>outputLocation</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.modified.html</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds
the SourceDocumentInformation annotation when the user launches a script file. The default value of this parameter is <quote>/../</quote>.
</para>
</section>
</section>
</section>
<section id="ugr.tools.tm.ae.html">
<title>HTML Annotator</title>
<para>
This Analysis Engine provides support for HTML files by adding annotations for the HTML elements. Using the default values, the HTML Annotator creates annotations
for each HTML element spanning the content of the element, where the most common elements are represented by their own types.
The document <quote><![CDATA[This text is <b>bold</b>.]]></quote>, for example, would be annotated with an annotation of the type
<quote>org.apache.uima.textmarker.type.html.B</quote> for the word <quote>bold</quote>. The HTML annotator can be configured
in order to include the start and end element in the created annotations. Additionally, the Analysis Engine is also able to strip the HTML elements
while retaining the HTML annotations. Thereby, an HTML document can be converted to a plain text document, which contains the annotations about the HTML layout.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project.
</para>
<section id="ugr.tools.tm.ae.html.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.tm.ae.html.parameter.plainTextOutput">
<title>plainTextOutput</title>
<para>
This parameter specifies whether a new document without the HTML elements should be created. The default value is <quote>false</quote>.
</para>
</section>
<section id="ugr.tools.tm.ae.html.parameter.outputViewName">
<title>outputViewName</title>
<para>
This parameter specifies in which view the optional new document without HTML elements should be stored.
</para>
</section>
<section id="ugr.tools.tm.ae.html.parameter.onlyContent">
<title>onlyContent</title>
<para>
This parameter specifies whether created annotations should cover only the content of the HTML elements or also their start and end element.
The default value is <quote>true</quote>.
</para>
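<para>
The parameter can also be changed from within a TextMarker script before the HTML Annotator is executed. The following sketch assumes that the descriptor <quote>utils/HtmlAnnotator.xml</quote> can be found
via the parameter descriptorPaths and that the CONFIGURE action takes name-value pairs in the form shown here.
</para>
<programlisting><![CDATA[ENGINE utils.HtmlAnnotator;
// let the created annotations also cover the start and end elements
Document{-> CONFIGURE(HtmlAnnotator, "onlyContent" = false)};
// apply the reconfigured HTML Annotator to the document
Document{-> EXEC(HtmlAnnotator)};]]></programlisting>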
</section>
</section>
</section>
<section id="ugr.tools.tm.ae.stylemap">
<title>Style Map Creator</title>
<para>
This Analysis Engine can be utilized to create style map information, which is needed by the Modifier Analysis Engine in order to create
highlightings for some annotations.
Style map information can be created using the <link linkend='ugr.tools.tm.language.actions.color'>COLOR</link> action.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project.
</para>
<section id="ugr.tools.tm.ae.stylemap.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.tm.ae.stylemap.parameter.styleMap">
<title>styleMap</title>
<para>
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for
additional highlightings in the modified view.
</para>
</section>
<section id="ugr.tools.tm.ae.stylemap.parameter.descriptorPaths">
<title>descriptorPaths</title>
<para>
This parameter can contain multiple string values and specifies the absolute paths where the style map file can be found.
</para>
</section>
</section>
</section>
<section id="ugr.tools.tm.ae.xmi">
<title>XMI Writer</title>
<para>
This Analysis Engine is able to serialize the processed CAS to an XMI file. One use case for the XMI Writer is, for example, a rule-based sort,
which stores the processed XMI files in different folders, depending on the execution of the rules, e.g., whether a pattern of annotations occurs or not.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project.
</para>
<section id="ugr.tools.tm.ae.xmi.parameter">
<title>Configuration Parameters</title>
<para>
</para>
<section id="ugr.tools.tm.ae.xmi.parameter.output">
<title>Output</title>
<para>
This string parameter specifies the absolute path of the resulting file named <quote>output.xmi</quote>. However, if an annotation of the
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds
the SourceDocumentInformation annotation when the user launches a script file.
The default value is <quote>/../output/</quote>.
</para>
</section>
</section>
</section>
</section>
</chapter>