<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" | |
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ | |
<!ENTITY imgroot "images/tools/tools.textmarker/" > | |
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > | |
%uimaents; | |
]> | |
<!-- | |
Licensed to the Apache Software Foundation (ASF) under one | |
or more contributor license agreements. See the NOTICE file | |
distributed with this work for additional information | |
regarding copyright ownership. The ASF licenses this file | |
to you under the Apache License, Version 2.0 (the | |
"License"); you may not use this file except in compliance | |
with the License. You may obtain a copy of the License at | |
http://www.apache.org/licenses/LICENSE-2.0 | |
Unless required by applicable law or agreed to in writing, | |
software distributed under the License is distributed on an | |
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | |
KIND, either express or implied. See the License for the | |
specific language governing permissions and limitations | |
under the License. | |
--> | |
<chapter id="ugr.tools.tm.overview"> | |
<title>TextMarker Overview</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.overview.intro"> | |
<title>What is TextMarker?</title> | |
<para> | |
Apache UIMA™ TextMarker is a rule-based script language supported by Eclipse-based tooling.
The language is designed to enable rapid development of text processing applications within UIMA.
A special focus lies on the intuitive and flexible domain-specific language for defining
patterns of annotations. Writing rules for information extraction or other text processing
applications is a tedious process. The Eclipse-based tooling for TextMarker, called the TextMarker Workbench,
was created to support the user and to facilitate every step of writing TextMarker rules.
Both the TextMarker rule language and the TextMarker Workbench integrate smoothly with Apache UIMA.
</para> | |
</section> | |
<section id="ugr.tools.tm.overview.gettingstarted"> | |
<title>Getting started</title> | |
<para> | |
This section gives a short roadmap for reading the documentation and some recommendations on how to
start developing TextMarker-based applications. This documentation assumes that the user is familiar with
the core concepts of Apache UIMA. Knowledge of the meaning and usage of at least the terms <quote>CAS</quote>,
<quote>Feature Structure</quote>, <quote>Annotation</quote>, <quote>Type</quote>, <quote>Type System</quote>
and <quote>Analysis Engine</quote> is required. Please refer to the documentation of Apache UIMA for an introduction.
</para> | |
<para> | |
Inexperienced users who want to learn about TextMarker can start with the next two sections:
<xref linkend="ugr.tools.tm.overview.coreconcepts"/>
gives a short overview of the core ideas and features of the TextMarker language and Workbench.
It introduces the main concepts of the TextMarker language, explains how TextMarker rules
are composed and applied, and discusses the advantages of the TextMarker system.
The following <xref linkend="ugr.tools.tm.overview.examples"/> approaches the TextMarker language from a different
perspective. Here, the language is introduced only with examples. The first example explains what a simple rule
looks like, and each following example extends the syntax or semantics of the TextMarker language.
After reading these two sections, the reader should know enough to start writing a first TextMarker-based application.
</para> | |
<para> | |
The TextMarker Workbench was created to support the user and to facilitate the development process. It is strongly recommended to
use this Eclipse-based IDE since it, for example, automatically configures the component descriptors and provides editing support like
syntax checking. <xref linkend="section.ugr.tools.tm.workbench.install"/> describes how the TextMarker Workbench is installed.
TextMarker rules can also be applied to a CAS without using the TextMarker Workbench.
<xref linkend="ugr.tools.tm.ae.basic.apply"/> contains examples of how to execute TextMarker rules in plain Java.
A good way to get started with TextMarker is to experiment with an example TextMarker project, e.g.,
<uri>https://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/example-projects/ExampleProject</uri>. This TextMarker project
contains some simple rules for processing citation metadata.
</para> | |
<para> | |
<xref linkend="ugr.tools.tm.language.language"/> and <xref linkend="ugr.tools.tm.workbench"/> provide | |
more detailed descriptions and can be referred to in order to gain knowledge about specific parts | |
of the TextMarker language or the TextMarker Workbench. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.overview.coreconcepts"> | |
<title>Core Concepts</title> | |
<para> | |
The TextMarker language is an imperative rule language extended with scripting elements. A TextMarker rule defines a
pattern of annotations with additional conditions. If this pattern applies, then the actions of the rule are performed
on the matched annotations. A rule is composed of a sequence of rule elements, and a rule element essentially consists of four parts:
a matching condition, an optional quantifier, a list of conditions and a list of actions.
The matching condition is typically an annotation type; the rule element then matches on the covered text of an annotation of that type.
The quantifier specifies whether the rule element has to match successfully and how often it may match.
The list of conditions specifies additional constraints that the matched text or annotations need to fulfill. The list of actions defines
the consequences of the rule and often creates new annotations or modifies existing annotations.
The actions are only applied if all rule elements of the rule have successfully matched. Examples of TextMarker rules can be found in
<xref linkend="ugr.tools.tm.overview.examples"/>.
</para> | |
<para> | |
When TextMarker rules are applied to a document, or more precisely to a CAS, they are always grouped in a script file. However, a TextMarker
script file contains not only rules, but also other statements. Each script file starts with a package declaration followed by
an optional list of imports. Then, common statements like rules, type declarations or blocks form the body and functionality of a script.
<xref linkend="ugr.tools.tm.ae.basic.apply"/> gives an example of how TextMarker scripts can be applied in plain Java.
TextMarker script files are organized in TextMarker projects, which are a concept of the TextMarker Workbench.
The structure of a TextMarker project is described in <xref linkend="section.ugr.tools.tm.workbench.projects"/>.
</para> | |
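<para>
As a rough illustration of this layout, a minimal script file might look like the following sketch. The package, type system and
script names are borrowed from the examples in <xref linkend="ugr.tools.tm.overview.examples"/> and serve only as placeholders.
</para>
<programlisting><![CDATA[PACKAGE uima.textmarker.example;

// optional imports
TYPESYSTEM my.package.NamedEntityTypeSystem;
SCRIPT uima.textmarker.example.SecondaryScript;

// body of the script: type declarations, rules and blocks
DECLARE Animal;
W{REGEXP("dog") -> MARK(Animal)};]]></programlisting>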
<para> | |
The inference of TextMarker rules, that is, the way in which the rules are applied, can be described as imperative, depth-first matching.
In contrast to similar rule-based systems, TextMarker rules are applied in the order in which they are defined in the script.
The imperative execution of the matching rules may have disadvantages, but it also has many advantages, such as faster development and
easier explanation. The second main property of the TextMarker inference is depth-first matching. When a rule matches on a pattern of annotations,
an alternative is always tracked until it has matched or failed before the next alternative is considered. Therefore, the behavior of a rule may change if
it has already matched on an earlier alternative and thus performed an action that influences some constraints of the rule.
Examples of how TextMarker rules are applied are given in <xref linkend="ugr.tools.tm.overview.examples"/>.
</para> | |
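<para>
The effect of applying rules in script order can be sketched with two rules adapted from the examples in
<xref linkend="ugr.tools.tm.overview.examples"/>. This is only an illustration: the alternation in the regular expression
assumes that the REGEXP condition accepts the usual Java regular expression syntax.
</para>
<programlisting><![CDATA[DECLARE Animal, AnimalEnum;
// applied first: creates the Animal annotations
W{REGEXP("dog|cat|bird") -> MARK(Animal)};
// applied second: relies on the Animal annotations created by the rule above
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting>
<para>
If the two rules were swapped, the enumeration rule would be applied before any Animal annotation exists and would
therefore find nothing to match on.
</para>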
<para> | |
The TextMarker language makes it possible to approach an annotation problem in different ways. Let us distinguish
some approaches as an example.
It is common in the TextMarker language to create many annotations of different types. These annotations are probably not the targeted annotations of the domain,
but they can help to incrementally approximate the annotations of interest. This enables the user to work <quote>bottom-up</quote> and <quote>top-down</quote>.
In the former approach, the rules incrementally add more complex annotations based on simpler ones until the target annotation can be created.
In the latter approach, the rules become more and more specific while partitioning the document into smaller segments, which finally results in the targeted annotation.
By using many <quote>helper</quote> annotations, the engineering task becomes easier and more comprehensible.
The TextMarker language provides distinct language elements for different tasks. There are, for example, actions
that create new annotations, actions that remove annotations and actions that modify the
offsets of an annotation. This enables, amongst other things, a transformation-based approach. The user starts by creating general rules that are able to
annotate most of the interesting text fragments. Then, instead of making these rules more complex by adding more conditions for situations where they fail,
additional rules are defined that correct the mistakes of the general rules, e.g., by deleting false positive annotations.
<xref linkend="ugr.tools.tm.overview.examples"/> provides some examples of how TextMarker rules can be engineered.
</para> | |
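<para>
A transformation-based approach could be sketched as follows. The rules are simplified variants of those discussed in
<xref linkend="ugr.tools.tm.overview.examples"/> and assume that annotations of the type <quote>Paragraph</quote> are already available.
</para>
<programlisting><![CDATA[DECLARE Headline;
// general rule: every paragraph that ends with a colon is a headline candidate
Paragraph{ENDSWITH(COLON) -> MARK(Headline)};
// correction rule: remove candidates that contain no words at all
Headline{-CONTAINS(W) -> UNMARK(Headline)};]]></programlisting>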
<para> | |
Manually writing rules is a tedious and error-prone process. The <link linkend="ugr.tools.tm.workbench">TextMarker Workbench</link>
was developed to facilitate writing rules by providing as much tooling support as possible. This includes, for example, syntax checking and auto completion, which
make the development less error-prone. The user can annotate documents and use these documents as unit tests for test-driven development or
quality maintenance. Sometimes, it is necessary to debug the rules because they did not match as expected. For this use case, the explanation perspective provides views
that explain every detail of the matching process. Finally, the TextMarker language can also be used by the tooling itself, for example, by the <quote>Query</quote> view.
Here, the user can use TextMarker rules as query statements in order to investigate annotated documents.
</para> | |
<para> | |
TextMarker integrates smoothly with Apache UIMA. First of all, the TextMarker rules are applied using a generic Analysis Engine, and thus TextMarker scripts can
easily be added to Apache UIMA pipelines. TextMarker also provides the functionality to import and use other UIMA components like Analysis Engines and Type Systems.
TextMarker rules can refer to every type defined in an imported type system, and the TextMarker Workbench generates a type system descriptor file containing all
types that were defined in a script file. Any Analysis Engine can be executed by rules as long as its implementation is available in the classpath. Therefore,
functionality implemented in an arbitrary Analysis Engine can be added and used within TextMarker.
</para> | |
</section> | |
<section id="ugr.tools.tm.overview.examples"> | |
<title>Learning by Example</title> | |
<para> | |
This section gives a short introduction to the TextMarker language by explaining the rule syntax
and inference with some simplified examples. It is recommended to use the TextMarker Workbench to write TextMarker rules
in order to benefit from features like syntax checking. A short description of how to install the TextMarker Workbench
is given <link linkend="section.ugr.tools.tm.workbench.install">here</link>. The following examples make use of the
annotations added by the default seeding of the TextMarker Analysis Engine. Their meaning is explained along with the examples.
</para> | |
<note><para> | |
The examples in this section are not valid script files as they are missing at least a package declaration. | |
In order to obtain a valid script file, please ensure that all used types are imported or declared and | |
that a package declaration like <quote>PACKAGE uima.textmarker.example;</quote> is added in the first line of the script. | |
</para></note> | |
<para> | |
The first example consists of a declaration of a type followed by a simple rule. A type declaration always starts with the keyword
<quote>DECLARE</quote> followed by the short name of the new type. The namespace of the type is equal to the package declaration of the script file.
It is also possible to create more complex types with features or specific parent types, but this is neglected for now.
In the example, a simple annotation type with the short name <quote>Animal</quote> is defined.
After the declaration of the type, a rule with one rule element is given.
TextMarker rules in general can consist of a sequence of rule elements. Simple rule elements themselves consist of four parts: a matching condition,
an optional quantifier, an optional list of conditions and an optional list of actions. The rule element in the
following example has the matching condition <quote>W</quote>, an annotation type standing for normal words.
Statements like declarations and rules always end with a semicolon.
</para> | |
<programlisting><![CDATA[DECLARE Animal; | |
W{REGEXP("dog") -> MARK(Animal)};]]></programlisting> | |
<para> | |
The rule element also contains one condition and one action, both surrounded by curly brackets. In order to distinguish conditions from actions,
they are separated by <quote>-></quote>. The condition <quote>REGEXP("dog")</quote> indicates that the matched
word must match the regular expression <quote>dog</quote>. If the matching condition and the additional regular expression are fulfilled, then the action
is executed, which creates a new annotation of the type <quote>Animal</quote> with the same offsets as the matched token.
The default seeder does not actually add annotations of the type <quote>W</quote>, but annotations of the types <quote>SW</quote> and
<quote>CW</quote> for lowercase words and capitalized words, which both have the parent type <quote>W</quote>.
</para> | |
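<para>
Since <quote>SW</quote> and <quote>CW</quote> are annotation types of their own, they can presumably be used as matching conditions
just like <quote>W</quote>. The following sketch illustrates this under that assumption; the capitalized variant <quote>Dog</quote> is only an illustration.
</para>
<programlisting><![CDATA[DECLARE Animal;
// considers only lowercase words such as "dog"
SW{REGEXP("dog") -> MARK(Animal)};
// considers only capitalized words such as "Dog"
CW{REGEXP("Dog") -> MARK(Animal)};]]></programlisting>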
<para> | |
Since it is tedious to create Animal annotations by matching on different regular expressions, we apply an external dictionary in the next example.
The first line defines a word list named <quote>AnimalsList</quote>, which is located in the resource folder (the file <quote>Animals.txt</quote>
contains one animal name per line). After the declaration of the type, a rule uses this word list to find all occurrences of animals
in the complete document.
</para> | |
<programlisting><![CDATA[WORDLIST AnimalsList = 'Animals.txt';
DECLARE Animal; | |
Document{-> MARKFAST(Animal, AnimalsList)}; | |
]]></programlisting> | |
<para> | |
The matching condition of the rule element refers to the complete document, or more specifically to the annotation of the type
<quote>DocumentAnnotation</quote>, which covers the whole document.
The action <quote>MARKFAST</quote> of this rule element then creates an annotation of the type <quote>Animal</quote> for each
entry of the dictionary <quote>AnimalsList</quote> that is found in the document.
</para> | |
<para> | |
The next example introduces rules with more than one rule element, one of which is a composed rule element. The following rule tries to
annotate occurrences of animals separated by commas, e.g., <quote>dog, cat, bird</quote>.
</para> | |
<programlisting><![CDATA[DECLARE AnimalEnum; | |
(Animal COMMA)+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting> | |
<para> | |
The rule consists of two rule elements, <quote>(Animal COMMA)+{-> MARK(AnimalEnum,1,2)}</quote> being the first rule element and
<quote>Animal</quote> the second one. Let's take a closer look at the first rule element. This rule element is actually composed of two normal rule elements,
namely <quote>Animal</quote> and <quote>COMMA</quote>, and contains a greedy quantifier and one action. Therefore, this rule element matches on
one Animal annotation and a following comma. This is repeated until one of the inner rule elements does not match anymore. Then, there has to be
another Animal annotation afterwards, specified by the second rule element of the rule. In this case, the rule matches and its action is executed:
The MARK action creates a new annotation of the type <quote>AnimalEnum</quote>. However, in contrast to the previous examples, this action also
contains two numbers. Those numbers refer to the rule elements that should be used to calculate the span of the created annotation. The numbers
<quote>1, 2</quote> state that the new annotation should start with the first rule element, the composed one, and should end with the second rule element.
</para> | |
<para> | |
Let's make the composed rule element a bit more complex. The following rule also matches on lists of animals that are
separated by semicolons. For this purpose, a disjunctive rule element is added, indicated by the symbol <quote>|</quote>, which matches on
annotations of the type <quote>COMMA</quote> or <quote>SEMICOLON</quote>.
</para> | |
<programlisting><![CDATA[(Animal (COMMA | SEMICOLON))+{-> MARK(AnimalEnum,1,2)} Animal;]]></programlisting> | |
<para> | |
Rule elements can of course contain more than one condition. The rule in the next example tries to identify headlines, which are bold,
underlined and end with a colon.
</para> | |
<programlisting><![CDATA[DECLARE Headline; | |
Paragraph{CONTAINS(Bold, 90, 100, true), | |
CONTAINS(Underlined, 90, 100, true), ENDSWITH(COLON) | |
-> MARK(Headline)};]]></programlisting> | |
<para> | |
The matching condition of this rule element is given by the type <quote>Paragraph</quote>, thus the rule takes a look at all Paragraph annotations.
The rule matches only if the three conditions, separated by commas, are fulfilled. The first condition <quote>CONTAINS(Bold, 90, 100, true)</quote> states that
90%-100% of the matched paragraph annotation should also be annotated with annotations of the type <quote>Bold</quote>. The boolean parameter <quote>true</quote>
indicates that the amount of Bold annotations should be calculated relative to the matched annotation. Therefore, the two numbers <quote>90,100</quote> are interpreted as
percentages. The exact calculation of the coverage depends on the tokenization of the document and is neglected for now. The second condition
<quote>CONTAINS(Underlined, 90, 100, true)</quote> consequently states that at least 90% of the paragraph should also be covered by annotations of the type <quote>Underlined</quote>.
The third condition <quote>ENDSWITH(COLON)</quote> finally forces the Paragraph annotation to end with a colon. It is only fulfilled if there is an annotation of the type
<quote>COLON</quote> whose end offset is equal to the end offset of the matched Paragraph annotation.
</para> | |
<para> | |
The readability and maintainability of rules do not improve if more and more conditions are added.
One of the strengths of the TextMarker language is that it always provides different approaches to solve an annotation task. The next two examples
introduce actions for transformation-based rules.
</para> | |
<programlisting><![CDATA[Headline{-CONTAINS(W) -> UNMARK(Headline)};]]></programlisting> | |
<para> | |
This rule consists of one condition and one action. The condition <quote>-CONTAINS(W)</quote> is negated (indicated by the character <quote>-</quote>),
and is therefore only fulfilled if there are no annotations of the type <quote>W</quote> within the bounds of the matched Headline annotation.
The action <quote>UNMARK(Headline)</quote> removes the matched Headline annotation. Put simply, headlines that contain no words at all are not headlines.
</para> | |
<para> | |
The next rule does not remove an annotation, but changes its offsets dependent on the context. | |
</para> | |
<programlisting><![CDATA[Headline{-> SHIFT(Headline, 1, 2)} COLON;]]></programlisting>
<para> | |
Here, the action <quote>SHIFT(Headline, 1, 2)</quote> expands the matched Headline annotation to the next colon if that Headline annotation | |
is followed by a COLON annotation. | |
</para> | |
<para> | |
TextMarker rules can of course contain arbitrary conditions and actions, which is illustrated by the next example. | |
</para> | |
<programlisting><![CDATA[DECLARE Month, Year, Date; | |
ANY{INLIST(MonthsList) -> MARK(Month), MARK(Date,1,3)} | |
PERIOD? NUM{REGEXP(".{2,4}") -> MARK(Year)};]]></programlisting>
<para> | |
This rule consists of three rule elements. The first one matches on any token whose covered text occurs in a word list named <quote>MonthsList</quote>.
The second rule element is optional and does not need to match, which is indicated by the quantifier <quote>?</quote>. The last rule element matches
on numbers that fulfill the regular expression <quote>REGEXP(".{2,4}")</quote> and are therefore between two and four characters long.
If this rule successfully matches on a text passage, then its three actions are executed: An annotation of the type <quote>Month</quote> is created for the first rule element,
an annotation of the type <quote>Year</quote> is created for the last rule element and an annotation of the type <quote>Date</quote>
is created for the span of all three rule elements. If the word list contains the correct entries, then this rule matches on strings like
<quote>Dec. 2004</quote>, <quote>July 85</quote> or <quote>11.2008</quote> and creates the corresponding annotations.
</para> | |
<para> | |
After introducing the composition of rule elements, the default matching strategy is examined. The two rules in the next example each create an annotation
for a sequence of arbitrary tokens and differ only in one condition.
</para> | |
<programlisting><![CDATA[DECLARE Text1, Text2; | |
ANY+{ -> MARK(Text1)}; | |
ANY+{-PARTOF(Text2) -> MARK(Text2)};]]></programlisting> | |
<para> | |
The first rule matches on each occurrence of an arbitrary token and continues this until the end of the document is reached.
This is of course caused by the greedy quantifier <quote>+</quote>. Note that this rule considers each occurrence of a token and is therefore
executed for each token, resulting in many overlapping annotations. This behavior can be illustrated with an example:
When applied to the document <quote>Peter works for Frank</quote>, the rule creates four annotations with the covered texts
<quote>Peter works for Frank</quote>, <quote>works for Frank</quote>, <quote>for Frank</quote> and <quote>Frank</quote>.
The rule first tries to match on the token <quote>Peter</quote> and continues its matching. Then, it tries to match on the token <quote>works</quote> and
continues its matching, and so on.
</para> | |
<para> | |
In this example, the second rule only creates one annotation, which covers the complete document. This is caused by the additional
condition <quote>-PARTOF(Text2)</quote>. The PARTOF condition is fulfilled if the matched annotation is located within an annotation of the given type, or
put simply, if the matched annotation is part of an annotation of the type <quote>Text2</quote>. When applied to the
document <quote>Peter works for Frank</quote>, the rule matches on the first token <quote>Peter</quote>, continues its match and
creates an annotation of the type <quote>Text2</quote> for the complete document. Then it tries to match on the second token <quote>works</quote>, but fails
because this token is already part of a Text2 annotation.
</para> | |
<para> | |
TextMarker rules can not only be used to create or modify annotations, but also to create features for annotations. The next example defines | |
and assigns a relation about employment, by storing the given annotations as feature values. | |
</para> | |
<programlisting><![CDATA[DECLARE Annotation EmplRelation | |
(Employee employeeRef, Employer employerRef); | |
Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation, | |
"employeeRef" = Employee, "employerRef" = Employer)};]]></programlisting> | |
<para> | |
The first statement of this example is a declaration that defines a new type of annotation named <quote>EmplRelation</quote>. | |
This annotation has two features: | |
One feature with the name <quote>employeeRef</quote> of the type <quote>Employee</quote> and | |
one feature with the name <quote>employerRef</quote> of the type <quote>Employer</quote>. | |
The second statement of this example, a simple rule, creates one annotation of the type <quote>EmplRelation</quote> for
each Sentence annotation that contains at least one annotation of the type EmploymentIndicator. In addition to creating an annotation,
the CREATE action also assigns an annotation of the type <quote>Employee</quote>, which needs to be located within the span of the matched sentence,
to the feature <quote>employeeRef</quote> and an Employer annotation to the feature <quote>employerRef</quote>. The annotations mentioned in this
example of course need to be present in advance.
</para> | |
<para> | |
In the last example, the values of features were defined as annotation types. However, primitive
types can also be used, as shown in the next example together with a short introduction to variables.
</para> | |
<programlisting><![CDATA[DECLARE Annotation MoneyAmount(STRING currency, INT amount); | |
INT moneyAmount; | |
STRING moneyCurrency; | |
NUM{PARSE(moneyAmount)} SPECIAL{REGEXP("€") -> MATCHEDTEXT(moneyCurrency), | |
CREATE(MoneyAmount, 1, 2, "amount" = moneyAmount, | |
"currency" = moneyCurrency)};]]></programlisting> | |
<para> | |
First, a new annotation type with the name <quote>MoneyAmount</quote> and two features is defined, one string feature and one integer feature.
Then two TextMarker variables are declared, one integer variable and one string variable. The rule matches on a number, whose value is stored
in the variable <quote>moneyAmount</quote>, followed by a special token that needs to be equal to the string <quote>€</quote>. Then,
the covered text of the special annotation is stored in the string variable <quote>moneyCurrency</quote> and an annotation of the
type <quote>MoneyAmount</quote> spanning both rule elements is created. Additionally, the variables are assigned as feature values.
</para> | |
<para> | |
TextMarker script files with many rules can quickly become confusing. Therefore, TextMarker allows importing other script files in order to increase
the modularity of a project or to create rule libraries. The next example imports the rules together with all known types of another script file
and calls that script file.
</para> | |
<programlisting><![CDATA[SCRIPT uima.textmarker.example.SecondaryScript; | |
Document{-> CALL(SecondaryScript)};]]></programlisting> | |
<para> | |
The script file with the name <quote>SecondaryScript.tm</quote> located in the package <quote>uima/textmarker/example</quote> is imported and executed | |
by the CALL action on the complete document. The script needs to be located in the folder specified by the parameter | |
<link linkend="ugr.tools.tm.ae.basic.parameter.scriptPaths">scriptPaths</link>. It is also possible to import script files of other TextMarker projects, e.g., | |
by adapting the configuration parameters of the TextMarker Analysis Engine or | |
by setting a project reference in the project properties of a TextMarker project. | |
</para> | |
<para> | |
The types of important annotations of the application are often defined in a separate type system. The next example shows how to import those types. | |
</para> | |
<programlisting><![CDATA[TYPESYSTEM my.package.NamedEntityTypeSystem; | |
Person{PARTOF(Organization) -> UNMARK(Person)}; | |
]]></programlisting> | |
<para> | |
The type system descriptor file with the name <quote>NamedEntityTypeSystem.xml</quote> located in the package <quote>my/package</quote> is imported. | |
The descriptor needs to be located in a folder specified by the parameter | |
<link linkend="ugr.tools.tm.ae.basic.parameter.descriptorPaths">descriptorPaths</link>. | |
</para> | |
<para> | |
It is sometimes easier to express some functionality with control structures known from programming languages rather than to engineer all functionality
only with matching rules. The TextMarker language provides the BLOCK element for some of those use cases.
The TextMarker BLOCK element starts with the keyword <quote>BLOCK</quote> followed by its name in parentheses. The name of a block has two purposes:
On the one hand, it is easier to distinguish blocks if they have different names, e.g., in the
<link linkend="section.ugr.tools.tm.workbench.explain_perspective">explain perspective</link> of the TextMarker Workbench. On the other hand,
the name can be used to execute the block using the CALL action. This makes it possible to access only specific sets of rules of other script files,
or to implement a recursive call of rules. After the name of the block, a single rule element is given, which has curly brackets
even if no conditions or actions are specified. Then, the body of the block is framed by curly brackets.
</para> | |
<programlisting><![CDATA[BLOCK(English) Document{FEATURE("language", "en")} {
// rules for English documents
}
BLOCK(German) Document{FEATURE("language", "de")} {
// rules for German documents
}]]></programlisting>
<para> | |
This example contains two simple BLOCK statements. The rules defined within a block are only executed if the condition in the head of the block is fulfilled.
Therefore, the rules of the first block are only considered if the feature <quote>language</quote> of the document annotation has the value <quote>en</quote>.
Accordingly, the rules of the second block are only considered for German documents.
</para> | |
<para> | |
The rule element of the block definition can also refer to annotation types other than <quote>Document</quote>. While the last example implemented something similar
to an if-statement, the next example provides a showcase for something similar to a for-each-statement.
</para> | |
<programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP; | |
BLOCK(ForEach) Sentence{} { | |
Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)}; | |
} | |
]]></programlisting> | |
<para> | |
Here, the rule in the block statement is performed for each occurrence of an annotation of the type <quote>Sentence</quote>.
The rule within the block matches on the complete document, which is the current sentence in the context of the block statement. | |
As a consequence, this example creates an annotation of the type <quote>SentenceWithNoLeadingNP</quote> for each sentence | |
that does not start with an NP annotation. | |
</para> | |
<para> | |
Let's take a closer look at what exactly the TextMarker rules match. The following rule matches on a word followed by another word:
</para> | |
<programlisting><![CDATA[W W;]]></programlisting> | |
<para> | |
To be more precise, this rule matches on all documents like <quote>Apache UIMA</quote>, <quote> Apache UIMA </quote>, <quote>ApacheUIMA</quote>,
<quote><![CDATA[Apache <b>UIMA</b>]]></quote>. There are two main reasons for this: First of all, this depends on how the available annotations are defined. The default seeder
for the initial annotations creates an annotation for all characters until an upper case character occurs. Thus, the string <quote>ApacheUIMA</quote> consists of
two tokens.
More importantly, however, the TextMarker language provides a concept of visibility of the annotations. By default, all annotations of the types
<quote>SPACE</quote>, <quote>NBSP</quote>, <quote>BREAK</quote> and <quote>MARKUP</quote> (whitespace and XML elements) are filtered and not visible. This of course also holds for
their covered text. The rule elements skip all positions of the
document where those annotations occur. Therefore, the rule in the last example matches on all examples. Without the default filtering settings,
with all annotations set to visible, the rule matches only on the document <quote>ApacheUIMA</quote> since it is the only one that contains two word annotations without
any whitespace between them.
</para> | |
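<para>
The effect of the visibility concept can be illustrated with a small, self-contained Java sketch. It is not the implementation used by the
TextMarker engine: the class <quote>VisibilitySketch</quote> and its method are hypothetical, the documents are given as hand-written sequences of
token type names, and <quote>W</quote> is used as a simplified label for any word token. The sketch only shows why skipping filtered types lets the rule match on all four documents.
</para>
<programlisting><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of filtered (invisible) types; not part of TextMarker. */
final class VisibilitySketch {

  /** Types that the text above lists as filtered by default. */
  static final Set<String> DEFAULT_FILTER =
      new HashSet<String>(Arrays.asList("SPACE", "NBSP", "BREAK", "MARKUP"));

  /** Returns true if two "W" tokens are adjacent once all filtered types are skipped. */
  static boolean matchesTwoWords(String[] tokenTypes, Set<String> filteredTypes) {
    List<String> visible = new ArrayList<String>();
    for (String type : tokenTypes) {
      if (!filteredTypes.contains(type)) {
        visible.add(type);
      }
    }
    for (int i = 0; i + 1 < visible.size(); i++) {
      if (visible.get(i).equals("W") && visible.get(i + 1).equals("W")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Assumed token sequences for "Apache UIMA" and "ApacheUIMA".
    String[] withSpace = { "W", "SPACE", "W" };
    String[] withoutSpace = { "W", "W" };
    Set<String> nothingFiltered = new HashSet<String>();
    System.out.println(matchesTwoWords(withSpace, DEFAULT_FILTER));
    System.out.println(matchesTwoWords(withSpace, nothingFiltered));
    System.out.println(matchesTwoWords(withoutSpace, nothingFiltered));
  }
}
```
]]></programlisting>
<para>
With the default filter, the whitespace token is skipped and the two word tokens become neighbors. With an empty filter, only the sequence without
any token between the two words still matches.
</para>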
<para> | |
The filtering setting can also be modified by the TextMarker rules themselves. The next example provides some rules that extend and limit | |
the amount of visible text of the document. | |
</para> | |
<programlisting><![CDATA[Sentence; | |
Document{-> RETAINTYPE(SPACE)}; | |
Sentence; | |
Document{-> FILTERTYPE(CW)}; | |
Sentence; | |
Document{-> RETAINTYPE, FILTERTYPE};]]></programlisting> | |
<para> | |
The first rule simply matches on sentences that do not start with any filtered type. Sentences that start with whitespace or markup,
for example, are not considered.
The second rule retains all text that is covered by annotations of the type <quote>SPACE</quote>, meaning
that the rule elements are now sensitive to whitespace. Therefore, the third rule also matches on sentences that start with whitespace.
The fourth rule filters the type <quote>CW</quote> with the consequence that all capitalized words are invisible.
The fifth rule can therefore only match on Sentence annotations that do not start with a capitalized word.
The last rule finally resets the filtering setting to the default configuration in the TextMarker Analysis Engine.
</para> | |
<para> | |
The next example gives a showcase for importing external Analysis Engines and for modifying the documents by creating a new view named <quote>modified</quote>.
Additional Analysis Engines can be imported with the keyword <quote>ENGINE</quote> followed by the name of the descriptor. These imported Analysis Engines can be | |
executed with the actions <quote>CALL</quote> or <quote>EXEC</quote>. If the executed Analysis Engine adds, removes or modifies annotations, then their types need | |
to be mentioned when calling the descriptor, or else those annotations will not be correctly processed by the following TextMarker rules. | |
</para> | |
<programlisting><![CDATA[ENGINE utils.Modifier; | |
Date{-> DEL}; | |
MoneyAmount{-> REPLACE("<MoneyAmount/>")}; | |
Document{-> COLOR(Headline, "green")}; | |
Document{-> EXEC(Modifier)}; | |
]]></programlisting> | |
<para> | |
In this example, we first import an Analysis Engine defined by the descriptor <quote>Modifier.xml</quote> located in the folder <quote>utils</quote>.
The descriptor must, of course, be located in a folder specified by the parameter <link linkend="ugr.tools.tm.ae.basic.parameter.descriptorPaths">descriptorPaths</link>.
The first rule deletes all text covered by annotations of the type <quote>Date</quote>. The second rule replaces the text of all annotations of the type <quote>MoneyAmount</quote>
with the string <quote><![CDATA[<MoneyAmount/>]]></quote>. The third rule records that the background color of text in Headline annotations should be set to green. The last rule
finally performs all of these changes in an additional view called <quote>modified</quote>.
</para> | |
</section> | |
<section id="ugr.tools.tm.ae"> | |
<title>UIMA Analysis Engines</title> | |
<para>This section gives an overview of the UIMA Analysis Engines shipped with TextMarker. The most | |
important one is <quote>TextMarkerEngine</quote>, a generic analysis engine, which is able to interpret | |
and execute script files. The other analysis engines provide support for some additional functionality or | |
add certain types of annotations. | |
</para> | |
<section id="ugr.tools.tm.ae.basic"> | |
<title>TextMarker Engine</title> | |
<para> | |
This generic Analysis Engine is the most important one for the TextMarker language since it is | |
responsible for applying the TextMarker rules on a CAS. Its functionality is configured by the configuration parameters, | |
which, for example, specify the rule file that should be executed. In the TextMarker IDE, a basic template named <quote>BasicEngine.xml</quote> | |
is given in the descriptor folder of a TextMarker project and correctly configured descriptors typically named <quote>MyScriptEngine.xml</quote> | |
are generated in the descriptor folder corresponding to the package namespace of the script file. | |
The available configuration parameters of the TextMarker Analysis Engine are described in the following. | |
</para> | |
<section id="ugr.tools.tm.ae.basic.apply"> | |
<title>Apply TextMarker Analysis Engine in plain Java</title> | |
<para> | |
Let's assume that you wrote the TextMarker rules using the TextMarker Workbench, which already creates correctly configured descriptors. | |
In this case, you can simply use the following Java code to apply the TextMarker script.
</para> | |
<programlisting><![CDATA[File specFile = new File("pathToMyWorkspace/MyProject/descriptor/"+ | |
"my/package/MyScriptEngine.xml"); | |
XMLInputSource in = new XMLInputSource(specFile); | |
ResourceSpecifier specifier = UIMAFramework.getXMLParser(). | |
parseResourceSpecifier(in); | |
// for import by name... set the datapath in the ResourceManager | |
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier); | |
CAS cas = ae.newCAS(); | |
cas.setDocumentText("This is my document."); | |
ae.process(cas);]]></programlisting> | |
<note><para> | |
The TextMarker Analysis Engine utilizes type priorities. If the CAS object is
not created using the TextMarker Analysis Engine descriptor but by other means, then please
provide the necessary type priorities for a valid execution of the TextMarker rules.
</para></note> | |
<para> | |
If the TextMarker script was written, for example, with a common text editor and no configured descriptors are available yet,
then the following Java code can be used. As written, it is only applicable for executing single script files that do not import
additional components or scripts. Otherwise, the other parameters, e.g., <quote>additionalScripts</quote>, need to be configured correctly as well.
</para> | |
<programlisting><![CDATA[URL aedesc = TextMarkerEngine.class.getResource("BasicEngine.xml"); | |
XMLInputSource inae = new XMLInputSource(aedesc); | |
ResourceSpecifier specifier = UIMAFramework.getXMLParser(). | |
parseResourceSpecifier(inae); | |
ResourceManager resMgr = UIMAFramework.newDefaultResourceManager(); | |
AnalysisEngineDescription aed = (AnalysisEngineDescription) specifier; | |
TypeSystemDescription basicTypeSystem = aed.getAnalysisEngineMetaData(). | |
getTypeSystem(); | |
Collection<TypeSystemDescription> tsds = | |
new ArrayList<TypeSystemDescription>(); | |
tsds.add(basicTypeSystem); | |
// add some other type system descriptors | |
// that are needed by your script file | |
TypeSystemDescription mergeTypeSystems = CasCreationUtils. | |
mergeTypeSystems(tsds); | |
aed.getAnalysisEngineMetaData().setTypeSystem(mergeTypeSystems); | |
aed.resolveImports(resMgr); | |
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed, | |
resMgr, null); | |
File scriptFile = new File("path/to/file/MyScript.tm"); | |
ae.setConfigParameterValue(TextMarkerEngine.SCRIPT_PATHS, | |
new String[] { scriptFile.getParentFile().getAbsolutePath() });
String name = scriptFile.getName().substring(0, | |
scriptFile.getName().length() - 3); | |
ae.setConfigParameterValue(TextMarkerEngine.MAIN_SCRIPT, name); | |
ae.reconfigure(); | |
CAS cas = ae.newCAS(); | |
cas.setDocumentText("This is my document."); | |
ae.process(cas);]]></programlisting> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
The configuration parameters of the TextMarker Analysis Engine can be separated into three | |
different groups: parameters for the setup of the environment (<link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link> | |
to <link linkend='ugr.tools.tm.ae.basic.parameter.additionalExtensions'>additionalExtensions</link>), | |
parameters that change the behavior of the analysis engine (<link linkend='ugr.tools.tm.ae.basic.parameter.reloadScript'>reloadScript</link> | |
to <link linkend='ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>) | |
and parameters for creating additional information about how the rules were executed
(<link linkend='ugr.tools.tm.ae.basic.parameter.debug'>debug</link> | |
to <link linkend='ugr.tools.tm.ae.basic.parameter.createdBy'>createdBy</link>). First, a short overview of the configuration parameters is given in | |
<xref linkend='table.ugr.tools.tm.ae.parameter' />. Then all parameters are described in detail with examples. | |
</para> | |
<para> | |
To change the value of any configuration parameter within a TextMarker script, the CONFIGURE action (see <xref linkend='ugr.tools.tm.language.actions.configure' />) | |
can be used. For changing behaviour of <link linkend='ugr.tools.tm.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link> the DYNAMICANCHORING action | |
(see <xref linkend='ugr.tools.tm.language.actions.dynamicanchoring' />) is recommended. | |
</para> | |
<para> | |
<table id="table.ugr.tools.tm.ae.parameter" frame="all"> | |
<title>Configuration parameters of the TextMarker Analysis Engine </title> | |
<tgroup cols="3" colsep="1" rowsep="1"> | |
<colspec colname="c1" colwidth="1.2*" /> | |
<colspec colname="c2" colwidth="2*" /> | |
<colspec colname="c3" colwidth="0.8*" /> | |
<thead> | |
<row> | |
<entry align="center">Name</entry> | |
<entry align="center">Short description</entry> | |
<entry align="center">Type</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link> | |
</entry> | |
<entry>Name with complete namespace of the script which will be interpreted and | |
executed by the analysis engine. | |
</entry> | |
<entry>Single String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptEncoding'>scriptEncoding</link> | |
</entry> | |
<entry>Encoding of all TextMarker script files.</entry> | |
<entry>Single String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link> | |
</entry> | |
<entry>List of absolute locations, which contain the necessary script files like
the main script. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.descriptorPaths'>descriptorPaths</link> | |
</entry> | |
<entry>List of absolute locations, which contain the necessary descriptor files
like type systems. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.resourcePaths'>resourcePaths</link> | |
</entry> | |
<entry>List of absolute locations, which contain the necessary resource files like
word lists. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalScripts'>additionalScripts</link> | |
</entry> | |
<entry>List of names with complete namespace of additional scripts, which can be | |
referred to. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngines'>additionalEngines</link> | |
</entry> | |
<entry>List of names with complete namespace of additional analysis engines, which | |
can be called by TextMarker rules. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngineLoaders'>additionalEngineLoaders</link> | |
</entry> | |
<entry>List of class names of implementations that are able to perform additional | |
tasks when loading external analysis engines.
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.additionalExtensions'>additionalExtensions</link> | |
</entry> | |
<entry>List of factory classes for additional extensions of the TextMarker language | |
like proprietary conditions. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.reloadScript'>reloadScript</link> | |
</entry> | |
<entry>Option to initialize the rule script each time the analysis engine processes | |
a CAS. | |
</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.seeders'>seeders</link> | |
</entry> | |
<entry>List of class names that provide additional annotations before the rules are | |
executed. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.defaultFilteredTypes'>defaultFilteredTypes</link> | |
</entry> | |
<entry>List of complete type names of annotations that are invisible by default. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.removeBasics'>removeBasics</link> | |
</entry> | |
<entry>Option to remove all inference annotations after execution of the rule script. | |
</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.dynamicAnchoring'>dynamicAnchoring</link> | |
</entry> | |
<entry>Option to allow rule matches to start at any rule element.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.lowMemoryProfile'>lowMemoryProfile</link> | |
</entry> | |
<entry>Option to decrease the memory consumption when processing a large CAS.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link> | |
</entry> | |
<entry>Option to activate a different inferencer for composed rule elements.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.debug'>debug</link> | |
</entry> | |
<entry>Option to add debug information to the CAS.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.debugWithMatches'>debugWithMatches</link> | |
</entry> | |
<entry>Option to add information about the rule matches to the CAS.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.debugOnlyFor'>debugOnlyFor</link> | |
</entry> | |
<entry>List of rule ids. If provided, then debug information is only created for | |
those rules. | |
</entry> | |
<entry>Multi String</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.profile'>profile</link> | |
</entry> | |
<entry>Option to add profile information to the CAS.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.statistics'>statistics</link> | |
</entry> | |
<entry>Option to add statistics of conditions and actions to the CAS.</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
<row> | |
<entry> | |
<link linkend='ugr.tools.tm.ae.basic.parameter.createdBy'>createdBy</link> | |
</entry> | |
<entry>Option to add information about which rule created an annotation.
</entry> | |
<entry>Single Boolean</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</table> | |
</para> | |
<section id="ugr.tools.tm.ae.basic.parameter.mainScript"> | |
<title>mainScript</title> | |
<para> | |
This parameter specifies the rule file that will be executed by the analysis engine and is | |
therefore one of the most important ones. The exact name of the script is given by the complete namespace of the file, which corresponds to its location
relative to the given parameter <link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link>.
The single names of packages (or folders) are separated by periods. An exemplary value for this parameter could be "org.apache.uima.Main",
where "Main" specifies the file containing the rules and "org.apache.uima" its package.
In this case, the analysis engine loads the script file "Main.tm", which is located in the folder structure "org/apache/uima/". | |
This parameter has no default value and has to be provided, although it is not specified as mandatory. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.scriptEncoding"> | |
<title>scriptEncoding</title> | |
<para> | |
This parameter specifies the encoding of the rule files. Its default value is "UTF-8". | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.scriptPaths"> | |
<title>scriptPaths</title> | |
<para> | |
The parameter scriptPaths refers to a list of String values, which specify the possible locations of script files. | |
The given locations are absolute paths. A typical value for this parameter is for example "C:/TextMarker/MyProject/script/". | |
If the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link> is set to org.apache.uima.Main, | |
then the absolute path of the script file has to be "C:/TextMarker/MyProject/script/org/apache/uima/Main.tm". | |
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java. | |
</para> | |
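<para>
The relation between <quote>scriptPaths</quote> and <quote>mainScript</quote> can be sketched in plain Java. The class
<quote>ScriptLocationSketch</quote> and its methods are hypothetical helpers that are not part of the TextMarker API. The sketch only mirrors the
mapping described above, assuming that periods in the namespace correspond to folder separators and that script files use the extension <quote>.tm</quote>.
</para>
<programlisting><![CDATA[
```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that mirrors the described lookup; not part of TextMarker. */
final class ScriptLocationSketch {

  /** Converts a namespace such as "org.apache.uima.Main" to "org/apache/uima/Main.tm". */
  static String toRelativePath(String mainScript) {
    return mainScript.replace('.', '/') + ".tm";
  }

  /** Builds one candidate file location per scriptPaths entry, in the given order. */
  static List<String> candidates(String[] scriptPaths, String mainScript) {
    List<String> result = new ArrayList<String>();
    String relative = toRelativePath(mainScript);
    for (String root : scriptPaths) {
      // Treat roots with and without a trailing slash the same way.
      String base = root.endsWith("/") ? root : root + "/";
      result.add(base + relative);
    }
    return result;
  }

  public static void main(String[] args) {
    String[] scriptPaths = { "C:/TextMarker/MyProject/script/" };
    for (String candidate : candidates(scriptPaths, "org.apache.uima.Main")) {
      // Whether the file exists depends on the local machine.
      System.out.println(candidate + " exists: " + new File(candidate).exists());
    }
  }
}
```
]]></programlisting>
<para>
Listing the candidates in the order of the configured paths resembles a class path lookup in Java. How the engine itself chooses between several
matching locations is not covered by this sketch.
</para>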
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.descriptorPaths"> | |
<title>descriptorPaths</title> | |
<para> | |
This parameter specifies the possible locations for descriptors like analysis engines or type systems, similar to the parameter | |
<link linkend='ugr.tools.tm.ae.basic.parameter.scriptPaths'>scriptPaths</link> for the script files. A typical value for this parameter | |
is for example "C:/TextMarker/MyProject/descriptor/". | |
The relative values of the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.additionalEngines'>additionalEngines</link> are | |
resolved to these absolute locations. | |
This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.resourcePaths"> | |
<title>resourcePaths</title> | |
<para> | |
This parameter specifies the possible locations of additional resources like word lists or CSV tables. The string values have to contain absolute | |
locations, for example, "C:/TextMarker/MyProject/resources/". | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.additionalScripts"> | |
<title>additionalScripts</title> | |
<para> | |
The parameter additionalScripts is defined as a list of string values and contains script files, which are additionally loaded by the analysis engine. These script files are specified by their | |
complete namespace, exactly like the value of the parameter <link linkend='ugr.tools.tm.ae.basic.parameter.mainScript'>mainScript</link> | |
and can be referred to by language elements, e.g., by executing the contained rules. An exemplary value of this parameter is "org.apache.uima.SecondaryScript". In this example, the main script could import
this script file by the declaration "SCRIPT org.apache.uima.SecondaryScript;" and then could execute it with the rule | |
"Document{-> CALL(SecondaryScript)};". | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.additionalEngines"> | |
<title>additionalEngines</title> | |
<para> | |
This parameter contains a list of additional analysis engines, which can be executed by the TextMarker rules. The single values | |
are given by the name of the analysis engine with their complete namespace and have to be located relative to one value of the parameter | |
<link linkend='ugr.tools.tm.ae.basic.parameter.descriptorPaths'>descriptorPaths</link>, the location, where the analysis engine searches for the descriptor file. | |
An example for one value of the parameter is "utils.HtmlAnnotator", which points to the descriptor "HtmlAnnotator.xml" in the folder "utils".
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.additionalEngineLoaders"> | |
<title>additionalEngineLoaders</title> | |
<para> | |
The parameter "additionalEngineLoaders" specifies a list of optional implementations of the interface
"org.apache.uima.textmarker.extensions.IEngineLoader", which can be used for application-specific configuration of
additional analysis engines.
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.additionalExtensions"> | |
<title>additionalExtensions</title> | |
<para> | |
This parameter specifies optional extensions of the TextMarker language. The elements of the string list must implement the interface | |
"org.apache.uima.textmarker.extensions.ITextMarkerExtension". With those extensions, application-specific conditions and actions can be | |
added to the set of provided ones. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.reloadScript"> | |
<title>reloadScript</title> | |
<para> | |
This boolean parameter indicates whether the script or resource files should be reloaded when processing a CAS. The default value is set to false. | |
In this case, the script files are loaded when the analysis engine is initialized. If script files or resource files are changed while a collection of
documents is being processed, e.g., a dictionary is still being filled, then the parameter needs to be set to true in order to include the changes.
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.seeders"> | |
<title>seeders</title> | |
<para> | |
This list of string values refers to implementations of the interface "org.apache.uima.textmarker.seed.TextMarkerAnnotationSeeder", | |
which can be used to automatically add annotations to the CAS. The default value of the parameter is a single seeder, namely "org.apache.uima.textmarker.seed.DefaultSeeder" | |
that adds annotations for token classes like CW, MARKUP or SEMICOLON. Remember that additional annotations can also be added with | |
an additional engine that is executed by a TextMarker rule. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.defaultFilteredTypes"> | |
<title>defaultFilteredTypes</title> | |
<para> | |
This parameter specifies a list of types, which are filtered by default when executing a script file. Using the default values of this parameter, | |
whitespaces, line breaks and markup elements are not visible to TextMarker rules. The visibility of annotations and therefore the covered text can be changed | |
using the actions <link linkend='ugr.tools.tm.language.actions.filtertype'>FILTERTYPE</link> and | |
<link linkend='ugr.tools.tm.language.actions.retaintype'>RETAINTYPE</link>. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.removeBasics"> | |
<title>removeBasics</title> | |
<para> | |
This parameter specifies whether the inference annotations created by the analysis engine should be removed after processing the CAS. | |
The default value is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.dynamicAnchoring"> | |
<title>dynamicAnchoring</title> | |
<para> | |
If this parameter is set to true, then the TextMarker rules are not forced to start to match with the first rule element. | |
Instead, the rule element referring to the rarest type is chosen as the starting point. Therefore, this option can be utilized to optimize the performance.
Please mind that the matching result can vary in some cases when greedy rule elements are applied. | |
The default value is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.lowMemoryProfile"> | |
<title>lowMemoryProfile</title> | |
<para> | |
This parameter specifies whether the memory consumption should be reduced. This parameter should be set to true for | |
very large CAS documents (e.g., > 500k tokens), but it also reduces the performance. The default value is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.simpleGreedyForComposed"> | |
<title>simpleGreedyForComposed</title> | |
<para> | |
This parameter specifies whether a different inference strategy for composed rule elements should be applied. This option is only necessary
if the composed rule element is expected to match very often, e.g., a rule element like (ANY ANY).
The default value of this parameter is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.debug"> | |
<title>debug</title> | |
<para> | |
If this parameter is set to true, then additional information about the execution of a rule script is added to the CAS. | |
The actual information is specified by the following parameters. | |
The default value of this parameter is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.debugWithMatches"> | |
<title>debugWithMatches</title> | |
<para> | |
This parameter specifies whether the match information (covered text) of the rules should be stored in the CAS.
The default value of this parameter is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.debugOnlyFor"> | |
<title>debugOnlyFor</title> | |
<para> | |
This parameter specifies a list of rule ids that enumerate the rules for which debug information should be created.
No specific ids are given by default. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.profile"> | |
<title>profile</title> | |
<para> | |
If this parameter is set to true, then additional information about the runtime of applied rules is added to the CAS. | |
The default value of this parameter is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.statistics"> | |
<title>statistics</title> | |
<para> | |
If this parameter is set to true, then additional information about the runtime of TextMarker language elements like conditions and actions | |
is added to the CAS. | |
The default value of this parameter is set to false. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.basic.parameter.createdBy"> | |
<title>createdBy</title> | |
<para> | |
If this parameter is set to true, then additional information is added to the CAS about what annotation was created by which rule. | |
The default value of this parameter is set to false. | |
</para> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tools.tm.ae.annotationwriter"> | |
<title>Annotation Writer</title> | |
<para> | |
This Analysis Engine can be utilized to write the covered text of annotations to a text file, where each covered text is put on a new line.
If the Analysis Engine, for example, is configured for the type uima.example.Person, then the covered texts of all person annotations are stored
in a text file, one person per line.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. | |
</para> | |
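<para>
The output format described above can be sketched in plain Java. This is not the implementation of the Annotation Writer: the class
<quote>CoveredTextLinesSketch</quote> is hypothetical, and annotations are represented only by assumed begin and end offsets into the document text
instead of UIMA annotation objects.
</para>
<programlisting><![CDATA[
```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of the one-covered-text-per-line format; not part of TextMarker. */
final class CoveredTextLinesSketch {

  /** Puts the covered text of each span (begin inclusive, end exclusive) on its own line. */
  static String toLines(String documentText, int[][] spans) {
    StringBuilder out = new StringBuilder();
    for (int[] span : spans) {
      out.append(documentText, span[0], span[1]).append('\n');
    }
    return out.toString();
  }

  /** Writes the lines to the given file using UTF-8. */
  static void write(Path target, String lines) throws IOException {
    Files.write(target, lines.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    String text = "Peter met Mary in Paris.";
    // Assumed offsets of two person annotations: "Peter" and "Mary".
    int[][] persons = { { 0, 5 }, { 10, 14 } };
    Path target = Files.createTempFile("output", ".txt");
    write(target, toLines(text, persons));
    System.out.println("Wrote " + target);
  }
}
```
]]></programlisting>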
<section id="ugr.tools.tm.ae.annotationwriter.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.ae.annotationwriter.parameter.output"> | |
<title>Output</title> | |
<para> | |
This string parameter specifies the absolute path of the resulting file named <quote>output.txt</quote>. However, if an annotation of the | |
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative | |
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds | |
the SourceDocumentInformation annotation when the user launches a script file. The default value of this parameter is <quote>/../output/</quote>. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.annotationwriter.parameter.encoding"> | |
<title>Encoding</title> | |
<para> | |
This string parameter specifies the encoding of the resulting file. The default value of this parameter is <quote>UTF-8</quote>. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.annotationwriter.parameter.type"> | |
<title>Type</title> | |
<para> | |
Only the covered texts of annotations of the type specified with this parameter are stored in the resulting file. | |
The default value of this parameter is <quote>uima.tcas.DocumentAnnotation</quote>, which will store the complete document in a new file. | |
</para> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tools.tm.ae.plaintext"> | |
<title>Plain Text Annotator</title> | |
<para> | |
This Analysis Engine adds annotations for lines and paragraphs.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. There are no configuration parameters.
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.modifier"> | |
<title>Modifier</title> | |
<para> | |
The Modifier Analysis Engine can be used to create an additional view <quote>modified</quote>, which contains all textual modifications and HTML highlightings that | |
were specified by the executed rules. Therefore, this Analysis Engine can be applied, e.g., | |
for anonymization where all annotations of persons are replaced by the string <quote>Person</quote>. | |
Furthermore, the content of the new view can optionally be stored in a new HTML file. | |
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. | |
</para> | |
<section id="ugr.tools.tm.ae.modifier.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.ae.modifier.parameter.styleMap"> | |
<title>styleMap</title> | |
<para> | |
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for | |
additional highlightings in the modified view. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.modifier.parameter.descriptorPaths"> | |
<title>descriptorPaths</title> | |
<para> | |
This parameter can contain multiple string values and specifies the absolute paths where the style map file can be found. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.modifier.parameter.outputLocation"> | |
<title>outputLocation</title> | |
<para> | |
This string parameter specifies the absolute path of the resulting file named <quote>output.modified.html</quote>. However, if an annotation of the | |
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative | |
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds | |
the SourceDocumentInformation annotation when the user launches a script file. The default value of this parameter is <quote>/../</quote>. | |
</para> | |
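        <para>
          As a rough sketch, and assuming the standard UIMA descriptor syntax for parameter settings, the three parameters of the Modifier
          could be set as shown below. The style map name and the descriptor path are hypothetical placeholders and are not taken from an
          actual project. Only the value of <quote>outputLocation</quote> reflects the default mentioned above.
        </para>
        <programlisting><![CDATA[<configurationParameterSettings>
  <!-- hypothetical style map name, adapt to the project -->
  <nameValuePair>
    <name>styleMap</name>
    <value>
      <string>example.stylemap</string>
    </value>
  </nameValuePair>
  <!-- multi-valued parameter: hypothetical absolute path -->
  <nameValuePair>
    <name>descriptorPaths</name>
    <value>
      <array>
        <string>/path/to/project/descriptor</string>
      </array>
    </value>
  </nameValuePair>
  <!-- default value as described above -->
  <nameValuePair>
    <name>outputLocation</name>
    <value>
      <string>/../</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>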
</section> | |
</section> | |
</section> | |
<section id="ugr.tools.tm.ae.html"> | |
    <title>HTML Annotator</title>
<para> | |
      This Analysis Engine provides support for HTML files by adding annotations for the HTML elements. Using the default values, the HTML Annotator creates an annotation
      for each HTML element spanning the content of the element, where the most common elements are represented by their own types.
      The document <quote><![CDATA[This text is <b>bold</b>.]]></quote>, for example, would be annotated with an annotation of the type
      <quote>org.apache.uima.textmarker.type.html.B</quote> for the word <quote>bold</quote>. The HTML Annotator can be configured
      to include the start and end element in the created annotations. Additionally, the Analysis Engine is able to strip the HTML elements
      while retaining the HTML annotations. Thereby, an HTML document can be converted to a plain text document that still contains the annotations describing the HTML layout.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. | |
</para> | |
<section id="ugr.tools.tm.ae.html.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.ae.html.parameter.plainTextOutput"> | |
<title>plainTextOutput</title> | |
<para> | |
This parameter specifies whether a new document without the HTML elements should be created. The default value is <quote>false</quote>. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.html.parameter.outputViewName"> | |
<title>outputViewName</title> | |
<para> | |
          This parameter specifies the view in which the optional new document without HTML elements should be stored.
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.html.parameter.onlyContent"> | |
<title>onlyContent</title> | |
<para> | |
This parameter specifies whether created annotations should cover only the content of the HTML elements or also their start and end element. | |
          The default value is <quote>true</quote>.
</para> | |
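        <para>
          The following fragment is a minimal sketch of how the parameters of the HTML Annotator might be set in its descriptor in order to
          obtain a plain text view. It assumes the standard UIMA descriptor syntax for parameter settings. The view name
          <quote>plaintext</quote> is a hypothetical example value and not a documented default.
        </para>
        <programlisting><![CDATA[<configurationParameterSettings>
  <!-- create a new document without the HTML elements -->
  <nameValuePair>
    <name>plainTextOutput</name>
    <value>
      <boolean>true</boolean>
    </value>
  </nameValuePair>
  <!-- hypothetical name of the view that receives the new document -->
  <nameValuePair>
    <name>outputViewName</name>
    <value>
      <string>plaintext</string>
    </value>
  </nameValuePair>
  <!-- annotations cover only the content of the elements (default) -->
  <nameValuePair>
    <name>onlyContent</name>
    <value>
      <boolean>true</boolean>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>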
</section> | |
</section> | |
</section> | |
<section id="ugr.tools.tm.ae.stylemap"> | |
<title>Style Map Creator</title> | |
<para> | |
This Analysis Engine can be utilized to create style map information, which is needed by the Modifier Analysis Engine in order to create | |
highlightings for some annotations. | |
Style map information can be created using the <link linkend='ugr.tools.tm.language.actions.color'>COLOR</link> action. | |
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. | |
</para> | |
<section id="ugr.tools.tm.ae.stylemap.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.ae.stylemap.parameter.styleMap"> | |
<title>styleMap</title> | |
<para> | |
This string parameter specifies the name of the style map file created by the Style Map Creator Analysis Engine, which stores the colors for | |
additional highlightings in the modified view. | |
</para> | |
</section> | |
<section id="ugr.tools.tm.ae.stylemap.parameter.descriptorPaths"> | |
<title>descriptorPaths</title> | |
<para> | |
          This parameter can contain multiple string values and specifies the absolute paths where the style map file can be found.
</para> | |
</section> | |
</section> | |
</section> | |
<section id="ugr.tools.tm.ae.xmi"> | |
<title>XMI Writer</title> | |
<para> | |
      This Analysis Engine is able to serialize the processed CAS to an XMI file. One use case for the XMI Writer is, for example, a rule-based sort,
      which stores the processed XMI files in different folders, depending on the execution of the rules, e.g., whether a pattern of annotations occurs or not.
A descriptor file for this Analysis Engine is located in the folder <quote>descriptor/utils</quote> of a TextMarker project. | |
</para> | |
<section id="ugr.tools.tm.ae.xmi.parameter"> | |
<title>Configuration Parameters</title> | |
<para> | |
</para> | |
<section id="ugr.tools.tm.ae.xmi.parameter.output"> | |
<title>Output</title> | |
<para> | |
This string parameter specifies the absolute path of the resulting file named <quote>output.xmi</quote>. However, if an annotation of the | |
type <quote>org.apache.uima.examples.SourceDocumentInformation</quote> is given, then the value of this parameter is interpreted to be relative | |
to the URI stored in the annotation and the name of the file will be adapted to the name of the source file. The TextMarker IDE automatically adds | |
the SourceDocumentInformation annotation when the user launches a script file. | |
          The default value is <quote>/../output/</quote>.
</para> | |
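        <para>
          As an illustration, and assuming the standard UIMA descriptor syntax for parameter settings, the parameter could be set in the
          descriptor of the XMI Writer as sketched below. The value shown is the default mentioned above.
        </para>
        <programlisting><![CDATA[<configurationParameterSettings>
  <!-- interpreted relative to the source document if a
       SourceDocumentInformation annotation is present -->
  <nameValuePair>
    <name>Output</name>
    <value>
      <string>/../output/</string>
    </value>
  </nameValuePair>
</configurationParameterSettings>]]></programlisting>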
</section> | |
</section> | |
</section> | |
</section> | |
</chapter> |