<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
<!ENTITY imgroot "images/tools/tm/language/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
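the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific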
language governing permissions and limitations under the License. -->
<section id="">
<title>Basic annotations and tokens</title>
The TextMarker system uses a JFlex lexer to create an initial
seed of basic token annotations. These tokens form a hierarchy,
which is shown in <xref linkend='' />. The
<quote>ALL</quote> annotation (green) is the root of the hierarchy. ALL and the
annotation types marked in red are abstract, which means that they are not
created directly by the lexer. An overview of these abstract types can
be found in <xref linkend='' />. The leaves of the hierarchy (blue) are created by the lexer. Each
leaf is a type of its own but also inherits from the abstract
annotation types further up in the hierarchy. The leaf types are
described in more detail in <xref linkend='' />.
Each text unit within an input document belongs to exactly one of these
leaf types.
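The relationship between leaf types and abstract types can be illustrated with a minimal Java sketch. It is not the TextMarker implementation, and it does not use JFlex. Apart from ALL, the type names used here (ANY, W, PM, SENTENCEEND, SW, CW, CAP, NUM, QUESTION, EXCLAMATION, SPECIAL) and their parent links are assumptions inferred from the descriptions in the tables of this section, and the classification rules are deliberately simplified.

```java
public class TokenHierarchySketch {

    /** Hypothetical subset of the token types; names and parents are assumptions. */
    enum TokenType {
        ALL(null, true),               // root of the hierarchy
        ANY(ALL, true),                // all tokens except markup
        W(ANY, true),                  // all kinds of words
        PM(ANY, true),                 // all kinds of punctuation marks
        SENTENCEEND(PM, true),         // sentence-ending punctuation
        SW(W, false),                  // lowercase word
        CW(W, false),                  // word starting with one capital letter
        CAP(W, false),                 // word containing only capital letters
        NUM(ANY, false),               // sequence of digits
        QUESTION(SENTENCEEND, false),
        EXCLAMATION(SENTENCEEND, false),
        SPECIAL(ANY, false);           // all other tokens and symbols

        final TokenType parent;
        final boolean isAbstract;      // abstract types are never assigned by classify()

        TokenType(TokenType parent, boolean isAbstract) {
            this.parent = parent;
            this.isAbstract = isAbstract;
        }
    }

    static boolean allDigits(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isDigit(c)) return false;
        }
        return true;
    }

    static boolean allLower(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLowerCase(c)) return false;
        }
        return true;
    }

    static boolean allUpper(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isUpperCase(c)) return false;
        }
        return true;
    }

    /** Assigns exactly one leaf type to an already separated text unit (simplified rules). */
    static TokenType classify(String token) {
        if (token.isEmpty()) throw new IllegalArgumentException("empty token");
        if (token.equals("?")) return TokenType.QUESTION;
        if (token.equals("!")) return TokenType.EXCLAMATION;
        if (allDigits(token)) return TokenType.NUM;
        if (allLower(token)) return TokenType.SW;
        if (Character.isUpperCase(token.charAt(0))) {
            if (allLower(token.substring(1))) return TokenType.CW;
            if (allUpper(token)) return TokenType.CAP;
        }
        return TokenType.SPECIAL;
    }

    /** True if the given type is the ancestor itself or inherits from it. */
    static boolean isA(TokenType type, TokenType ancestor) {
        for (TokenType t = type; t != null; t = t.parent) {
            if (t == ancestor) return true;
        }
        return false;
    }

    /** Lists a type followed by all of its ancestors up to the root. */
    static String lineage(TokenType type) {
        StringBuilder sb = new StringBuilder(type.name());
        for (TokenType t = type.parent; t != null; t = t.parent) {
            sb.append(" -> ").append(t.name());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Hello", "world", "NASA", "42", "?"};
        for (String token : tokens) {
            System.out.println(token + ": " + lineage(classify(token)));
        }
    }
}
```

In this sketch, a unit such as Hello receives the single leaf type CW, while isA(CW, W) and isA(CW, ALL) both hold, which mirrors how a leaf annotation also counts as an instance of every abstract type above it.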
<figure id="">
<title>Basic token hierarchy
<imageobject role="html">
<imagedata width="576px" format="PNG" align="center"
fileref="&imgroot;basic_token/basic_token.png" />
<imageobject role="fo">
<imagedata width="5.5in" format="PNG" align="center"
fileref="&imgroot;basic_token/basic_token.png" />
Basic token hierarchy.
<table id=""
<title>Abstract annotations</title>
<tgroup cols="3" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1*" />
<colspec colname="c2" colwidth="1*" />
<colspec colname="c3" colwidth="3*" />
<entry align="center">Annotation</entry>
<entry align="center">Parent</entry>
<entry align="center">Description</entry>
<entry>parent type of all tokens</entry>
<entry>all tokens except markup</entry>
<entry>all kinds of words</entry>
<entry>all kinds of punctuation marks</entry>
<entry>all kinds of whitespace</entry>
<entry>all kinds of punctuation marks that indicate the end of a
<table id=""
<title>Annotations created by lexer</title>
<tgroup cols="4" colsep="1" rowsep="1">
<colspec colname="c1" colwidth="1*" />
<colspec colname="c2" colwidth="1*" />
<colspec colname="c3" colwidth="1*" />
<colspec colname="c4" colwidth="1*" />
<entry align="center">Annotation</entry>
<entry align="center">Parent</entry>
<entry align="center">Description</entry>
<entry align="center">Example</entry>
<entry>HTML and XML elements</entry>
<entry><![CDATA[<p class="Headline">]]></entry>
<entry>non-breaking space</entry>
<entry>ampersand expression</entry>
<entry>line break</entry>
<entry><![CDATA[" "]]></entry>
<entry>exclamation mark</entry>
<entry>question mark</entry>
<entry>lowercase word</entry>
<entry>word starting with one capitalized letter</entry>
<entry>word containing only capitalized letters</entry>
<entry>sequence of digits</entry>
<entry>all other tokens and symbols</entry>