trunk/ruta-docbook/src/docbook/tools.ruta.language.basic_annotations.xml - uima-ruta - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 <!ENTITY imgroot "images/tools/ruta/language/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
   license agreements. See the NOTICE file distributed with this work for additional
   information regarding copyright ownership. The ASF licenses this file to
   you under the Apache License, Version 2.0 (the "License"); you may not use
   this file except in compliance with the License. You may obtain a copy of
   the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
   by applicable law or agreed to in writing, software distributed under the
   License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
   OF ANY KIND, either express or implied. See the License for the specific
   language governing permissions and limitations under the License. -->

 <section id="ugr.tools.ruta.language.seeding">
   <title>Basic annotations and tokens</title>
   <para>
     The UIMA Ruta system uses a JFlex lexer to initially create a
     seed of basic token annotations. These tokens build a hierarchy shown in <xref linkend='figure.ugr.tools.ruta.language.seeding.basic_token' />. The
     <quote>ALL</quote> (green) annotation is the root of the hierarchy. ALL and the red
     marked annotation types are abstract. This means that they are actually not
     created by the lexer. An overview of these abstract types can
     be found in <xref linkend='table.ugr.tools.ruta.language.seeding.basic_token.abstract' />. The leafs of the hierarchy (blue) are created by the lexer. Each
     leaf is an own type, but also inherits the types of the abstract
     annotation types further up in the hierarchy. The leaf types are
     described in more detail in <xref linkend='table.ugr.tools.ruta.language.seeding.basic_token.created' />.
     Each text unit within an input document belongs to exactly one of these
     annotation types.
   </para>
   <para>
     <figure id="figure.ugr.tools.ruta.language.seeding.basic_token">
       <title>Basic token hierarchy
       </title>
       <mediaobject>
         <imageobject role="html">
           <imagedata width="576px" format="PNG" align="center"
             fileref="&imgroot;basic_token/basic_token.png" />
         </imageobject>
         <imageobject role="fo">
           <imagedata width="5.5in" format="PNG" align="center"
             fileref="&imgroot;basic_token/basic_token.png" />
         </imageobject>
         <textobject>
           <phrase>
             Basic token hierarchy.
           </phrase>
         </textobject>
       </mediaobject>
     </figure>
   </para>
   <para>
     <table id="table.ugr.tools.ruta.language.seeding.basic_token.abstract"
       frame="all">
       <title>Abstract annotations</title>
       <tgroup cols="3" colsep="1" rowsep="1">
         <colspec colname="c1" colwidth="1*" />
         <colspec colname="c2" colwidth="1*" />
         <colspec colname="c3" colwidth="3*" />
         <thead>
           <row>
             <entry align="center">Annotation</entry>
             <entry align="center">Parent</entry>
             <entry align="center">Description</entry>
           </row>
         </thead>
         <tbody>
           <row>
             <entry>ALL</entry>
             <entry>-</entry>
             <entry>parent type of all tokens</entry>
           </row>
           <row>
             <entry>ANY</entry>
             <entry>ALL</entry>
             <entry>all tokens except for markup</entry>
           </row>
           <row>
             <entry>W</entry>
             <entry>ANY</entry>
             <entry>all kinds of words</entry>
           </row>
           <row>
             <entry>PM</entry>
             <entry>ANY</entry>
             <entry>all kinds of punctuation marks</entry>
           </row>
           <row>
             <entry>WS</entry>
             <entry>ANY</entry>
             <entry>all kinds of white spaces</entry>
           </row>
           <row>
             <entry>SENTENCEEND</entry>
             <entry>PM</entry>
             <entry>all kinds of punctuation marks that indicate the end of a
               sentence
             </entry>
           </row>
         </tbody>
       </tgroup>
     </table>
   </para>
   <para>
     <table id="table.ugr.tools.ruta.language.seeding.basic_token.created"
       frame="all">
       <title>Annotations created by lexer</title>
       <tgroup cols="4" colsep="1" rowsep="1">
         <colspec colname="c1" colwidth="1*" />
         <colspec colname="c2" colwidth="1*" />
         <colspec colname="c3" colwidth="1*" />
         <colspec colname="c4" colwidth="1*" />

         <thead>
           <row>
             <entry align="center">Annotation</entry>
             <entry align="center">Parent</entry>
             <entry align="center">Description</entry>
             <entry align="center">Example</entry>
           </row>
         </thead>
         <tbody>
           <row>
             <entry>MARKUP</entry>
             <entry>ALL</entry>
             <entry>HTML and XML elements</entry>
             <entry><![CDATA[<p class="Headline">]]></entry>
           </row>
           <row>
             <entry>NBSP</entry>
             <entry>ANY</entry>
             <entry>non breaking space</entry>
             <entry><![CDATA[" "]]></entry>
           </row>
           <row>
             <entry>AMP</entry>
             <entry>ANY</entry>
             <entry>ampersand expression</entry>
             <entry><![CDATA[&amp;]]></entry>
           </row>
           <row>
             <entry>BREAK</entry>
             <entry>WS</entry>
             <entry>line break</entry>
             <entry><![CDATA[\n]]></entry>
           </row>
           <row>
             <entry>SPACE</entry>
             <entry>WS</entry>
             <entry>spaces</entry>
             <entry><![CDATA[" "]]></entry>
           </row>
           <row>
             <entry>COLON</entry>
             <entry>PM</entry>
             <entry>colon</entry>
             <entry><![CDATA[:]]></entry>
           </row>
           <row>
             <entry>COMMA</entry>
             <entry>PM</entry>
             <entry>comma</entry>
             <entry><![CDATA[,]]></entry>
           </row>
           <row>
             <entry>PERIOD</entry>
             <entry>SENTENCEEND</entry>
             <entry>period</entry>
             <entry><![CDATA[.]]></entry>
           </row>
           <row>
             <entry>EXCLAMATION</entry>
             <entry>SENTENCEEND</entry>
             <entry>exclamation mark</entry>
             <entry><![CDATA[!]]></entry>
           </row>
           <row>
             <entry>SEMICOLON</entry>
             <entry>PM</entry>
             <entry>semicolon</entry>
             <entry><![CDATA[;]]></entry>
           </row>
           <row>
             <entry>QUESTION</entry>
             <entry>SENTENCEEND</entry>
             <entry>question mark</entry>
             <entry><![CDATA[?]]></entry>
           </row>
           <row>
             <entry>SW</entry>
             <entry>W</entry>
             <entry>lower case work</entry>
             <entry><![CDATA[annotation]]></entry>
           </row>
           <row>
             <entry>CW</entry>
             <entry>W</entry>
             <entry>work starting with one capitalized letter</entry>
             <entry><![CDATA[Annotation]]></entry>
           </row>
           <row>
             <entry>CAP</entry>
             <entry>W</entry>
             <entry>word only containing capitalized letters</entry>
             <entry><![CDATA[ANNOTATION]]></entry>
           </row>
           <row>
             <entry>NUM</entry>
             <entry>ANY</entry>
             <entry>sequence of digits</entry>
             <entry><![CDATA[0123]]></entry>
           </row>
           <row>
             <entry>SPECIAL</entry>
             <entry>ANY</entry>
             <entry>all other tokens and symbols</entry>
             <entry><![CDATA[/]]></entry>
           </row>
         </tbody>
       </tgroup>
     </table>
   </para>
 </section>
	<?xml version="1.0" encoding="UTF-8"?>
	<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
	"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
	<!ENTITY imgroot "images/tools/ruta/language/" >
	<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
	%uimaents;
	]>
	<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
	license agreements. See the NOTICE file distributed with this work for additional
	information regarding copyright ownership. The ASF licenses this file to
	you under the Apache License, Version 2.0 (the "License"); you may not use
	this file except in compliance with the License. You may obtain a copy of
	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
	by applicable law or agreed to in writing, software distributed under the
	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
	OF ANY KIND, either express or implied. See the License for the specific
	language governing permissions and limitations under the License. -->

	<section id="ugr.tools.ruta.language.seeding">
	<title>Basic annotations and tokens</title>
	<para>
	The UIMA Ruta system uses a JFlex lexer to initially create a
	seed of basic token annotations. These tokens build a hierarchy shown in <xref linkend='figure.ugr.tools.ruta.language.seeding.basic_token' />. The
	<quote>ALL</quote> (green) annotation is the root of the hierarchy. ALL and the red
	marked annotation types are abstract. This means that they are actually not
	created by the lexer. An overview of these abstract types can
	be found in <xref linkend='table.ugr.tools.ruta.language.seeding.basic_token.abstract' />. The leafs of the hierarchy (blue) are created by the lexer. Each
	leaf is an own type, but also inherits the types of the abstract
	annotation types further up in the hierarchy. The leaf types are
	described in more detail in <xref linkend='table.ugr.tools.ruta.language.seeding.basic_token.created' />.
	Each text unit within an input document belongs to exactly one of these
	annotation types.
	</para>
	<para>
	<figure id="figure.ugr.tools.ruta.language.seeding.basic_token">
	<title>Basic token hierarchy
	</title>
	<mediaobject>
	<imageobject role="html">
	<imagedata width="576px" format="PNG" align="center"
	fileref="&imgroot;basic_token/basic_token.png" />
	</imageobject>
	<imageobject role="fo">
	<imagedata width="5.5in" format="PNG" align="center"
	fileref="&imgroot;basic_token/basic_token.png" />
	</imageobject>
	<textobject>
	<phrase>
	Basic token hierarchy.
	</phrase>
	</textobject>
	</mediaobject>
	</figure>
	</para>
	<para>
	<table id="table.ugr.tools.ruta.language.seeding.basic_token.abstract"
	frame="all">
	<title>Abstract annotations</title>
	<tgroup cols="3" colsep="1" rowsep="1">
	<colspec colname="c1" colwidth="1*" />
	<colspec colname="c2" colwidth="1*" />
	<colspec colname="c3" colwidth="3*" />
	<thead>
	<row>
	<entry align="center">Annotation</entry>
	<entry align="center">Parent</entry>
	<entry align="center">Description</entry>
	</row>
	</thead>
	<tbody>
	<row>
	<entry>ALL</entry>
	<entry>-</entry>
	<entry>parent type of all tokens</entry>
	</row>
	<row>
	<entry>ANY</entry>
	<entry>ALL</entry>
	<entry>all tokens except for markup</entry>
	</row>
	<row>
	<entry>W</entry>
	<entry>ANY</entry>
	<entry>all kinds of words</entry>
	</row>
	<row>
	<entry>PM</entry>
	<entry>ANY</entry>
	<entry>all kinds of punctuation marks</entry>
	</row>
	<row>
	<entry>WS</entry>
	<entry>ANY</entry>
	<entry>all kinds of white spaces</entry>
	</row>
	<row>
	<entry>SENTENCEEND</entry>
	<entry>PM</entry>
	<entry>all kinds of punctuation marks that indicate the end of a
	sentence
	</entry>
	</row>
	</tbody>
	</tgroup>
	</table>
	</para>
	<para>
	<table id="table.ugr.tools.ruta.language.seeding.basic_token.created"
	frame="all">
	<title>Annotations created by lexer</title>
	<tgroup cols="4" colsep="1" rowsep="1">
	<colspec colname="c1" colwidth="1*" />
	<colspec colname="c2" colwidth="1*" />
	<colspec colname="c3" colwidth="1*" />
	<colspec colname="c4" colwidth="1*" />

	<thead>
	<row>
	<entry align="center">Annotation</entry>
	<entry align="center">Parent</entry>
	<entry align="center">Description</entry>
	<entry align="center">Example</entry>
	</row>
	</thead>
	<tbody>
	<row>
	<entry>MARKUP</entry>
	<entry>ALL</entry>
	<entry>HTML and XML elements</entry>
	<entry><![CDATA[<p class="Headline">]]></entry>
	</row>
	<row>
	<entry>NBSP</entry>
	<entry>ANY</entry>
	<entry>non breaking space</entry>
	<entry><![CDATA[" "]]></entry>
	</row>
	<row>
	<entry>AMP</entry>
	<entry>ANY</entry>
	<entry>ampersand expression</entry>
	<entry><![CDATA[&]]></entry>
	</row>
	<row>
	<entry>BREAK</entry>
	<entry>WS</entry>
	<entry>line break</entry>
	<entry><![CDATA[\n]]></entry>
	</row>
	<row>
	<entry>SPACE</entry>
	<entry>WS</entry>
	<entry>spaces</entry>
	<entry><![CDATA[" "]]></entry>
	</row>
	<row>
	<entry>COLON</entry>
	<entry>PM</entry>
	<entry>colon</entry>
	<entry><![CDATA[:]]></entry>
	</row>
	<row>
	<entry>COMMA</entry>
	<entry>PM</entry>
	<entry>comma</entry>
	<entry><![CDATA[,]]></entry>
	</row>
	<row>
	<entry>PERIOD</entry>
	<entry>SENTENCEEND</entry>
	<entry>period</entry>
	<entry><![CDATA[.]]></entry>
	</row>
	<row>
	<entry>EXCLAMATION</entry>
	<entry>SENTENCEEND</entry>
	<entry>exclamation mark</entry>
	<entry><![CDATA[!]]></entry>
	</row>
	<row>
	<entry>SEMICOLON</entry>
	<entry>PM</entry>
	<entry>semicolon</entry>
	<entry><![CDATA[;]]></entry>
	</row>
	<row>
	<entry>QUESTION</entry>
	<entry>SENTENCEEND</entry>
	<entry>question mark</entry>
	<entry><![CDATA[?]]></entry>
	</row>
	<row>
	<entry>SW</entry>
	<entry>W</entry>
	<entry>lower case work</entry>
	<entry><![CDATA[annotation]]></entry>
	</row>
	<row>
	<entry>CW</entry>
	<entry>W</entry>
	<entry>work starting with one capitalized letter</entry>
	<entry><![CDATA[Annotation]]></entry>
	</row>
	<row>
	<entry>CAP</entry>
	<entry>W</entry>
	<entry>word only containing capitalized letters</entry>
	<entry><![CDATA[ANNOTATION]]></entry>
	</row>
	<row>
	<entry>NUM</entry>
	<entry>ANY</entry>
	<entry>sequence of digits</entry>
	<entry><![CDATA[0123]]></entry>
	</row>
	<row>
	<entry>SPECIAL</entry>
	<entry>ANY</entry>
	<entry>all other tokens and symbols</entry>
	<entry><![CDATA[/]]></entry>
	</row>
	</tbody>
	</tgroup>
	</table>
	</para>
	</section>