ruta-docbook/src/docbook/tools.ruta.howtos.xml - uima-ruta - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
 <!ENTITY imgroot "images/tools/tools.ruta/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
 %uimaents;
 ]>
 <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
 	license agreements. See the NOTICE file distributed with this work for additional
 	information regarding copyright ownership. The ASF licenses this file to
 	you under the Apache License, Version 2.0 (the "License"); you may not use
 	this file except in compliance with the License. You may obtain a copy of
 	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
 	by applicable law or agreed to in writing, software distributed under the
 	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
 	OF ANY KIND, either express or implied. See the License for the specific
 	language governing permissions and limitations under the License. -->
 <chapter id="ugr.tools.ruta.howtos">
 	<title>Apache UIMA Ruta HowTos</title>
 	<para>This chapter contains a selection of some use cases and HowTos
 		for UIMA Ruta.
 	</para>
 	<section id="ugr.tools.ruta.ae.basic.apply">
 		<title>Apply UIMA Ruta Analysis Engine in plain Java</title>
 		<para>
 			Let us assume that the reader wrote the UIMA Ruta rules using
 			the UIMA
 			Ruta Workbench, which already creates correctly configured
 			descriptors.
 			In this case, the following java code can be used to
 			apply the UIMA
 			Ruta script.
 		</para>
 		<programlisting><![CDATA[File specFile = new File("pathToMyWorkspace/MyProject/descriptor/"+
     "my/package/MyScriptEngine.xml");
 XMLInputSource in = new XMLInputSource(specFile);
 ResourceSpecifier specifier = UIMAFramework.getXMLParser().
     parseResourceSpecifier(in);
 // for import by name... set the datapath in the ResourceManager
 AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
 CAS cas = ae.newCAS();
 cas.setDocumentText("This is my document.");
 ae.process(cas);]]></programlisting>
 		<note>
 			<para>
 				The UIMA Ruta Analysis Engine utilizes type priorities. If the
 				CAS
 				object is
 				not created using the UIMA Ruta Analysis Engine
 				descriptor by other
 				means, then please
 				provide the necessary type
 				priorities for a valid execution of the UIMA
 				Ruta rules.
 			</para>
 		</note>
 		<para>
 			If the UIMA Ruta script was written, for example, with a common text
 			editor and no configured descriptors are yet available,
 			then the
 			following java code can be used, which, however, is only
 			applicable
 			for executing single script files that do not import
 			additional
 			components or scripts. In this case the other parameters,
 			e.g.,
 			<quote>additionalScripts</quote>
 			, need to be configured correctly.
 		</para>
 		<programlisting><![CDATA[URL aedesc = RutaEngine.class.getResource("BasicEngine.xml");
 XMLInputSource inae = new XMLInputSource(aedesc);
 ResourceSpecifier specifier = UIMAFramework.getXMLParser().
     parseResourceSpecifier(inae);
 ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
 AnalysisEngineDescription aed = (AnalysisEngineDescription) specifier;
 TypeSystemDescription basicTypeSystem = aed.getAnalysisEngineMetaData().
     getTypeSystem();

 Collection<TypeSystemDescription> tsds =
     new ArrayList<TypeSystemDescription>();
 tsds.add(basicTypeSystem);
 // add some other type system descriptors
 // that are needed by your script file
 TypeSystemDescription mergeTypeSystems = CasCreationUtils.
     mergeTypeSystems(tsds);
 aed.getAnalysisEngineMetaData().setTypeSystem(mergeTypeSystems);
 aed.resolveImports(resMgr);

 AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed,
     resMgr, null);
 File scriptFile = new File("path/to/file/MyScript.ruta");
 ae.setConfigParameterValue(RutaEngine.PARAM_SCRIPT_PATHS,
     new String[] { scriptFile.getParentFile().getAbsolutePath() });
 String name = scriptFile.getName().substring(0,
     scriptFile.getName().length() - 5);
 ae.setConfigParameterValue(RutaEngine.PARAM_MAIN_SCRIPT, name);
 ae.reconfigure();
 CAS cas = ae.newCAS();
 cas.setDocumentText("This is my document.");
 ae.process(cas);]]></programlisting>
 		<para>
 			There is also a convenience implementation for applying simple
 			scripts, which do not introduce new types. The following java code
 			applies a simple rule
 			<quote>T1 SW{-> MARK(T2)};</quote>
 			on the given CAS. Note that the types need to be already defined in
 			the type system of the CAS.
 		</para>
 		<programlisting><![CDATA[Ruta.apply(cas, "T1 SW{-> MARK(T2)};");]]></programlisting>
 	</section>
 	<section id="ugr.tools.ruta.integration">
 		<title>Integrating UIMA Ruta in an existing UIMA Annotator</title>
 		<para>This section provides a walk-through tutorial on integrating
 			Ruta in an existing UIMA annotator. In our artificial example
 			we will
 			use Ruta rules to post-process the output of a
 			Part-of-Speech tagger.
 			The POS tagger is a UIMA annotator that iterates over
 			sentences
 			and
 			tokens and updates the posTag field of each Token with a part of
 			speech. For example, given this text...
 		</para>
 		<programlisting>The quick brown fox receives many bets.
 The fox places many bets.
 The fox gets up early.
 The rabbit made up this story.</programlisting>
 		<para>...it assigns the posTag JJ (adjective) to the token
 			&quot;brown&quot; , the posTag NN (common noun) to the
 			token
 			&quot;fox&quot; and the tag VBZ (verb, 3rd person
 			singular present) to
 			the token &quot;receives&quot; in the first sentence.
 		</para>
 		<para>We have noticed that the tagger sometimens fails to disambiguate
 			NNS (common noun plural) and
 			VBZ tags, as in the second sentence. The
 			word &quot;up&quot; also seems
 			to confuse the tagger,
 			which always
 			assigns it an RB (adverb) tag, even when it is a particle
 			(RP)
 			following a verb, as in
 			the third and fourth sentences:
 		</para>
 		<programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|.
 The|DT fox|NN places|NNS many|JJ bets|NNS .|.
 The|DT fox|NN gets|VBZ up|RB early|RB .|.
 The|DT rabbit|NN made|VBD up|RB this|DT story|NN .|.</programlisting>
 		<para>Let&apos;s imagine that after applying every possible approach
 			available in the POS tagging literature, our tagger still generates
 			these and some other errors. We decide to write a few Ruta
 			rules to
 			post-process the output of the tagger.
 		</para>
 		<section id="ugr.tools.ruta.ae.integration.mvn">
 			<title>Adding Ruta to our Annotator</title>
 			<para>The POS tagger is being developed as a Maven-based project.
 				Since Ruta maven artifacts are available on Maven Central, we
 				add the
 				following dependency to the project&apos;s pom.xml. The
 				functionalities described in this section
 				require a version of Ruta
 				equal to or greater than 2.1.0.
 			</para>
 			<programlisting>&lt;dependency&gt;
   &lt;groupId&gt;org.apache.uima&lt;/groupId&gt;
   &lt;artifactId&gt;ruta-core&lt;/artifactId&gt;
   &lt;version&gt;[2.0.2,)&lt;/version&gt;
 &lt;/dependency&gt;</programlisting>
 			<para>We also take care that the Ruta basic typesystem is loaded when
 				our annotator is initialized. The Ruta typesystem descriptors are
 				available from
 				ruta-core/src/main/resources/org/apache/uima/ruta/engine/
 			</para>
 		</section>
 		<section id="ugr.tools.ruta.ae.integration.loading">
 			<title>Developing Ruta rules and applying them from inside Java code
 			</title>
 			<para>We are now ready to write some rules. The ones we develop for
 				fixing the two errors look like this:
 			</para>
 			<programlisting>Token.posTag =="NN" Token.posTag=="NNS"{-> Token.posTag="VBZ"}
     Token.posTag=="JJ";
 Token{REGEXP(Token.posTag, "VB(.?)")}
     Token.posTag=="RB"{REGEXP("up")-> Token.posTag="RP"};  </programlisting>
 			<para>That
 				is, we change a Token's NNS tag to VBZ, if it is surrounded by a
 				Token tagged as NN and a Token tagged as JJ. We also change an RB
 				tag for an &quot;up&quot; token to RP, if &quot;up&quot; is preceded
 				by any verbal tag (VB, VBZ, etc.) matched with the help of the
 				<link linkend='ugr.tools.ruta.language.conditions.regexp'>REGEXP</link>
 				condition.
 			</para>
 			<para>We test our rules in the Ruta Workbench and see that they
 				indeed fix most of our problems. We save those and some more rules
 				in a text file
 				src/main/resources/ruta.txt.
 			</para>
 			<para>We declare the file with our rules as an external resource and
 				we load it during initialization.
 				Here's a way to do it using
 				uimaFIT:
 			</para>
 			<programlisting>/**
  * Ruta rules for post-processing the tagger's output
  */
 public static final String RUTA_RULES_PARA = "RutaRules";
 ExternalResource(key = RUTA_RULES_PARA, mandatory=false)
 ...
 File rutaRulesF = new File((String)
     aContext.getConfigParameterValue(RUTA_RULES_PARA));
 </programlisting>
 			<para>After our CAS has been populated with posTag annotations from
 				the main algorithm, we post-process the CAS using
 				Ruta.apply():
 			</para>
 			<programlisting>String rutaRules = org.apache.commons.io.FileUtils.readFileToString(
     rutaRulesF, "UTF-8");
 Ruta.apply(cas,  rutaRules);
 </programlisting>
 			<para>We are now happy to see that the final output of our annotator
 				now
 				looks much better:
 			</para>
 			<programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|.
 The|DT fox|NN places|VBZ many|JJ bets|NNS .|.
 The|DT fox|NN gets|VBZ up|RP early|RB .|.
 The|DT rabbit|NN made|VBD up|RP this|DT story|NN .|.</programlisting>
 		</section>
 	</section>


 	<section id="ugr.tools.ruta.maven">
 		<title>UIMA Ruta Maven Plugin</title>
 		<para>UIMA Ruta provides a maven plugin for building analysis engine
 			and type system descriptors for rule scripts.
 			Additionally, this maven plugin is able able to compile word list (gazetteers) to
 			the more efficient structures, tree word list and multi tree word
 			list. The usage and configuration is shortly summarized in the
 			following. An exemplary maven project for UIMA Ruta is given here:
 			<code>https://svn.apache.org/repos/asf/uima/ruta/trunk/example-projects/ruta-maven-example</code>
 		</para>
 		<section>
 		<title>generate goal</title>
 		<para>
 		The generate goal can be utilized to create xml descriptors for the UIMA Ruta script files.
 		Its usage and configuration is summarized in the following example:
 		</para>
 		<programlisting><![CDATA[<plugin>
 <groupId>org.apache.uima</groupId>
 <artifactId>ruta-maven-plugin</artifactId>
 <version>2.3.0</version>
 <configuration>

  <!-- This is a exemplary configuration, which explicitly specifies the
   default configuration values if not mentioned otherwise. -->

  <!--
  The following parameter is optional and should only be specified
  if the structure (e.g., classpath/resources) of the project requires it.

  A FileSet specifying the UIMA Ruta script files that should be built.

  If this parameter is not specified, then all UIMA Ruta script files
  in the output directory (e.g., target/classes) of the project will
  be built.

  default value: none
  <scriptFiles>
    <directory>${basedir}/some/folder</directory>
    <includes>
      <include>*.ruta</include>
    </includes>
  </scriptFiles>
  -->

  <!-- The directory where the generated type system descriptors will
   be written stored. -->
  <!-- default value: ${project.build.directory}/generated-sources/
    ruta/descriptor -->
  <typeSystemOutputDirectory>${project.build.directory}/generated-sources/
    ruta/descriptor</typeSystemOutputDirectory>

  <!-- The directory where the generated analysis engine descriptors will
   be stored. -->
  <!-- default value: ${project.build.directory}/generated-sources/ruta/
    descriptor -->
  <analysisEngineOutputDirectory>${project.build.directory}/
   generated-sources/ruta/descriptor</analysisEngineOutputDirectory>

  <!-- The template descriptor for the generated type system.
   By default the descriptor of the maven dependency is loaded. -->
  <!-- default value: none -->
  <!-- not used in this example <typeSystemTemplate>...
    </typeSystemTemplate> -->

  <!-- The template descriptor for the generated analysis engine.
    By default the descriptor of the maven dependency is loaded. -->
  <!-- default value: none -->
  <!-- not used in this example <analysisEngineTemplate>...
    </analysisEngineTemplate> -->

  <!-- Script paths of the generated analysis engine descriptor. -->
  <!-- default value: none -->
  <scriptPaths>
   <scriptPath>${basedir}/src/main/ruta/</scriptPath>
  </scriptPaths>

  <!-- Descriptor paths of the generated analysis engine descriptor. -->
  <!-- default value: none -->
  <descriptorPaths>
   <descriptorPath>${project.build.directory}/generated-sources/ruta/
    descriptor</descriptorPath>
  </descriptorPaths>

  <!-- Resource paths of the generated analysis engine descriptor. -->
  <!-- default value: none -->
  <resourcePaths>
   <resourcePath>${basedir}/src/main/resources/</resourcePath>
   <resourcePath>${project.build.directory}/generated-sources/ruta/
    resources/</resourcePath>
  </resourcePaths>

  <!-- Suffix used for the generated type system descriptors. -->
  <!-- default value: Engine -->
  <analysisEngineSuffix>Engine</analysisEngineSuffix>

  <!-- Suffix used for the generated analysis engine descriptors. -->
  <!-- default value: TypeSystem -->
  <typeSystemSuffix>TypeSystem</typeSystemSuffix>

  <!-- Source file encoding. -->
  <!-- default value: ${project.build.sourceEncoding} -->
  <encoding>UTF-8</encoding>

  <!-- Type of type system imports. false = import by location. -->
  <!-- default value: false -->
  <importByName>false</importByName>

  <!-- Option to resolve imports while building. -->
  <!-- default value: false -->
  <resolveImports>false</resolveImports>

  <!-- Amount of retries for building dependent descriptors. Default value
   -1 leads to three retires for each script. -->
   <!-- default value: -1 -->
  <maxBuildRetries>-1</maxBuildRetries>

  <!-- List of packages with language extensions -->
  <!-- default value: none -->
  <extensionPackages>
   <extensionPackage>org.apache.uima.ruta</extensionPackage>
  </extensionPackages>

  <!-- Add UIMA Ruta nature to .project -->
  <!-- default value: false -->
  <addRutaNature>true</addRutaNature>


  <!-- Buildpath of the UIMA Ruta Workbench (IDE) for this project -->
  <!-- default value: none -->
  <buildPaths>
   <buildPath>script:src/main/ruta/</buildPath>
   <buildPath>descriptor:target/generated-sources/ruta/descriptor/
   </buildPath>
   <buildPath>resources:src/main/resources/</buildPath>
  </buildPaths>

 </configuration>
 <executions>
  <execution>
   <id>default</id>
   <phase>process-classes</phase>
   <goals>
    <goal>generate</goal>
   </goals>
  </execution>
 </executions>
 </plugin>
 ]]>		</programlisting>
     <para>The configuration parameters for this goal either define the build behavior,
     e.g., where the generated descriptor should be placed or which suffix the files should get,
     or the configuration of the generated analysis engine descriptor, e.g.,
     the values of the configuration parameter scriptPaths.
     However, there are also other parameters: addRutaNature and buildPaths.
     Both can be utilzed to configure the current Eclipse project (due to the missing m2e connector).
     This is required if the functionality of the UIMA Ruta Workbench, e.g., syntax checking or auto-completeion,
      should be available in the maven project. If the parameter addRutaNature is set to true, then
      the UIMA Ruta Workbench will recognize the project as a script project. Only then,
      the buildpath of the UIMA Ruta project can be configured using the buildPaths parameter, which specifies
      the three important source folders of the UIMA Ruta project. In normal UIMA Ruta Workbnech projects,
      these are script, descriptor and resources.
     </para>
 		</section>

 		<section>
 		<title>twl goal</title>
     <para>
     The twl goal can be utilized to create .twl files from .txt files.
     Its usage and configuration is summarized in the following example:
     </para>
     <programlisting><![CDATA[<plugin>
 <groupId>org.apache.uima</groupId>
 <artifactId>ruta-maven-plugin</artifactId>
 <version>2.3.0</version>
 <configuration></configuration>
 <executions>
 <execution>
  <id>default</id>
  <phase>process-classes</phase>
  <goals>
   <goal>twl</goal>
  </goals>
  <configuration>
   <!-- This is a exemplary configuration, which explicitly specifies
    the default configuration values if not mentioned otherwise. -->

   <!-- Compress resulting tree word list. -->
   <!-- default value: true -->
   <compress>true</compress>

   <!-- The source files for the tree word list. -->
   <!-- default value: none -->
   <inputFiles>
    <directory>${basedir}/src/main/resources</directory>
    <includes>
     <include>*.txt</include>
    </includes>
   </inputFiles>

   <!-- The directory where the generated tree word lists will be
     written to.-->
   <!-- default value: ${project.build.directory}/generated-sources/
     ruta/resources/ -->
   <outputDirectory>${project.build.directory}/generated-sources/ruta/
     resources/</outputDirectory>

   <!-- Source file encoding. -->
   <!-- default value: ${project.build.sourceEncoding} -->
   <encoding>UTF-8</encoding>

  </configuration>
 </execution>
 </executions>
 </plugin>
     ]]>   </programlisting>
 		</section>

 		<section>
     <title>mtwl goal</title>
     <para>
     The mtwl goal can be utilized to create a .mtwl file from multiple .txt files.
     Its usage and configuration is summarized in the following example:
     </para>
     <programlisting><![CDATA[<plugin>
 <groupId>org.apache.uima</groupId>
 <artifactId>ruta-maven-plugin</artifactId>
 <version>2.3.0</version>
 <configuration></configuration>
 <executions>
 <execution>
  <id>default</id>
  <phase>process-classes</phase>
  <goals>
   <goal>mtwl</goal>
  </goals>
  <configuration>
   <!-- This is a exemplary configuration, which explicitly specifies
    the default configuration values if not mentioned otherwise. -->

   <!-- Compress resulting tree word list. -->
   <!-- default value: true -->
   <compress>true</compress>

   <!-- The source files for the multi tree word list. -->
   <!-- default value: none -->
   <inputFiles>
    <directory>${basedir}/src/main/resources</directory>
    <includes>
     <include>*.txt</include>
    </includes>
   </inputFiles>

   <!-- The directory where the generated tree word list will be
     written to. -->
   <!-- default value: ${project.build.directory}/generated-sources/ruta/
     resources/generated.mtwl -->
   <outputFile>${project.build.directory}/generated-sources/ruta/resources/
     generated.mtwl</outputFile>

   <!-- Source file encoding. -->
   <!-- default value: ${project.build.sourceEncoding} -->
   <encoding>UTF-8</encoding>

  </configuration>
 </execution>
 </executions>
 </plugin>
     ]]>   </programlisting>
     </section>

 	</section>

   <section id="ugr.tools.ruta.archetype">
     <title>UIMA Ruta Maven Archetype</title>
     <para>
       UIMA Ruta provides a maven archetype for creating maven projects that preconfigured for
       building UIMA Ruta scripts with maven and contain already a minimal example with a unit test,
       which can be utilized as a starting point.
     </para>


     <para>
       A UIMA Ruta project are created with following command using the the archetype (in one line):
     </para>

     <programlisting><![CDATA[mvn archetype:generate
     -DarchetypeGroupId=org.apache.uima
     -DarchetypeArtifactId=ruta-maven-archetype
     -DarchetypeVersion=<ruta-version>
     -DgroupId=<package>
     -DartifactId=<project-name>]]></programlisting>

     <para>
       The placeholders need to be replaced with the corresponding values. This could look like:
     </para>

     <programlisting><![CDATA[mvn archetype:generate -DarchetypeGroupId=org.apache.uima
     -DarchetypeArtifactId=ruta-maven-archetype -DarchetypeVersion=2.6.0
     -DgroupId=my.domain -DartifactId=my-ruta-project]]></programlisting>

     <para>
       Using the archetype in Eclipse to create a project may result in some missing replacements
       of variables and thus to broken projects. Using the archetype on command line is recommended.
     </para>

     <para>
       In the creation process, several properties need to be defined.
       Their default values can be accepted by simply pressing the return key.
       After the project was created successfully, switch to the new folder and enter 'mvn install'.
       Now, the UIMA Ruta project is built: the descriptors for the UIMA Ruta script are created,
       the wordlist is compiled to a MTWL file, and the unit test verifies the overall functionality.
     </para>

   </section>

   <section id="section.tools.ruta.workbench.textruler.example">
    <title>Induce rules with the TextRuler framework</title>
       <para>
       This section gives a short example how the TextRuler framework is applied in order to induce annotation rules. We refer to the screenshot in <xref linkend="figure.tools.ruta.workbench.textruler.main"/>
       for the configuration and are using the exemplary UIMA Ruta project <quote>TextRulerExample</quote>, which is part of the source release of UIMA Ruta.
       After importing the project into your workspace, please rebuild all UIMA Ruta scripts in order to create the descriptors, e.g., by cleaning the project.
       </para>
       <para>
         In this example, we are using the <quote>KEP</quote> algorithm for learning annotation rules for identifying Bibtex entries in the reference section of scientific publications:
         <orderedlist>
         <listitem>
           <para>Select the folder <quote>single</quote> and drag and drop it to the <quote>Training Data</quote> text field. This folder contains one file with
           correct annotations and serves as gold standard data in our example.</para>
         </listitem>
         <listitem>
           <para>Select the file <quote>Feature.ruta</quote> and drag and drop it to the <quote>Preprocess Script</quote> text field. This UIMA Ruta script knows all necessary types, especially the types
           of the annotations we try the learn rules for, and additionally it contains rules that create useful annotations, which can be used by the algorithm in order to learn better rules.</para>
         </listitem>
         <listitem>
           <para>Select the file <quote>InfoTypes.txt</quote> and drag and drop it to the <quote>Information Types</quote> list. This specifies the goal of the learning process,
           which types of annotations should be annotated by the induced rules, respectively.</para>
         </listitem>
         <listitem>
           <para>Check the checkbox of the <quote>KEP</quote> algorithm and press the start button in the toolbar fo the view.</para>
         </listitem>
         <listitem>
           <para>The algorithm now tries to induce rules for the targeted types. The current result is displayed in the view <quote>KEP Results</quote> in the right part of the perspective.</para>
         </listitem>
         <listitem>
           <para>After the algorithms finished the learning process, create a new UIMA Ruta file in the <quote>uima.ruta.example</quote> package and copy the content of the result view
           to the new file. Now, the induced rules can be applied as a normal UIMA Ruta script file.</para>
         </listitem>
       </orderedlist>
       </para>
     </section>
     <section id="section.tools.ruta.howto.html">
      <title>HTML annotations in plain text</title>
       <para>
        The following script provides an example how to process HTML files with UIMA Ruta in order to get plain text documents
        that still contain information about the HTML tags in form of annotations. The analysis engine descriptor HtmlViewWriter is identical to the common ViewWriter,
        but additionally specifies a type system. More information about different options to configure the
        conversion can be found in <link linkend='ugr.tools.ruta.ae.htmlconverter'>here</link>.
       </para>
           <programlisting><![CDATA[PACKAGE uima.ruta.example;

 ENGINE utils.HtmlAnnotator;
 ENGINE utils.HtmlConverter;
 ENGINE HtmlViewWriter;
 TYPESYSTEM utils.HtmlTypeSystem;
 TYPESYSTEM utils.SourceDocumentInformation;

 Document{-> RETAINTYPE(SPACE,BREAK)};
 Document{-> EXEC(HtmlAnnotator)};

 Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView",
     "outputView" = "plain"),
       EXEC(HtmlConverter)};

 Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",
     "outputView" = "_InitialView", "output" = "/../converted/"),
     EXEC(HtmlViewWriter)};
     ]]>   </programlisting>
     </section>

     <section id="section.tools.ruta.howto.sorter">
      <title>Sorting files with UIMA Ruta</title>
       <para>
        The following script provides an example how to utilize UIMA Ruta for sorting files.
       </para>
           <programlisting><![CDATA[ENGINE utils.XMIWriter;
 TYPESYSTEM utils.SourceDocumentInformation;

 DECLARE Pattern;

 // some rule-based pattern
 (NUM SPECIAL NUM SPECIAL NUM){-> Pattern};

 Document{CONTAINS(Pattern)->CONFIGURE(XMIWriter,
   "Output" = "../with/"), EXEC(XMIWriter)};
 Document{-CONTAINS(Pattern)->CONFIGURE(XMIWriter,
   "Output" = "../without/"), EXEC(XMIWriter)};
     ]]>   </programlisting>
     </section>
     <section id="section.tools.ruta.howto.xml">
      <title>Converting XML documents with UIMA Ruta</title>
       <para>
        The following script provides an example how to process XML files in order to retain only the text content. the removed XML elements should, howver, be available as annotations.
        This script can therefore be applied to create xmiCAS files from text document annotated with XML tags. The analysis engine descriptor TEIViewWriter is identical to the common ViewWriter,
        but additionally specifies a type system.
       </para>
           <programlisting><![CDATA[ENGINE utils.HtmlAnnotator;
 TYPESYSTEM utils.HtmlTypeSystem;
 ENGINE utils.HtmlConverter;
 ENGINE TEIViewWriter;
 TYPESYSTEM utils.SourceDocumentInformation;

 DECLARE PersName, LastName, FirstName, AddName;

 Document{->EXEC(HtmlAnnotator, {TAG})};
 Document{-> RETAINTYPE(MARKUP,SPACE)};
 TAG.name=="PERSNAME"{-> PersName};
 TAG.name=="SURNAME"{-> LastName};
 TAG.name=="FORENAME"{-> FirstName};
 TAG.name=="ADDNAME"{-> AddName};
 Document{-> RETAINTYPE};

 Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView",
     "outputView" = "plain", "skipWhitespaces" = false),
       EXEC(HtmlConverter)};

 Document{ -> CONFIGURE(TEIViewWriter, "inputView" = "plain", "outputView" =
     "_InitialView", "output" = "/../converted/"),
     EXEC(TEIViewWriter)};
     ]]>   </programlisting>
     </section>
 </chapter>