<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
<!ENTITY imgroot "images/tools/tools.ruta/" >
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
%uimaents;
]>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
	license agreements. See the NOTICE file distributed with this work for additional 
	information regarding copyright ownership. The ASF licenses this file to 
	you under the Apache License, Version 2.0 (the "License"); you may not use 
	this file except in compliance with the License. You may obtain a copy of 
	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
	by applicable law or agreed to in writing, software distributed under the 
	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
	OF ANY KIND, either express or implied. See the License for the specific 
	language governing permissions and limitations under the License. -->
<chapter id="ugr.tools.ruta.howtos">
	<title>Apache UIMA Ruta HowTos</title>
	<para>This chapter contains a selection of some use cases and HowTos
		for UIMA Ruta.
	</para>
	<section id="ugr.tools.ruta.ae.basic.apply">
		<title>Apply UIMA Ruta Analysis Engine in plain Java</title>
		<para>
			Let us assume that the reader wrote the UIMA Ruta rules using
			the UIMA
			Ruta Workbench, which already creates correctly configured
			descriptors.
			In this case, the following java code can be used to
			apply the UIMA
			Ruta script.
		</para>
		<programlisting><![CDATA[File specFile = new File("pathToMyWorkspace/MyProject/descriptor/"+
    "my/package/MyScriptEngine.xml");
XMLInputSource in = new XMLInputSource(specFile);
ResourceSpecifier specifier = UIMAFramework.getXMLParser().
    parseResourceSpecifier(in);
// for import by name... set the datapath in the ResourceManager
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier);
CAS cas = ae.newCAS();
cas.setDocumentText("This is my document.");
ae.process(cas);]]></programlisting>
		<note>
			<para>
				The UIMA Ruta Analysis Engine utilizes type priorities. If the
				CAS
				object is
				not created using the UIMA Ruta Analysis Engine
				descriptor by other
				means, then please
				provide the necessary type
				priorities for a valid execution of the UIMA
				Ruta rules.
			</para>
		</note>
		<para>
			If the UIMA Ruta script was written, for example, with a common text
			editor and no configured descriptors are yet available,
			then the
			following java code can be used, which, however, is only
			applicable
			for executing single script files that do not import
			additional
			components or scripts. In this case the other parameters,
			e.g.,
			<quote>additionalScripts</quote>
			, need to be configured correctly.
		</para>
		<programlisting><![CDATA[URL aedesc = RutaEngine.class.getResource("BasicEngine.xml");
XMLInputSource inae = new XMLInputSource(aedesc);
ResourceSpecifier specifier = UIMAFramework.getXMLParser().
    parseResourceSpecifier(inae);
ResourceManager resMgr = UIMAFramework.newDefaultResourceManager();
AnalysisEngineDescription aed = (AnalysisEngineDescription) specifier;
TypeSystemDescription basicTypeSystem = aed.getAnalysisEngineMetaData().
    getTypeSystem();

Collection<TypeSystemDescription> tsds = 
    new ArrayList<TypeSystemDescription>();
tsds.add(basicTypeSystem);
// add some other type system descriptors 
// that are needed by your script file   
TypeSystemDescription mergeTypeSystems = CasCreationUtils.
    mergeTypeSystems(tsds);
aed.getAnalysisEngineMetaData().setTypeSystem(mergeTypeSystems);
aed.resolveImports(resMgr);
        
AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed, 
    resMgr, null);
File scriptFile = new File("path/to/file/MyScript.ruta");
ae.setConfigParameterValue(RutaEngine.PARAM_SCRIPT_PATHS, 
    new String[] { scriptFile.getParentFile().getAbsolutePath() });
String name = scriptFile.getName().substring(0, 
    scriptFile.getName().length() - 5);
ae.setConfigParameterValue(RutaEngine.PARAM_MAIN_SCRIPT, name);
ae.reconfigure();
CAS cas = ae.newCAS();
cas.setDocumentText("This is my document.");
ae.process(cas);]]></programlisting>
		<para>
			There is also a convenience implementation for applying simple
			scripts, which do not introduce new types. The following java code
			applies a simple rule
			<quote>T1 SW{-> MARK(T2)};</quote>
			on the given CAS. Note that the types need to be already defined in
			the type system of the CAS.
		</para>
		<programlisting><![CDATA[Ruta.apply(cas, "T1 SW{-> MARK(T2)};");]]></programlisting>
	</section>
	<section id="ugr.tools.ruta.integration">
		<title>Integrating UIMA Ruta in an existing UIMA Annotator</title>
		<para>This section provides a walk-through tutorial on integrating
			Ruta in an existing UIMA annotator. In our artificial example
			we will
			use Ruta rules to post-process the output of a
			Part-of-Speech tagger.
			The POS tagger is a UIMA annotator that iterates over
			sentences
			and
			tokens and updates the posTag field of each Token with a part of
			speech. For example, given this text...
		</para>
		<programlisting>The quick brown fox receives many bets.
The fox places many bets.
The fox gets up early.
The rabbit made up this story.</programlisting>
		<para>...it assigns the posTag JJ (adjective) to the token
			&quot;brown&quot; , the posTag NN (common noun) to the
			token
			&quot;fox&quot; and the tag VBZ (verb, 3rd person
			singular present) to
			the token &quot;receives&quot; in the first sentence.
		</para>
		<para>We have noticed that the tagger sometimens fails to disambiguate
			NNS (common noun plural) and
			VBZ tags, as in the second sentence. The
			word &quot;up&quot; also seems
			to confuse the tagger,
			which always
			assigns it an RB (adverb) tag, even when it is a particle
			(RP)
			following a verb, as in
			the third and fourth sentences:
		</para>
		<programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|.
The|DT fox|NN places|NNS many|JJ bets|NNS .|.
The|DT fox|NN gets|VBZ up|RB early|RB .|.
The|DT rabbit|NN made|VBD up|RB this|DT story|NN .|.</programlisting>
		<para>Let&apos;s imagine that after applying every possible approach
			available in the POS tagging literature, our tagger still generates
			these and some other errors. We decide to write a few Ruta
			rules to
			post-process the output of the tagger.
		</para>
		<section id="ugr.tools.ruta.ae.integration.mvn">
			<title>Adding Ruta to our Annotator</title>
			<para>The POS tagger is being developed as a Maven-based project.
				Since Ruta maven artifacts are available on Maven Central, we
				add the
				following dependency to the project&apos;s pom.xml. The
				functionalities described in this section
				require a version of Ruta
				equal to or greater than 2.1.0.
			</para>
			<programlisting>&lt;dependency&gt;
  &lt;groupId&gt;org.apache.uima&lt;/groupId&gt;
  &lt;artifactId&gt;ruta-core&lt;/artifactId&gt;
  &lt;version&gt;[2.0.2,)&lt;/version&gt;
&lt;/dependency&gt;</programlisting>
			<para>We also take care that the Ruta basic typesystem is loaded when
				our annotator is initialized. The Ruta typesystem descriptors are
				available from
				ruta-core/src/main/resources/org/apache/uima/ruta/engine/
			</para>
		</section>
		<section id="ugr.tools.ruta.ae.integration.loading">
			<title>Developing Ruta rules and applying them from inside Java code
			</title>
			<para>We are now ready to write some rules. The ones we develop for
				fixing the two errors look like this:
			</para>
			<programlisting>Token.posTag =="NN" Token.posTag=="NNS"{-> Token.posTag="VBZ"}
    Token.posTag=="JJ";
Token{REGEXP(Token.posTag, "VB(.?)")} 
    Token.posTag=="RB"{REGEXP("up")-> Token.posTag="RP"};  </programlisting>
			<para>That
				is, we change a Token's NNS tag to VBZ, if it is surrounded by a
				Token tagged as NN and a Token tagged as JJ. We also change an RB
				tag for an &quot;up&quot; token to RP, if &quot;up&quot; is preceded
				by any verbal tag (VB, VBZ, etc.) matched with the help of the
				<link linkend='ugr.tools.ruta.language.conditions.regexp'>REGEXP</link>
				condition.
			</para>
			<para>We test our rules in the Ruta Workbench and see that they
				indeed fix most of our problems. We save those and some more rules
				in a text file
				src/main/resources/ruta.txt.
			</para>
			<para>We declare the file with our rules as an external resource and
				we load it during initialization.
				Here's a way to do it using
				uimaFIT:
			</para>
			<programlisting>/**
 * Ruta rules for post-processing the tagger's output
 */
public static final String RUTA_RULES_PARA = "RutaRules";
ExternalResource(key = RUTA_RULES_PARA, mandatory=false)
...
File rutaRulesF = new File((String) 
    aContext.getConfigParameterValue(RUTA_RULES_PARA));
</programlisting>
			<para>After our CAS has been populated with posTag annotations from
				the main algorithm, we post-process the CAS using
				Ruta.apply():
			</para>
			<programlisting>String rutaRules = org.apache.commons.io.FileUtils.readFileToString(
    rutaRulesF, "UTF-8");
Ruta.apply(cas,  rutaRules);    
</programlisting>
			<para>We are now happy to see that the final output of our annotator
				now
				looks much better:
			</para>
			<programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|.
The|DT fox|NN places|VBZ many|JJ bets|NNS .|.
The|DT fox|NN gets|VBZ up|RP early|RB .|.
The|DT rabbit|NN made|VBD up|RP this|DT story|NN .|.</programlisting>
		</section>
	</section>


	<section id="ugr.tools.ruta.maven">
		<title>UIMA Ruta Maven Plugin</title>
		<para>UIMA Ruta provides a maven plugin for building analysis engine
			and type system descriptors for rule scripts.
			Additionally, this maven plugin is able able to compile word list (gazetteers) to
			the more efficient structures, tree word list and multi tree word
			list. The usage and configuration is shortly summarized in the
			following. An exemplary maven project for UIMA Ruta is given here:
			<code>https://github.com/apache/uima-ruta/tree/master/example-projects/ruta-maven-example</code>
		</para>
		<section>
		<title>generate goal</title>
		<para>
		The generate goal can be utilized to create xml descriptors for the UIMA Ruta script files. 
		Its usage and configuration is summarized in the following example:
		</para>
		<programlisting><![CDATA[<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>3.0.1</version>
<configuration>

 <!-- This is a exemplary configuration, which explicitly specifies the 
  default configuration values if not mentioned otherwise. -->

 <!--
 The following parameter is optional and should only be specified
 if the structure (e.g., classpath/resources) of the project requires it.
 
 A FileSet specifying the UIMA Ruta script files that should be built. 
 
 If this parameter is not specified, then all UIMA Ruta script files 
 in the output directory (e.g., target/classes) of the project will 
 be built. 
 
 default value: none
 <scriptFiles>
   <directory>${basedir}/some/folder</directory>
   <includes>
     <include>*.ruta</include>
   </includes>
 </scriptFiles>
 -->

 <!-- The directory where the generated type system descriptors will 
  be written stored. -->
 <!-- default value: ${project.build.directory}/generated-sources/
   ruta/descriptor -->
 <typeSystemOutputDirectory>${project.build.directory}/generated-sources/
   ruta/descriptor</typeSystemOutputDirectory>

 <!-- The directory where the generated analysis engine descriptors will 
  be stored. -->
 <!-- default value: ${project.build.directory}/generated-sources/ruta/
   descriptor -->
 <analysisEngineOutputDirectory>${project.build.directory}/
  generated-sources/ruta/descriptor</analysisEngineOutputDirectory>

 <!-- The template descriptor for the generated type system. 
  By default the descriptor of the maven dependency is loaded. -->
 <!-- default value: none -->
 <!-- not used in this example <typeSystemTemplate>...
   </typeSystemTemplate> -->

 <!-- The template descriptor for the generated analysis engine. 
   By default the descriptor of the maven dependency is loaded. -->
 <!-- default value: none -->
 <!-- not used in this example <analysisEngineTemplate>...
   </analysisEngineTemplate> -->

 <!-- Script paths of the generated analysis engine descriptor. -->
 <!-- default value: none -->
 <scriptPaths>
  <scriptPath>${basedir}/src/main/ruta/</scriptPath>
 </scriptPaths>

 <!-- Descriptor paths of the generated analysis engine descriptor. -->
 <!-- default value: none -->
 <descriptorPaths>
  <descriptorPath>${project.build.directory}/generated-sources/ruta/
   descriptor</descriptorPath>
 </descriptorPaths>

 <!-- Resource paths of the generated analysis engine descriptor. -->
 <!-- default value: none -->
 <resourcePaths>
  <resourcePath>${basedir}/src/main/resources/</resourcePath>
  <resourcePath>${project.build.directory}/generated-sources/ruta/
   resources/</resourcePath>
 </resourcePaths>

 <!-- Suffix used for the generated type system descriptors. -->
 <!-- default value: Engine -->
 <analysisEngineSuffix>Engine</analysisEngineSuffix>

 <!-- Suffix used for the generated analysis engine descriptors. -->
 <!-- default value: TypeSystem -->
 <typeSystemSuffix>TypeSystem</typeSystemSuffix>

 <!-- Source file encoding. -->
 <!-- default value: ${project.build.sourceEncoding} -->
 <encoding>UTF-8</encoding>

 <!-- Type of type system imports. false = import by location. -->
 <!-- default value: false -->
 <importByName>false</importByName>

 <!-- Option to resolve imports while building. -->
 <!-- default value: false -->
 <resolveImports>false</resolveImports>

 <!-- Amount of retries for building dependent descriptors. Default value 
  -1 leads to three retires for each script. -->
  <!-- default value: -1 -->
 <maxBuildRetries>-1</maxBuildRetries>

 <!-- List of packages with language extensions -->
 <!-- default value: none -->
 <extensionPackages>
  <extensionPackage>org.apache.uima.ruta</extensionPackage>
 </extensionPackages>

 <!-- Add UIMA Ruta nature to .project -->
 <!-- default value: false -->
 <addRutaNature>true</addRutaNature>


 <!-- Buildpath of the UIMA Ruta Workbench (IDE) for this project -->
 <!-- default value: none -->
 <buildPaths>
  <buildPath>script:src/main/ruta/</buildPath>
  <buildPath>descriptor:target/generated-sources/ruta/descriptor/
  </buildPath>
  <buildPath>resources:src/main/resources/</buildPath>
 </buildPaths>

</configuration>
<executions>
 <execution>
  <id>default</id>
  <phase>process-classes</phase>
  <goals>
   <goal>generate</goal>
  </goals>
 </execution>
</executions>
</plugin>
]]>		</programlisting>
    <para>The configuration parameters for this goal either define the build behavior, 
    e.g., where the generated descriptor should be placed or which suffix the files should get,
    or the configuration of the generated analysis engine descriptor, e.g., 
    the values of the configuration parameter scriptPaths.
    However, there are also other parameters: addRutaNature and buildPaths. 
    Both can be utilized to configure the current Eclipse project (due to the missing m2e connector).
    This is required if the functionality of the UIMA Ruta Workbench, e.g., syntax checking or auto-completion,
     should be available in the maven project. If the parameter addRutaNature is set to true, then
     the UIMA Ruta Workbench will recognize the project as a script project. Only then, 
     the buildpath of the UIMA Ruta project can be configured using the buildPaths parameter, which specifies 
     the three important source folders of the UIMA Ruta project. In normal UIMA Ruta Workbench projects, 
     these are script, descriptor and resources.
    </para>
		</section>
		
		<section>
		<title>twl goal</title>
    <para>
    The twl goal can be utilized to create .twl files from .txt files.
    Its usage and configuration is summarized in the following example:
    </para>
    <programlisting><![CDATA[<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>3.0.1</version>
<configuration></configuration>
<executions>
<execution>
 <id>default</id>
 <phase>process-classes</phase>
 <goals>
  <goal>twl</goal>
 </goals>
 <configuration>
  <!-- This is a exemplary configuration, which explicitly specifies 
   the default configuration values if not mentioned otherwise. -->

  <!-- Compress resulting tree word list. -->
  <!-- default value: true -->
  <compress>true</compress>

  <!-- The source files for the tree word list. -->
  <!-- default value: none -->
  <inputFiles>
   <directory>${basedir}/src/main/resources</directory>
   <includes>
    <include>*.txt</include>
   </includes>
  </inputFiles>

  <!-- The directory where the generated tree word lists will be 
    written to.-->
  <!-- default value: ${project.build.directory}/generated-sources/
    ruta/resources/ -->
  <outputDirectory>${project.build.directory}/generated-sources/ruta/
    resources/</outputDirectory>

  <!-- Source file encoding. -->
  <!-- default value: ${project.build.sourceEncoding} -->
  <encoding>UTF-8</encoding>

 </configuration>
</execution>
</executions>
</plugin>
    ]]>   </programlisting>
		</section>
		
		<section>
    <title>mtwl goal</title>
    <para>
    The mtwl goal can be utilized to create a .mtwl file from multiple .txt files.
    Its usage and configuration is summarized in the following example:
    </para>
    <programlisting><![CDATA[<plugin>
<groupId>org.apache.uima</groupId>
<artifactId>ruta-maven-plugin</artifactId>
<version>3.0.1</version>
<configuration></configuration>
<executions>
<execution>
 <id>default</id>
 <phase>process-classes</phase>
 <goals>
  <goal>mtwl</goal>
 </goals>
 <configuration>
  <!-- This is a exemplary configuration, which explicitly specifies 
   the default configuration values if not mentioned otherwise. -->

  <!-- Compress resulting tree word list. -->
  <!-- default value: true -->
  <compress>true</compress>

  <!-- The source files for the multi tree word list. -->
  <!-- default value: none -->
  <inputFiles>
   <directory>${basedir}/src/main/resources</directory>
   <includes>
    <include>*.txt</include>
   </includes>
  </inputFiles>

  <!-- The directory where the generated tree word list will be 
    written to. -->
  <!-- default value: ${project.build.directory}/generated-sources/ruta/
    resources/generated.mtwl -->
  <outputFile>${project.build.directory}/generated-sources/ruta/resources/
    generated.mtwl</outputFile>

  <!-- Source file encoding. -->
  <!-- default value: ${project.build.sourceEncoding} -->
  <encoding>UTF-8</encoding>
  
 </configuration>
</execution>
</executions>
</plugin>
    ]]>   </programlisting>
    </section>
		
	</section>
  
  <section id="ugr.tools.ruta.archetype">
    <title>UIMA Ruta Maven Archetype</title>
    <para>
      UIMA Ruta provides a maven archetype for creating maven projects that preconfigured for 
      building UIMA Ruta scripts with maven and contain already a minimal example with a unit test, 
      which can be utilized as a starting point.
    </para>
       
       
    <para>
      A UIMA Ruta project are created with following command using the the archetype (in one line): 
    </para>  
    
    <programlisting><![CDATA[mvn archetype:generate 
    -DarchetypeGroupId=org.apache.uima 
    -DarchetypeArtifactId=ruta-maven-archetype 
    -DarchetypeVersion=<ruta-version>
    -DgroupId=<package> 
    -DartifactId=<project-name>]]></programlisting>
    
    <para>
      The placeholders need to be replaced with the corresponding values. This could look like:
    </para>  
    
    <programlisting><![CDATA[mvn archetype:generate -DarchetypeGroupId=org.apache.uima 
    -DarchetypeArtifactId=ruta-maven-archetype -DarchetypeVersion=3.0.1
    -DgroupId=my.domain -DartifactId=my-ruta-project]]></programlisting>
    
    <para>
      Using the archetype in Eclipse to create a project may result in some missing replacements 
      of variables and thus to broken projects. Using the archetype on command line is recommended.
    </para>  
    
    <para>
      In the creation process, several properties need to be defined. 
      Their default values can be accepted by simply pressing the return key.
      After the project was created successfully, switch to the new folder and enter 'mvn install'.
      Now, the UIMA Ruta project is built: the descriptors for the UIMA Ruta script are created, 
      the wordlist is compiled to a MTWL file, and the unit test verifies the overall functionality. 
    </para>  
    
  </section>

  <section id="section.tools.ruta.workbench.textruler.example">
   <title>Induce rules with the TextRuler framework</title>
      <para> 
      This section gives a short example how the TextRuler framework is applied in order to induce annotation rules. We refer to the screenshot in <xref linkend="figure.tools.ruta.workbench.textruler.main"/>
      for the configuration and are using the exemplary UIMA Ruta project <quote>TextRulerExample</quote>, which is part of the source release of UIMA Ruta.
      After importing the project into your workspace, please rebuild all UIMA Ruta scripts in order to create the descriptors, e.g., by cleaning the project.
      </para>
      <para> 
        In this example, we are using the <quote>KEP</quote> algorithm for learning annotation rules for identifying Bibtex entries in the reference section of scientific publications:
        <orderedlist>
        <listitem>
          <para>Select the folder <quote>single</quote> and drag and drop it to the <quote>Training Data</quote> text field. This folder contains one file with 
          correct annotations and serves as gold standard data in our example.</para>
        </listitem>
        <listitem>
          <para>Select the file <quote>Feature.ruta</quote> and drag and drop it to the <quote>Preprocess Script</quote> text field. This UIMA Ruta script knows all necessary types, especially the types
          of the annotations we try the learn rules for, and additionally it contains rules that create useful annotations, which can be used by the algorithm in order to learn better rules.</para>
        </listitem>
        <listitem>
          <para>Select the file <quote>InfoTypes.txt</quote> and drag and drop it to the <quote>Information Types</quote> list. This specifies the goal of the learning process, 
          which types of annotations should be annotated by the induced rules, respectively.</para>
        </listitem>
        <listitem>
          <para>Check the checkbox of the <quote>KEP</quote> algorithm and press the start button in the toolbar fo the view.</para>
        </listitem>
        <listitem>
          <para>The algorithm now tries to induce rules for the targeted types. The current result is displayed in the view <quote>KEP Results</quote> in the right part of the perspective.</para>
        </listitem>
        <listitem>
          <para>After the algorithms finished the learning process, create a new UIMA Ruta file in the <quote>uima.ruta.example</quote> package and copy the content of the result view
          to the new file. Now, the induced rules can be applied as a normal UIMA Ruta script file.</para>
        </listitem>
      </orderedlist>
      </para>
    </section>
    <section id="section.tools.ruta.howto.html">
     <title>HTML annotations in plain text</title>
      <para> 
       The following script provides an example how to process HTML files with UIMA Ruta in order to get plain text documents 
       that still contain information about the HTML tags in form of annotations. The analysis engine descriptor HtmlViewWriter is identical to the common ViewWriter, 
       but additionally specifies a type system. More information about different options to configure the
       conversion can be found in <link linkend='ugr.tools.ruta.ae.htmlconverter'>here</link>.
      </para>
          <programlisting><![CDATA[PACKAGE uima.ruta.example;

ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE HtmlViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.SourceDocumentInformation;

Document{-> RETAINTYPE(SPACE,BREAK)};
Document{-> EXEC(HtmlAnnotator)};

Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView", 
    "outputView" = "plain"), 
      EXEC(HtmlConverter)};

Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain", 
    "outputView" = "_InitialView", "output" = "/../converted/"), 
    EXEC(HtmlViewWriter)};
    ]]>   </programlisting>
    </section>
    
    <section id="section.tools.ruta.howto.sorter">
     <title>Sorting files with UIMA Ruta</title>
      <para> 
       The following script provides an example how to utilize UIMA Ruta for sorting files.
      </para>
          <programlisting><![CDATA[ENGINE utils.XMIWriter;
TYPESYSTEM utils.SourceDocumentInformation;

DECLARE Pattern;

// some rule-based pattern
(NUM SPECIAL NUM SPECIAL NUM){-> Pattern};

Document{CONTAINS(Pattern)->CONFIGURE(XMIWriter, 
  "Output" = "../with/"), EXEC(XMIWriter)};
Document{-CONTAINS(Pattern)->CONFIGURE(XMIWriter, 
  "Output" = "../without/"), EXEC(XMIWriter)};
    ]]>   </programlisting>
    </section>
    <section id="section.tools.ruta.howto.xml">
     <title>Converting XML documents with UIMA Ruta</title>
      <para> 
       The following script provides an example how to process XML files in order to retain only the text content. the removed XML elements should, howver, be available as annotations. 
       This script can therefore be applied to create xmiCAS files from text document annotated with XML tags. The analysis engine descriptor TEIViewWriter is identical to the common ViewWriter, 
       but additionally specifies a type system.
      </para>
          <programlisting><![CDATA[ENGINE utils.HtmlAnnotator;
TYPESYSTEM utils.HtmlTypeSystem;
ENGINE utils.HtmlConverter;
ENGINE TEIViewWriter;
TYPESYSTEM utils.SourceDocumentInformation;

DECLARE PersName, LastName, FirstName, AddName;

Document{->EXEC(HtmlAnnotator, {TAG})};
Document{-> RETAINTYPE(MARKUP,SPACE)};
TAG.name=="PERSNAME"{-> PersName};
TAG.name=="SURNAME"{-> LastName};
TAG.name=="FORENAME"{-> FirstName};
TAG.name=="ADDNAME"{-> AddName};
Document{-> RETAINTYPE};

Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView", 
    "outputView" = "plain", "skipWhitespaces" = false), 
      EXEC(HtmlConverter)};

Document{ -> CONFIGURE(TEIViewWriter, "inputView" = "plain", "outputView" = 
    "_InitialView", "output" = "/../converted/"), 
    EXEC(TEIViewWriter)};
    ]]>   </programlisting>
    </section>
</chapter>
