| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[ |
| <!ENTITY imgroot "images/tools/tools.ruta/" > |
| <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" > |
| %uimaents; |
| ]> |
| <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE file distributed with this work for additional |
| information regarding copyright ownership. The ASF licenses this file to |
| you under the Apache License, Version 2.0 (the "License"); you may not use |
| this file except in compliance with the License. You may obtain a copy of |
| the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required |
| by applicable law or agreed to in writing, software distributed under the |
| License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| OF ANY KIND, either express or implied. See the License for the specific |
| language governing permissions and limitations under the License. --> |
| <chapter id="ugr.tools.ruta.howtos"> |
| <title>Apache UIMA Ruta HowTos</title> |
| <para>This chapter contains a selection of use cases and HowTos |
| for UIMA Ruta. |
| </para> |
| <section id="ugr.tools.ruta.ae.basic.apply"> |
| <title>Apply UIMA Ruta Analysis Engine in plain Java</title> |
| <para> |
| Let us assume that the rules were written with the UIMA Ruta |
| Workbench, which already creates correctly configured descriptors. |
| In this case, the following Java code can be used to apply the |
| UIMA Ruta script. |
| </para> |
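| <programlisting><![CDATA[import java.io.File; |
| import org.apache.uima.UIMAFramework; |
| import org.apache.uima.analysis_engine.AnalysisEngine; |
| import org.apache.uima.cas.CAS; |
| import org.apache.uima.resource.ResourceSpecifier; |
| import org.apache.uima.util.XMLInputSource; |
| |
| // Descriptor generated by the UIMA Ruta Workbench for the script |
| File specFile = new File("pathToMyWorkspace/MyProject/descriptor/"+ |
| "my/package/MyScriptEngine.xml"); |
| XMLInputSource in = new XMLInputSource(specFile); |
| ResourceSpecifier specifier = UIMAFramework.getXMLParser(). |
| parseResourceSpecifier(in); |
| // for import by name... set the datapath in the ResourceManager |
| AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(specifier); |
| CAS cas = ae.newCAS(); |
| cas.setDocumentText("This is my document."); |
| // Apply the rules of the script to the document |
| ae.process(cas);]]></programlisting> |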
| <note> |
| <para> |
| The UIMA Ruta Analysis Engine utilizes type priorities. If the |
| CAS object is created by other means than the UIMA Ruta Analysis |
| Engine descriptor, then please provide the necessary type |
| priorities for a valid execution of the UIMA Ruta rules. |
| </para> |
| </note> |
| <para> |
| If the UIMA Ruta script was written, for example, with a common text |
| editor and no configured descriptors are available yet, then the |
| following Java code can be used. As written, it is only applicable |
| for executing single script files that do not import additional |
| components or scripts. Otherwise, the other parameters, e.g., |
| <quote>additionalScripts</quote>, need to be configured correctly |
| as well. |
| </para> |
| <programlisting><![CDATA[URL aedesc = RutaEngine.class.getResource("BasicEngine.xml"); |
| XMLInputSource inae = new XMLInputSource(aedesc); |
| ResourceSpecifier specifier = UIMAFramework.getXMLParser(). |
| parseResourceSpecifier(inae); |
| ResourceManager resMgr = UIMAFramework.newDefaultResourceManager(); |
| AnalysisEngineDescription aed = (AnalysisEngineDescription) specifier; |
| TypeSystemDescription basicTypeSystem = aed.getAnalysisEngineMetaData(). |
| getTypeSystem(); |
| |
| Collection<TypeSystemDescription> tsds = |
| new ArrayList<TypeSystemDescription>(); |
| tsds.add(basicTypeSystem); |
| // add some other type system descriptors |
| // that are needed by your script file |
| TypeSystemDescription mergeTypeSystems = CasCreationUtils. |
| mergeTypeSystems(tsds); |
| aed.getAnalysisEngineMetaData().setTypeSystem(mergeTypeSystems); |
| aed.resolveImports(resMgr); |
| |
| AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aed, |
| resMgr, null); |
| File scriptFile = new File("path/to/file/MyScript.ruta"); |
| ae.setConfigParameterValue(RutaEngine.PARAM_SCRIPT_PATHS, |
| new String[] { scriptFile.getParentFile().getAbsolutePath() }); |
| String name = scriptFile.getName().substring(0, |
| scriptFile.getName().length() - 5); |
| ae.setConfigParameterValue(RutaEngine.PARAM_MAIN_SCRIPT, name); |
| ae.reconfigure(); |
| CAS cas = ae.newCAS(); |
| cas.setDocumentText("This is my document."); |
| ae.process(cas);]]></programlisting> |
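| <para> |
| The listing above removes the file extension with a hard-coded |
| length of 5, which corresponds to the suffix |
| <quote>.ruta</quote>. The following minimal sketch shows the same |
| derivation of the main script name with an explicit suffix check. |
| It uses only the JDK; the class name ScriptNameSketch and the method |
| name mainScriptName are hypothetical and not part of UIMA Ruta. |
| </para> |
| <programlisting><![CDATA[import java.io.File; |
| |
| public class ScriptNameSketch { |
| |
|   // Suffix of UIMA Ruta script files, as used in the listing above. |
|   static final String SUFFIX = ".ruta"; |
| |
|   // Returns the file name without the ".ruta" suffix, |
|   // or the unchanged file name if the suffix is missing. |
|   static String mainScriptName(File scriptFile) { |
|     String fileName = scriptFile.getName(); |
|     return fileName.endsWith(SUFFIX) |
|         ? fileName.substring(0, fileName.length() - SUFFIX.length()) |
|         : fileName; |
|   } |
| |
|   public static void main(String[] args) { |
|     File scriptFile = new File("path/to/file/MyScript.ruta"); |
|     System.out.println(mainScriptName(scriptFile)); // prints "MyScript" |
|   } |
| }]]></programlisting> |
| <para> |
| The returned name could then be passed as the value of |
| RutaEngine.PARAM_MAIN_SCRIPT, as in the listing above. |
| </para> |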
| <para> |
| There is also a convenience implementation for applying simple |
| scripts that do not introduce new types. The following Java code |
| applies a simple rule |
| <quote>T1 SW{-> MARK(T2)};</quote> |
| on the given CAS. Note that the types need to be already defined in |
| the type system of the CAS. |
| </para> |
| <programlisting><![CDATA[Ruta.apply(cas, "T1 SW{-> MARK(T2)};");]]></programlisting> |
| </section> |
| <section id="ugr.tools.ruta.integration"> |
| <title>Integrating UIMA Ruta in an existing UIMA Annotator</title> |
| <para>This section provides a walk-through tutorial on integrating |
| Ruta in an existing UIMA annotator. In our artificial example |
| we will |
| use Ruta rules to post-process the output of a |
| Part-of-Speech tagger. |
| The POS tagger is a UIMA annotator that iterates over |
| sentences |
| and |
| tokens and updates the posTag field of each Token with a part of |
| speech. For example, given this text... |
| </para> |
| <programlisting>The quick brown fox receives many bets. |
| The fox places many bets. |
| The fox gets up early. |
| The rabbit made up this story.</programlisting> |
| <para>...it assigns the posTag JJ (adjective) to the token |
| "brown", the posTag NN (common noun) to the token "fox" and the |
| tag VBZ (verb, 3rd person singular present) to the token |
| "receives" in the first sentence. |
| </para> |
| <para>We have noticed that the tagger sometimes fails to disambiguate |
| NNS (common noun plural) and VBZ tags, as in the second sentence. |
| The word "up" also seems to confuse the tagger, which always |
| assigns it an RB (adverb) tag, even when it is a particle (RP) |
| following a verb, as in the third and fourth sentences: |
| </para> |
| <programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|. |
| The|DT fox|NN places|NNS many|JJ bets|NNS .|. |
| The|DT fox|NN gets|VBZ up|RB early|RB .|. |
| The|DT rabbit|NN made|VBD up|RB this|DT story|NN .|.</programlisting> |
| <para>Let's imagine that after applying every possible approach |
| available in the POS tagging literature, our tagger still generates |
| these and some other errors. We decide to write a few Ruta |
| rules to |
| post-process the output of the tagger. |
| </para> |
| <section id="ugr.tools.ruta.ae.integration.mvn"> |
| <title>Adding Ruta to our Annotator</title> |
| <para>The POS tagger is being developed as a Maven-based project. |
| Since the Ruta Maven artifacts are available on Maven Central, we |
| add the following dependency to the project's pom.xml. The |
| functionalities described in this section require a version of |
| Ruta equal to or greater than 2.1.0. |
| </para> |
| <programlisting><![CDATA[<dependency> |
| <groupId>org.apache.uima</groupId> |
| <artifactId>ruta-core</artifactId> |
| <version>[2.1.0,)</version> |
| </dependency>]]></programlisting> |
| <para>We also take care that the Ruta basic type system is loaded when |
| our annotator is initialized. The Ruta type system descriptors are |
| available from |
| ruta-core/src/main/resources/org/apache/uima/ruta/engine/ |
| </para> |
| </section> |
| <section id="ugr.tools.ruta.ae.integration.loading"> |
| <title>Developing Ruta rules and applying them from inside Java code |
| </title> |
| <para>We are now ready to write some rules. The ones we develop for |
| fixing the two errors look like this: |
| </para> |
| <programlisting>Token.posTag=="NN" Token.posTag=="NNS"{-> Token.posTag="VBZ"} |
| Token.posTag=="JJ"; |
| Token{REGEXP(Token.posTag, "VB(.?)")} |
| Token.posTag=="RB"{REGEXP("up")-> Token.posTag="RP"};</programlisting> |
| <para>That |
| is, we change a Token's NNS tag to VBZ, if it is surrounded by a |
| Token tagged as NN and a Token tagged as JJ. We also change an RB |
| tag for an "up" token to RP, if "up" is preceded |
| by any verbal tag (VB, VBZ, etc.) matched with the help of the |
| <link linkend='ugr.tools.ruta.language.conditions.regexp'>REGEXP</link> |
| condition. |
| </para> |
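| <para> |
| To make the intended effect of the two rules concrete, the following |
| minimal sketch emulates them in plain Java on the |
| <quote>word|TAG</quote> notation used in this section. It is only an |
| illustration under stated assumptions and does not show how UIMA Ruta |
| executes rules: the class PosFixSketch and its Tok type are |
| hypothetical stand-ins for the tagger's Token annotation, each call |
| handles a single sentence, and the regular expression is assumed to |
| match the complete tag. |
| </para> |
| <programlisting><![CDATA[import java.util.ArrayList; |
| import java.util.List; |
| |
| public class PosFixSketch { |
| |
|   // Hypothetical stand-in for the tagger's Token annotation. |
|   static final class Tok { |
|     final String text; |
|     String posTag; |
| |
|     Tok(String text, String posTag) { |
|       this.text = text; |
|       this.posTag = posTag; |
|     } |
|   } |
| |
|   // Parses a line such as "The|DT fox|NN" into tokens. |
|   static List<Tok> parse(String line) { |
|     List<Tok> toks = new ArrayList<Tok>(); |
|     for (String part : line.trim().split("\\s+")) { |
|       int bar = part.lastIndexOf('|'); |
|       toks.add(new Tok(part.substring(0, bar), part.substring(bar + 1))); |
|     } |
|     return toks; |
|   } |
| |
|   // Emulates the two rules on one sentence. |
|   static void fix(List<Tok> toks) { |
|     // Rule 1: an NNS between an NN and a JJ becomes VBZ. |
|     for (int i = 1; i + 1 < toks.size(); i++) { |
|       if ("NN".equals(toks.get(i - 1).posTag) |
|           && "NNS".equals(toks.get(i).posTag) |
|           && "JJ".equals(toks.get(i + 1).posTag)) { |
|         toks.get(i).posTag = "VBZ"; |
|       } |
|     } |
|     // Rule 2: an "up" tagged RB after any verbal tag becomes RP. |
|     for (int i = 1; i < toks.size(); i++) { |
|       Tok t = toks.get(i); |
|       if (toks.get(i - 1).posTag.matches("VB(.?)") |
|           && "RB".equals(t.posTag) && "up".equals(t.text)) { |
|         t.posTag = "RP"; |
|       } |
|     } |
|   } |
| |
|   // Renders tokens back into the "word|TAG" notation. |
|   static String render(List<Tok> toks) { |
|     StringBuilder sb = new StringBuilder(); |
|     for (Tok t : toks) { |
|       if (sb.length() > 0) { |
|         sb.append(' '); |
|       } |
|       sb.append(t.text).append('|').append(t.posTag); |
|     } |
|     return sb.toString(); |
|   } |
| |
|   public static void main(String[] args) { |
|     List<Tok> toks = parse("The|DT fox|NN places|NNS many|JJ bets|NNS .|."); |
|     fix(toks); |
|     // prints "The|DT fox|NN places|VBZ many|JJ bets|NNS .|." |
|     System.out.println(render(toks)); |
|   } |
| }]]></programlisting> |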
| <para>We test our rules in the Ruta Workbench and see that they |
| indeed fix most of our problems. We save those and some more rules |
| in a text file |
| src/main/resources/ruta.txt. |
| </para> |
| <para>We declare the file with our rules as an external resource and |
| we load it during initialization. |
| Here's a way to do it using |
| uimaFIT: |
| </para> |
| <programlisting>/** |
| * Ruta rules for post-processing the tagger's output |
| */ |
| public static final String RUTA_RULES_PARA = "RutaRules"; |
| @ExternalResource(key = RUTA_RULES_PARA, mandatory = false) |
| ... |
| File rutaRulesF = new File((String) |
| aContext.getConfigParameterValue(RUTA_RULES_PARA)); |
| </programlisting> |
| <para>After our CAS has been populated with posTag annotations from |
| the main algorithm, we post-process the CAS using |
| Ruta.apply(): |
| </para> |
| <programlisting>String rutaRules = org.apache.commons.io.FileUtils.readFileToString( |
| rutaRulesF, "UTF-8"); |
| Ruta.apply(cas, rutaRules); |
| </programlisting> |
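| <para> |
| If Apache Commons IO is not on the classpath, the rule file can also |
| be read with the JDK alone. The following is a minimal sketch, |
| assuming Java 7 or later; the class name RuleFileSketch and the |
| method name readRules are hypothetical, and the temporary file only |
| serves to keep the example self-contained. |
| </para> |
| <programlisting><![CDATA[import java.io.IOException; |
| import java.nio.charset.StandardCharsets; |
| import java.nio.file.Files; |
| import java.nio.file.Path; |
| |
| public class RuleFileSketch { |
| |
|   // Reads the whole rule file as one UTF-8 string. |
|   static String readRules(Path rulesFile) throws IOException { |
|     return new String(Files.readAllBytes(rulesFile), StandardCharsets.UTF_8); |
|   } |
| |
|   public static void main(String[] args) throws IOException { |
|     // Write a temporary rule file so that the sketch is self-contained. |
|     Path tmp = Files.createTempFile("ruta", ".txt"); |
|     Files.write(tmp, "T1 SW{-> MARK(T2)};".getBytes(StandardCharsets.UTF_8)); |
|     String rutaRules = readRules(tmp); |
|     System.out.println(rutaRules); |
|     Files.delete(tmp); |
|     // rutaRules could then be passed to Ruta.apply(cas, rutaRules). |
|   } |
| }]]></programlisting> |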
| <para>We are happy to see that the final output of our annotator |
| now looks much better: |
| </para> |
| <programlisting>The|DT quick|JJ brown|JJ fox|NN receives|VBZ many|JJ bets|NNS .|. |
| The|DT fox|NN places|VBZ many|JJ bets|NNS .|. |
| The|DT fox|NN gets|VBZ up|RP early|RB .|. |
| The|DT rabbit|NN made|VBD up|RP this|DT story|NN .|.</programlisting> |
| </section> |
| </section> |
| |
| |
| <section id="ugr.tools.ruta.maven"> |
| <title>UIMA Ruta Maven Plugin</title> |
| <para>UIMA Ruta provides a Maven plugin for building analysis engine |
| and type system descriptors for rule scripts. |
| Additionally, this Maven plugin is able to compile word lists (gazetteers) to |
| the more efficient structures, tree word list and multi tree word |
| list. The usage and configuration are briefly summarized in the |
| following. An exemplary Maven project for UIMA Ruta is given here: |
| <code>https://svn.apache.org/repos/asf/uima/ruta/trunk/example-projects/ruta-maven-example</code> |
| </para> |
| <section> |
| <title>generate goal</title> |
| <para> |
| The generate goal can be utilized to create XML descriptors for the UIMA Ruta script files. |
| Its usage and configuration are summarized in the following example: |
| </para> |
| <programlisting><![CDATA[<plugin> |
| <groupId>org.apache.uima</groupId> |
| <artifactId>ruta-maven-plugin</artifactId> |
| <version>2.3.0</version> |
| <configuration> |
| |
| <!-- This is an exemplary configuration, which explicitly specifies the |
| default configuration values if not mentioned otherwise. --> |
| |
| <!-- |
| The following parameter is optional and should only be specified |
| if the structure (e.g., classpath/resources) of the project requires it. |
| |
| A FileSet specifying the UIMA Ruta script files that should be built. |
| |
| If this parameter is not specified, then all UIMA Ruta script files |
| in the output directory (e.g., target/classes) of the project will be built. |
| |
| default value: none |
| <scriptFiles> |
| <directory>${basedir}/some/folder</directory> |
| <includes> |
| <include>*.ruta</include> |
| </includes> |
| </scriptFiles> |
| --> |
| |
| <!-- The directory where the generated type system descriptors will |
| be stored. --> |
| <!-- default value: ${project.build.directory}/generated-sources/ |
| ruta/descriptor --> |
| <typeSystemOutputDirectory>${project.build.directory}/generated-sources/ |
| ruta/descriptor</typeSystemOutputDirectory> |
| |
| <!-- The directory where the generated analysis engine descriptors will |
| be stored. --> |
| <!-- default value: ${project.build.directory}/generated-sources/ruta/ |
| descriptor --> |
| <analysisEngineOutputDirectory>${project.build.directory}/ |
| generated-sources/ruta/descriptor</analysisEngineOutputDirectory> |
| |
| <!-- The template descriptor for the generated type system. By default the |
| descriptor of the maven dependency is loaded. --> |
| <!-- default value: none --> |
| <!-- not used in this example <typeSystemTemplate>... |
| </typeSystemTemplate> --> |
| |
| <!-- The template descriptor for the generated analysis engine. |
| By default the descriptor of the maven dependency is loaded. --> |
| <!-- default value: none --> |
| <!-- not used in this example <analysisEngineTemplate>... |
| </analysisEngineTemplate> --> |
| |
| <!-- Script paths of the generated analysis engine descriptor. --> |
| <!-- default value: none --> |
| <scriptPaths> |
| <scriptPath>${basedir}/src/main/ruta/</scriptPath> |
| </scriptPaths> |
| |
| <!-- Descriptor paths of the generated analysis engine descriptor. --> |
| <!-- default value: none --> |
| <descriptorPaths> |
| <descriptorPath>${project.build.directory}/generated-sources/ruta/ |
| descriptor</descriptorPath> |
| </descriptorPaths> |
| |
| <!-- Resource paths of the generated analysis engine descriptor. --> |
| <!-- default value: none --> |
| <resourcePaths> |
| <resourcePath>${basedir}/src/main/resources/</resourcePath> |
| <resourcePath>${project.build.directory}/generated-sources/ruta/ |
| resources/</resourcePath> |
| </resourcePaths> |
| |
| <!-- Suffix used for the generated analysis engine descriptors. --> |
| <!-- default value: Engine --> |
| <analysisEngineSuffix>Engine</analysisEngineSuffix> |
| |
| <!-- Suffix used for the generated type system descriptors. --> |
| <!-- default value: TypeSystem --> |
| <typeSystemSuffix>TypeSystem</typeSystemSuffix> |
| |
| <!-- Source file encoding. --> |
| <!-- default value: ${project.build.sourceEncoding} --> |
| <encoding>UTF-8</encoding> |
| |
| <!-- Type of type system imports. false = import by location. --> |
| <!-- default value: false --> |
| <importByName>false</importByName> |
| |
| <!-- Option to resolve imports while building. --> |
| <!-- default value: false --> |
| <resolveImports>false</resolveImports> |
| |
| <!-- Number of retries for building dependent descriptors. Default value |
| -1 leads to three retries for each script. --> |
| <!-- default value: -1 --> |
| <maxBuildRetries>-1</maxBuildRetries> |
| |
| <!-- List of packages with language extensions --> |
| <!-- default value: none --> |
| <extensionPackages> |
| <extensionPackage>org.apache.uima.ruta</extensionPackage> |
| </extensionPackages> |
| |
| <!-- Add UIMA Ruta nature to .project --> |
| <!-- default value: false --> |
| <addRutaNature>true</addRutaNature> |
| |
| |
| <!-- Buildpath of the UIMA Ruta Workbench (IDE) for this project --> |
| <!-- default value: none --> |
| <buildPaths> |
| <buildPath>script:src/main/ruta/</buildPath> |
| <buildPath>descriptor:target/generated-sources/ruta/descriptor/ |
| </buildPath> |
| <buildPath>resources:src/main/resources/</buildPath> |
| </buildPaths> |
| |
| </configuration> |
| <executions> |
| <execution> |
| <id>default</id> |
| <phase>process-classes</phase> |
| <goals> |
| <goal>generate</goal> |
| </goals> |
| </execution> |
| </executions> |
| </plugin> |
| ]]> </programlisting> |
| <para>The configuration parameters for this goal either define the build behavior, |
| e.g., where the generated descriptors should be placed or which suffix the files should get, |
| or the configuration of the generated analysis engine descriptor, e.g., |
| the values of the configuration parameter scriptPaths. |
| However, there are also other parameters: addRutaNature and buildPaths. |
| Both can be utilized to configure the current Eclipse project (due to the missing m2e connector). |
| This is required if the functionality of the UIMA Ruta Workbench, e.g., syntax checking or auto-completion, |
| should be available in the Maven project. If the parameter addRutaNature is set to true, then |
| the UIMA Ruta Workbench will recognize the project as a script project. Only then can |
| the buildpath of the UIMA Ruta project be configured using the buildPaths parameter, which specifies |
| the three important source folders of the UIMA Ruta project. In normal UIMA Ruta Workbench projects, |
| these are script, descriptor and resources. |
| </para> |
| </section> |
| |
| <section> |
| <title>twl goal</title> |
| <para> |
| The twl goal can be utilized to create .twl files from .txt files. |
| Its usage and configuration are summarized in the following example: |
| </para> |
| <programlisting><![CDATA[<plugin> |
| <groupId>org.apache.uima</groupId> |
| <artifactId>ruta-maven-plugin</artifactId> |
| <version>2.3.0</version> |
| <configuration></configuration> |
| <executions> |
| <execution> |
| <id>default</id> |
| <phase>process-classes</phase> |
| <goals> |
| <goal>twl</goal> |
| </goals> |
| <configuration> |
| <!-- This is an exemplary configuration, which explicitly specifies |
| the default configuration values if not mentioned otherwise. --> |
| |
| <!-- Compress resulting tree word list. --> |
| <!-- default value: true --> |
| <compress>true</compress> |
| |
| <!-- The source files for the tree word list. --> |
| <!-- default value: none --> |
| <inputFiles> |
| <directory>${basedir}/src/main/resources</directory> |
| <includes> |
| <include>*.txt</include> |
| </includes> |
| </inputFiles> |
| |
| <!-- The directory where the generated tree word lists will be |
| written to.--> |
| <!-- default value: ${project.build.directory}/generated-sources/ |
| ruta/resources/ --> |
| <outputDirectory>${project.build.directory}/generated-sources/ruta/ |
| resources/</outputDirectory> |
| |
| <!-- Source file encoding. --> |
| <!-- default value: ${project.build.sourceEncoding} --> |
| <encoding>UTF-8</encoding> |
| |
| </configuration> |
| </execution> |
| </executions> |
| </plugin> |
| ]]> </programlisting> |
| </section> |
| |
| <section> |
| <title>mtwl goal</title> |
| <para> |
| The mtwl goal can be utilized to create a .mtwl file from multiple .txt files. |
| Its usage and configuration are summarized in the following example: |
| </para> |
| <programlisting><![CDATA[<plugin> |
| <groupId>org.apache.uima</groupId> |
| <artifactId>ruta-maven-plugin</artifactId> |
| <version>2.3.0</version> |
| <configuration></configuration> |
| <executions> |
| <execution> |
| <id>default</id> |
| <phase>process-classes</phase> |
| <goals> |
| <goal>mtwl</goal> |
| </goals> |
| <configuration> |
| <!-- This is an exemplary configuration, which explicitly specifies |
| the default configuration values if not mentioned otherwise. --> |
| |
| <!-- Compress resulting tree word list. --> |
| <!-- default value: true --> |
| <compress>true</compress> |
| |
| <!-- The source files for the multi tree word list. --> |
| <!-- default value: none --> |
| <inputFiles> |
| <directory>${basedir}/src/main/resources</directory> |
| <includes> |
| <include>*.txt</include> |
| </includes> |
| </inputFiles> |
| |
| <!-- The file to which the generated multi tree word list will be |
| written. --> |
| <!-- default value: ${project.build.directory}/generated-sources/ruta/ |
| resources/generated.mtwl --> |
| <outputFile>${project.build.directory}/generated-sources/ruta/resources/ |
| generated.mtwl</outputFile> |
| |
| <!-- Source file encoding. --> |
| <!-- default value: ${project.build.sourceEncoding} --> |
| <encoding>UTF-8</encoding> |
| |
| </configuration> |
| </execution> |
| </executions> |
| </plugin> |
| ]]> </programlisting> |
| </section> |
| |
| </section> |
| |
| <section id="section.tools.ruta.workbench.textruler.example"> |
| <title>Induce rules with the TextRuler framework</title> |
| <para> |
| This section gives a short example of how the TextRuler framework is applied in order to induce annotation rules. We refer to the screenshot in <xref linkend="figure.tools.ruta.workbench.textruler.main"/> |
| for the configuration and are using the exemplary UIMA Ruta project <quote>TextRulerExample</quote>, which is part of the source release of UIMA Ruta. |
| After importing the project into your workspace, please rebuild all UIMA Ruta scripts in order to create the descriptors, e.g., by cleaning the project. |
| </para> |
| <para> |
| In this example, we are using the <quote>KEP</quote> algorithm for learning annotation rules for identifying BibTeX entries in the reference section of scientific publications: |
| <orderedlist> |
| <listitem> |
| <para>Select the folder <quote>single</quote> and drag and drop it to the <quote>Training Data</quote> text field. This folder contains one file with |
| correct annotations and serves as gold standard data in our example.</para> |
| </listitem> |
| <listitem> |
| <para>Select the file <quote>Feature.ruta</quote> and drag and drop it to the <quote>Preprocess Script</quote> text field. This UIMA Ruta script knows all necessary types, especially the types |
| of the annotations we try to learn rules for, and additionally it contains rules that create useful annotations, which can be used by the algorithm in order to learn better rules.</para> |
| </listitem> |
| <listitem> |
| <para>Select the file <quote>InfoTypes.txt</quote> and drag and drop it to the <quote>Information Types</quote> list. This specifies the goal of the learning process, |
| that is, which types of annotations should be created by the induced rules.</para> |
| </listitem> |
| <listitem> |
| <para>Check the checkbox of the <quote>KEP</quote> algorithm and press the start button in the toolbar of the view.</para> |
| </listitem> |
| <listitem> |
| <para>The algorithm now tries to induce rules for the targeted types. The current result is displayed in the view <quote>KEP Results</quote> in the right part of the perspective.</para> |
| </listitem> |
| <listitem> |
| <para>After the algorithm has finished the learning process, create a new UIMA Ruta file in the <quote>uima.ruta.example</quote> package and copy the content of the result view |
| to the new file. Now, the induced rules can be applied as a normal UIMA Ruta script file.</para> |
| </listitem> |
| </orderedlist> |
| </para> |
| </section> |
| <section id="section.tools.ruta.howto.html"> |
| <title>HTML annotations in plain text</title> |
| <para> |
| The following script provides an example of how to process HTML files with UIMA Ruta in order to get plain text documents |
| that still contain information about the HTML tags in the form of annotations. The analysis engine descriptor HtmlViewWriter is identical to the common ViewWriter, |
| but additionally specifies a type system. More information about the different options to configure the |
| conversion can be found <link linkend='ugr.tools.ruta.ae.htmlconverter'>here</link>. |
| </para> |
| <programlisting><![CDATA[PACKAGE uima.ruta.example; |
| |
| ENGINE utils.HtmlAnnotator; |
| ENGINE utils.HtmlConverter; |
| ENGINE HtmlViewWriter; |
| TYPESYSTEM utils.HtmlTypeSystem; |
| TYPESYSTEM utils.SourceDocumentInformation; |
| |
| Document{-> RETAINTYPE(SPACE,BREAK)}; |
| Document{-> EXEC(HtmlAnnotator)}; |
| |
| Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView", |
| "outputView" = "plain"), |
| EXEC(HtmlConverter)}; |
| |
| Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain", |
| "outputView" = "_InitialView", "output" = "/../converted/"), |
| EXEC(HtmlViewWriter)}; |
| ]]> </programlisting> |
| </section> |
| |
| <section id="section.tools.ruta.howto.sorter"> |
| <title>Sorting files with UIMA Ruta</title> |
| <para> |
| The following script provides an example of how to utilize UIMA Ruta for sorting files. |
| </para> |
| <programlisting><![CDATA[ENGINE utils.XMIWriter; |
| TYPESYSTEM utils.SourceDocumentInformation; |
| |
| DECLARE Pattern; |
| |
| // some rule-based pattern |
| (NUM SPECIAL NUM SPECIAL NUM){-> Pattern}; |
| |
| Document{CONTAINS(Pattern)->CONFIGURE(XMIWriter, |
| "Output" = "../with/"), EXEC(XMIWriter)}; |
| Document{-CONTAINS(Pattern)->CONFIGURE(XMIWriter, |
| "Output" = "../without/"), EXEC(XMIWriter)}; |
| ]]> </programlisting> |
| </section> |
| <section id="section.tools.ruta.howto.xml"> |
| <title>Converting XML documents with UIMA Ruta</title> |
| <para> |
| The following script provides an example of how to process XML files in order to retain only the text content. The removed XML elements should, however, be available as annotations. |
| This script can therefore be applied to create xmiCAS files from text documents annotated with XML tags. The analysis engine descriptor TEIViewWriter is identical to the common ViewWriter, |
| but additionally specifies a type system. |
| </para> |
| <programlisting><![CDATA[ENGINE utils.HtmlAnnotator; |
| TYPESYSTEM utils.HtmlTypeSystem; |
| ENGINE utils.HtmlConverter; |
| ENGINE TEIViewWriter; |
| TYPESYSTEM utils.SourceDocumentInformation; |
| |
| DECLARE PersName, LastName, FirstName, AddName; |
| |
| Document{->EXEC(HtmlAnnotator, {TAG})}; |
| Document{-> RETAINTYPE(MARKUP,SPACE)}; |
| TAG.name=="PERSNAME"{-> PersName}; |
| TAG.name=="SURNAME"{-> LastName}; |
| TAG.name=="FORENAME"{-> FirstName}; |
| TAG.name=="ADDNAME"{-> AddName}; |
| Document{-> RETAINTYPE}; |
| |
| Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView", |
| "outputView" = "plain", "skipWhitespaces" = false), |
| EXEC(HtmlConverter)}; |
| |
| Document{ -> CONFIGURE(TEIViewWriter, "inputView" = "plain", "outputView" = |
| "_InitialView", "output" = "/../converted/"), |
| EXEC(TEIViewWriter)}; |
| ]]> </programlisting> |
| </section> |
| </chapter> |