| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <chapter id="ugr.tools.uimafit.pipelines"> |
| <title>Pipelines</title> |
| <para>UIMA is a component-based architecture that allows composing various processing components |
| into a complex processing pipeline. A pipeline typically involves a <emphasis>collection |
| reader</emphasis> which ingests documents and <emphasis>analysis engines</emphasis> that do |
| the actual processing.</para> |
| <para>Normally, you would run a pipeline using a UIMA Collection Processing Engine or using UIMA |
| AS. uimaFIT offers a third alternative that is much simpler to use and well suited for embedding |
| UIMA pipelines into applications or for writing tests.</para> |
| <para>As uimaFIT does not supply any readers or processing components, we just assume that we have |
| written three components:</para> |
| <itemizedlist> |
| <listitem> |
| <para><classname>TextReader</classname> - reads text files from a directory</para> |
| </listitem> |
| <listitem> |
| <para><classname>Tokenizer</classname> - annotates tokens</para> |
| </listitem> |
| <listitem> |
| <para><classname>TokenFrequencyWriter</classname> - writes a list of tokens and their |
| frequency to a file</para> |
| </listitem> |
| </itemizedlist> |
| <para>We create descriptors for all components and run them as a pipeline:</para> |
| <programlisting>CollectionReaderDescription reader = |
| CollectionReaderFactory.createReaderDescription( |
| TextReader.class, |
| TextReader.PARAM_INPUT, "/home/uimafit/documents"); |
| |
| AnalysisEngineDescription tokenizer = |
| AnalysisEngineFactory.createEngineDescription( |
| Tokenizer.class); |
| |
| AnalysisEngineDescription tokenFrequencyWriter = |
| AnalysisEngineFactory.createEngineDescription( |
| TokenFrequencyWriter.class, |
| TokenFrequencyWriter.PARAM_OUTPUT, "counts.txt"); |
| |
| SimplePipeline.runPipeline(reader, tokenizer, writer);</programlisting> |
| <para>Instead of running the full pipeline end-to-end, we can also process one document at a time |
| and inspect the analysis results:</para> |
| <programlisting>CollectionReaderDescription reader = |
| CollectionReaderFactory.createReaderDescription( |
| TextReader.class, |
| TextReader.PARAM_INPUT, "/home/uimafit/documents"); |
| |
| AnalysisEngineDescription tokenizer = |
| AnalysisEngineFactory.createEngineDescription( |
| Tokenizer.class); |
| |
| for (JCas jcas : SimplePipeline.iteratePipeline(reader, tokenizer)) { |
| System.out.printf("Found %d tokens%n", |
| JCasUtil.select(jcas, Token.class).size()); |
| }</programlisting> |
| </chapter> |