<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.tools.uimafit.pipelines">
<title>Pipelines</title>
<para>UIMA is a component-based architecture that allows composing various processing components
into a complex processing pipeline. A pipeline typically involves a <emphasis>collection
reader</emphasis> which ingests documents and <emphasis>analysis engines</emphasis> that do
the actual processing.</para>
<para>Normally, you would run a pipeline using a UIMA Collection Processing Engine or using UIMA
AS. uimaFIT offers a third alternative that is much simpler to use and well suited for embedding
UIMA pipelines into applications or for writing tests.</para>
<para>uimaFIT itself does not ship any readers or processing components, so the examples in this
chapter assume that we have written three components of our own:</para>
<itemizedlist>
<listitem>
<para><classname>TextReader</classname> - reads text files from a directory</para>
</listitem>
<listitem>
<para><classname>Tokenizer</classname> - annotates tokens</para>
</listitem>
<listitem>
<para><classname>TokenFrequencyWriter</classname> - writes a list of tokens and their
frequency to a file</para>
</listitem>
</itemizedlist>
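<para>To make the roles of these assumed components concrete, the following plain-Java sketch
shows the kind of logic a tokenizer and a frequency writer might contain. It deliberately does
not use the UIMA or uimaFIT APIs: the class <classname>TokenFrequencySketch</classname>, its
methods, and the naive whitespace-based tokenization are hypothetical choices made purely for
illustration. Real components would operate on a CAS and create annotations rather than work
on a plain string.</para>
<programlisting><![CDATA[
```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in, not part of uimaFIT: a plain-Java version of the logic
// that the assumed Tokenizer and TokenFrequencyWriter might implement.
class TokenFrequencySketch {

    // Splits on whitespace (a deliberately naive tokenizer) and counts each
    // lower-cased token.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted for stable output
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Renders one tab-separated "token count" line per token, as a writer
    // component might before storing the result in a file.
    static String format(Map<String, Integer> counts) {
        StringBuilder out = new StringBuilder();
        counts.forEach((token, n) -> out.append(token).append('\t').append(n).append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(countTokens("the cat saw the dog")));
    }
}
```
]]></programlisting>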
<para>We create descriptors for all components and run them as a pipeline:</para>
<programlisting>// Reader: ingests the text files found in the input directory
CollectionReaderDescription reader =
  CollectionReaderFactory.createReaderDescription(
    TextReader.class,
    TextReader.PARAM_INPUT, "/home/uimafit/documents");

// Analysis engine: annotates tokens
AnalysisEngineDescription tokenizer =
  AnalysisEngineFactory.createEngineDescription(
    Tokenizer.class);

// Analysis engine: writes the token frequencies to a file
AnalysisEngineDescription tokenFrequencyWriter =
  AnalysisEngineFactory.createEngineDescription(
    TokenFrequencyWriter.class,
    TokenFrequencyWriter.PARAM_OUTPUT, "counts.txt");

// Run the reader and then both engines, in the order given
SimplePipeline.runPipeline(reader, tokenizer, tokenFrequencyWriter);</programlisting>
<para>Instead of running the full pipeline end-to-end, we can also process one document at a time
and inspect the analysis results:</para>
<programlisting>CollectionReaderDescription reader =
  CollectionReaderFactory.createReaderDescription(
    TextReader.class,
    TextReader.PARAM_INPUT, "/home/uimafit/documents");

AnalysisEngineDescription tokenizer =
  AnalysisEngineFactory.createEngineDescription(
    Tokenizer.class);

for (JCas jcas : SimplePipeline.iteratePipeline(reader, tokenizer)) {
  System.out.printf("Found %d tokens%n",
    JCasUtil.select(jcas, Token.class).size());
}</programlisting>
</chapter>