uimafit-docbook/src/docbook/tools.uimafit.pipelines.xml - uima-uimafit - Git at Google

 <!--
 	Licensed to the Apache Software Foundation (ASF) under one
 	or more contributor license agreements. See the NOTICE file
 	distributed with this work for additional information
 	regarding copyright ownership. The ASF licenses this file
 	to you under the Apache License, Version 2.0 (the
 	"License"); you may not use this file except in compliance
 	with the License. You may obtain a copy of the License at

 	http://www.apache.org/licenses/LICENSE-2.0

 	Unless required by applicable law or agreed to in writing,
 	software distributed under the License is distributed on an
 	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 	KIND, either express or implied. See the License for the
 	specific language governing permissions and limitations
 	under the License.
 -->
 <chapter id="ugr.tools.uimafit.pipelines">
   <title>Pipelines</title>
   <para>UIMA is a component-based architecture that allows composing various processing components
     into a complex processing pipeline. A pipeline typically involves a <emphasis>collection
       reader</emphasis> which ingests documents and <emphasis>analysis engines</emphasis> that do
     the actual processing.</para>
   <para>Normally, you would run a pipeline using a UIMA Collection Processing Engine or using UIMA
     AS. uimaFIT offers a third alternative that is much simpler to use and well suited for embedding
     UIMA pipelines into applications or for writing tests.</para>
   <para>As uimaFIT does not supply any readers or processing components, we just assume that we have
     written three components:</para>
   <itemizedlist>
     <listitem>
       <para><classname>TextReader</classname> - reads text files from a directory</para>
     </listitem>
     <listitem>
       <para><classname>Tokenizer</classname> - annotates tokens</para>
     </listitem>
     <listitem>
       <para><classname>TokenFrequencyWriter</classname> - writes a list of tokens and their
         frequency to a file</para>
     </listitem>
   </itemizedlist>
   <para>We create descriptors for all components and run them as a pipeline:</para>
   <programlisting>CollectionReaderDescription reader =
   CollectionReaderFactory.createReaderDescription(
     TextReader.class,
     TextReader.PARAM_INPUT, "/home/uimafit/documents");

 AnalysisEngineDescription tokenizer =
   AnalysisEngineFactory.createEngineDescription(
     Tokenizer.class);

 AnalysisEngineDescription tokenFrequencyWriter =
   AnalysisEngineFactory.createEngineDescription(
     TokenFrequencyWriter.class,
     TokenFrequencyWriter.PARAM_OUTPUT, "counts.txt");

 SimplePipeline.runPipeline(reader, tokenizer, writer);</programlisting>
   <para>Instead of running the full pipeline end-to-end, we can also process one document at a time
     and inspect the analysis results:</para>
   <programlisting>CollectionReaderDescription reader =
   CollectionReaderFactory.createReaderDescription(
     TextReader.class,
     TextReader.PARAM_INPUT, "/home/uimafit/documents");

 AnalysisEngineDescription tokenizer =
   AnalysisEngineFactory.createEngineDescription(
     Tokenizer.class);

 for (JCas jcas : SimplePipeline.iteratePipeline(reader, tokenizer)) {
   System.out.printf("Found %d tokens%n",
     JCasUtil.select(jcas, Token.class).size());
 }</programlisting>
 </chapter>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<chapter id="ugr.tools.uimafit.pipelines">
	<title>Pipelines</title>
	<para>UIMA is a component-based architecture that allows composing various processing components
	into a complex processing pipeline. A pipeline typically involves a <emphasis>collection
	reader</emphasis> which ingests documents and <emphasis>analysis engines</emphasis> that do
	the actual processing.</para>
	<para>Normally, you would run a pipeline using a UIMA Collection Processing Engine or using UIMA
	AS. uimaFIT offers a third alternative that is much simpler to use and well suited for embedding
	UIMA pipelines into applications or for writing tests.</para>
	<para>As uimaFIT does not supply any readers or processing components, we just assume that we have
	written three components:</para>
	<itemizedlist>
	<listitem>
	<para><classname>TextReader</classname> - reads text files from a directory</para>
	</listitem>
	<listitem>
	<para><classname>Tokenizer</classname> - annotates tokens</para>
	</listitem>
	<listitem>
	<para><classname>TokenFrequencyWriter</classname> - writes a list of tokens and their
	frequency to a file</para>
	</listitem>
	</itemizedlist>
	<para>We create descriptors for all components and run them as a pipeline:</para>
	<programlisting>CollectionReaderDescription reader =
	CollectionReaderFactory.createReaderDescription(
	TextReader.class,
	TextReader.PARAM_INPUT, "/home/uimafit/documents");

	AnalysisEngineDescription tokenizer =
	AnalysisEngineFactory.createEngineDescription(
	Tokenizer.class);

	AnalysisEngineDescription tokenFrequencyWriter =
	AnalysisEngineFactory.createEngineDescription(
	TokenFrequencyWriter.class,
	TokenFrequencyWriter.PARAM_OUTPUT, "counts.txt");

	SimplePipeline.runPipeline(reader, tokenizer, writer);</programlisting>
	<para>Instead of running the full pipeline end-to-end, we can also process one document at a time
	and inspect the analysis results:</para>
	<programlisting>CollectionReaderDescription reader =
	CollectionReaderFactory.createReaderDescription(
	TextReader.class,
	TextReader.PARAM_INPUT, "/home/uimafit/documents");

	AnalysisEngineDescription tokenizer =
	AnalysisEngineFactory.createEngineDescription(
	Tokenizer.class);

	for (JCas jcas : SimplePipeline.iteratePipeline(reader, tokenizer)) {
	System.out.printf("Found %d tokens%n",
	JCasUtil.select(jcas, Token.class).size());
	}</programlisting>
	</chapter>