blob: 88714e84de5d4b07f288e629207f846b067f3dc4 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<chapter id="ugr.async.camel.driver">
<title>Asynchronous Scaleout Camel Driver</title>
<section id="ugr.async.camel.driver.component.overview">
<title>Overview</title>
<para>
<ulink url="http://camel.apache.org">Apache Camel</ulink> is an integration framework based on
<ulink url="http://camel.apache.org/enterprise-integration-patterns.html">Enterprise Integration Patterns</ulink>
which uses routes for rule-based message routing and mediation. The camel project
has a large number of components which provide access to a wide variety of
technologies and are the building blocks of the routes. The Asynchronous Scaleout Camel Driver
is a component to integrate UIMA AS into Camel.
</para>
</section>
<section id="ugr.async.camel.driver.component">
<title>How does it work?</title>
<para>
The Asynchronous Scaleout Camel Driver sends the camel message body (without headers) to
a specified UIMA AS processing pipeline. Accessing the analysis results which
are written into the CAS is not possible from a camel route.
There are basically two usage scenarios. The Camel Driver can be used to drive the processing
of a UIMA AS cluster in which each server instance runs a cas multiplier to fetch the actual
document from a database. In this scenario the camel route only sends an ID of the document to
the cas multiplier which does the actual fetching of the document. In the second usage scenario
the Camel Driver is used to send a document in a one way fashion to a UIMA AS processing
pipeline which then takes care of processing it. In case an error occurs inside the processing
pipeline the exception is forwarded to camel and set on the message as response. Error handling
is described in the
<ulink url="http://camel.apache.org/error-handling-in-camel.html">Error handling in Camel</ulink>
documentation.
</para>
<para>
The camel driver expects a string message body; if it is not of the type string
it might be automatically converted by camel type converters. The string message
body is set as the CAS's document text. An Analysis Engine calls
CAS CAS.getDocumentText() to retrieve the string.
</para>
</section>
<section id="ugr.async.camel.driver.uri.format">
<title>URI Format</title>
<para>The Asynchronous Scaleout Camel Driver is configured with a configuration string. The
configuration string must contain the broker location and name of the JMS queue used to
communicate with UIMA AS. It has the following format
<programlisting>uimadriver:brokerURL?queue=nameofqueue&amp;CasPoolSize=n&amp;Timeout=t</programlisting>
which could for example be specified as
<programlisting>uimadriver:tcp://localhost:61616?queue=TextAnalysisQueue</programlisting>.
The CasPoolSize parameter is optional but if it is present n must be an integer
which is larger than zero, otherwise the UIMA AS default will be used.
The Timeout parameter is optional but if present t in milliseconds must zero or
larger.
</para>
</section>
<section id="ugr.async.camel.driver.sample">
<title>Sample</title>
<para>Camel enables a developer to create quickly all kinds of applications out of
existing and custom components. The sample demonstrates how UIMA AS is integrated
with other technologies. Readers who are new to camel should read
the <ulink url="http://camel.apache.org/book-getting-started.html">Getting Started</ulink>
chapter in the camel documentation.</para>
<para>
First a simple sample: A user wants to test a UIMA AS processing pipeline, sending it
a set of test documents to process. The plain text test documents
are located in a folder "/test-data". A camel route for this defined with
<ulink url="http://camel.apache.org/dsl.html">Java DSL</ulink> could look like this:
<programlisting>
from("file://test-data?noop=true").
to("uimadriver:tcp://localhost:61616?queue=TextAnalysisQueue");
</programlisting>
In the route above the file component sends a message for every file to the
uimadriver component. The message contains a reference to the file but not the
content of the file itself. The uimadriver component expects a message with string body as input.
An internal camel type converter will read in the bytes of the file, decode them
into characters with the default platform encoding and then create a string object
which is passed to the uimadriver component. The uimadriver then puts the
string into a CAS and sends it via the UIMA AS Client API to a processing pipeline.
Note that results from the returned CAS cannot be retrieved in a camel route.
</para>
<para>
A more complex sample. A web site has an area where people can upload pictures.
The pictures must be checked for appropriate content. The pictures are pushed to
the site via http, stored in a database and assigned to the human controllers to
classify them either as appropriate or non-appropriate. That is achieved with an
existing camel route and a servlet which receives the images and sends them to the camel route.
<programlisting>
from("direct:start").
to("imagewriter").
to("jms:queue:HumanPictureAnalysisQueue");
</programlisting>
The message containing the image is received by the direct:start endpoint,
the image is written to a database and replaced with a string identifier
by the "imagewriter" component, in the last step the camel jms component posts
the identifier on a JMS queue to notify the reception of a new image. The notification
is received by a client tool which the human controllers use to classify
an image.
</para>
<para>
To lessen the workload on the human reviewers, a system should automatically classify the pictures and
only assign questionable cases to human reviewers. The automatic classification is
done by an UIMA Analysis Engine. The AE can mark an image with one of three classes
appropriate, non-appropriate and unknown. In the case of unknown the AE is not confident
enough which of the first two classes is correct. To be scalable, the processing pipeline is hosted
by UIMA AS and contains three AEs, one to fetch the image from
the database, a classification AE and an AE to write the class of the image back to the database.
The first AE is typical a cas multiplier and receives a CAS which only contains the string identifier
but not the actual image. The cas muliplier uses the identifer to fetch the image from
the database and outputs a new CAS with the actual image. The Camel route blocks until the CAS is processed
by the following two AEs and depending on the class in the database the picture is assigned to a human
controller or not.
<programlisting>
from("direct:start").
to("imagewriter").
to("uimadriver:tcp://localhost:61616?queue=UimaPictureAnalysisQueue").
to("class-retriever").
// filters messages with class appropriate and non-appropriate
filter(header("picture-class").isEqualTo("unkown")).
to("jms:queue:HumanPictureAnalysisQueue");
</programlisting>
The first part is identical, after the imagewriter the string identifier is
send to the UIMA AS processing pipeline which writes the image class back to the database.
The class is retrieved with the custom class-retriever component and written to a message header
field, only if the class is unknown the image is assigned for human classification.
</para>
<para>Note: instead of using a CAS multiplier, a more straight-forward approach would use just one CAS,
having 2 views: one view would contain the string identifier of the image, and the other view would
have the image to be analyzed (or a reference to it in the DB).</para>
</section>
<section id="ugr.async.camel.driver.implementation">
<title>Implementation</title>
<para>The Asynchronous Scaleout Camel Driver is a typical camel component. The camel
documentation <ulink url="http://camel.apache.org/writing-components.html">
Writing Components</ulink> describes how camel components are
written. The source code can be found in the uimaj-as-camel project. The implementation
defines an asynchronous producer endpoint, which is implemented in the
<code>org.apache.uima.camel.UimaAsProducer</code> class. The <code>UimaAsProducer.process</code> method gets
the string body of the message, wraps it in a CAS object and sends it to UIMA AS.
Since the producer is asynchronous the camel message is registered with the reference id of
the sent CAS in an intermediate map, when the CAS comes back from UIMA AS, the camel message is looked up
with the reference id of the CAS and the processing of the camel message is completed.
For further details please read the <code>UimaAsProducer</code> implementation code.
</para>
</section>
</chapter>