<?xml version="1.0" encoding="UTF-8"?>
<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements. See the NOTICE file distributed with this work for additional
information regarding copyright ownership. The ASF licenses this file to
you under the Apache License, Version 2.0 (the "License"); you may not use
this file except in compliance with the License. You may obtain a copy of
the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied. See the License for the specific
language governing permissions and limitations under the License. -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
<!ENTITY imgroot "images/" >
]>
<book lang="en">
<title>
Apache UIMA Lucene CAS Indexer Documentation
</title>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
href="../../target/docbook-shared/common_book_info.xml" />
<preface>
<title>Introduction</title>
<para>
The Lucene CAS Indexer (Lucas) is a UIMA CAS consumer that stores
CAS data in a Lucene index. Lucas allows you to exploit the
results of collection processing for information retrieval
purposes in a fast and flexible way. The consumer transforms
annotation objects from annotation indexes into Lucene token
objects and creates token streams from them. Token streams can be
further processed by token filters before they are stored in a
certain field of an index document. The mapping between UIMA
annotations and Lucene tokens, as well as the token filtering, is
configured by an XML file, whereas the index writer is configured
by a properties file.
</para>
<para>
To use Lucas, a mapping file must be created first. You have to
decide which annotation types should be present in the index and
what your index layout should look like, or more precisely, which
fields the index should contain. Optionally, you can add token
filters for further processing. It is also possible to deploy
your own token filters.
</para>
<para>
Lucas can run in multiple deployment scenarios where different
instances share one index writer. This shared index writer
instance is configured via a properties file and managed by the
resource manager.
</para>
</preface>
<chapter id="sandbox.luceneCasConsumer.mapping">
<title>Mapping Configuration</title>
<para>
This chapter discusses the mapping between UIMA annotations and
Lucene tokens in detail.
</para>
<section id="sandbox.luceneCasConsumer.mapping.tokenSources">
<title>Token Sources</title>
<para>
The mapping file describes the structure and contents of the
generated Lucene index. Each CAS in a collection is mapped to a
Lucene document. A Lucene document consists of fields, whereas a
CAS contains multiple annotation indexes on different sofas. An
annotation object can mark a text span, hold feature values, or
reference other feature structures. For instance, an annotation
created by an entity mapper marks a text area and may additionally
contain an identifier for the mapped entity. For this reason,
Lucas knows three different sources of Lucene token values:
</para>
<itemizedlist>
<listitem>
<para>
The covered text of an annotation object.
</para>
</listitem>
<listitem>
<para>
One or more feature values of an annotation object.
</para>
</listitem>
<listitem>
<para>
One or more feature values of a feature structure directly or
indirectly referenced by an annotation object.
</para>
</listitem>
</itemizedlist>
<para>
If a feature has multiple values, that is, it references an
FSArray instance, then one token is generated for each value. In
the same manner, tokens are generated from each feature value if
more than one feature is provided. Alternatively, you can provide
a
<emphasis>featureValueDelimiterString</emphasis>
which is used to concatenate the different feature values of one
annotation object into a single token. Each generated Lucene token
has the same offsets as the source annotation feature structure.
</para>
<section id="sandbox.luceneCasConsumer.mapping.types.coveredText">
<title>Covered Text</title>
<para>
As mentioned above, the text covered by annotation objects
represents one possible source for Lucene token values. The
following example creates an index with one
<emphasis>title</emphasis>
field which contains the covered texts of all token annotations
stored in the
<emphasis>title</emphasis>
sofa.
<programlisting><![CDATA[<fields>
  <field name="title" index="yes">
    <annotations>
      <annotation sofa="title" type="de.julielab.types.Token"/>
    </annotations>
  </field>
</fields>]]></programlisting>
</para>
</section>
<section id="sandbox.luceneCasConsumer.mapping.types.feature">
<title>Feature Values</title>
<para>
The feature values of annotation objects are another source
for
token values. Consider the example below.
<programlisting><![CDATA[<fields>
  <field name="cells" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
The field
<emphasis>cells</emphasis>
contains a token stream generated from the annotation index of type
<emphasis>de.julielab.types.Cell</emphasis>.
Each generated token will contain the value of the feature
<emphasis>specificType</emphasis>
of the enclosing annotation object.
</para>
<para>
The next example illustrates how multiple feature values can be
combined by using a
<emphasis>featureValueDelimiterString</emphasis>.
If no
<emphasis>featureValueDelimiterString</emphasis>
is provided, a single token is generated from each feature value.
<programlisting><![CDATA[<fields>
  <field name="authors" index="no" stored="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Author"
          featureValueDelimiterString=", ">
        <features>
          <feature name="firstname"/>
          <feature name="lastname"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
</para>
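Conceptually, the delimiter-based concatenation described above can be sketched in plain Java. This is an illustrative analogue only; the class and method names below are not part of Lucas.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch (not Lucas API): with featureValueDelimiterString=", ",
// the firstname and lastname values of one Author annotation collapse into a
// single token value instead of yielding one token per feature value.
class FeatureJoiner {
    static String join(List<String> featureValues, String delimiter) {
        return String.join(delimiter, featureValues);
    }
}
```

Without the delimiter string, each feature value would become its own token; with it, the values of one annotation are emitted as a single token.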
</section>
<section id="sandbox.luceneCasConsumer.mapping.types.featureStructures">
<title>Feature Values of referenced Feature Structures
</title>
<para>
Since annotation objects may reference other feature structures,
it may be desirable to use these feature structures as a source
for Lucene token values. To achieve this, we utilize feature paths
to address these feature structures. Consider the example below.
</para>
<beginpage />
<para>
<programlisting><![CDATA[<fields>
  <field name="cities" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Author"
          featurePath="affiliation.address">
        <features>
          <feature name="city"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
</para>
<para>
The type
<emphasis>de.julielab.types.Author</emphasis>
has a feature
<emphasis>affiliation</emphasis>
which points to an
<emphasis>affiliation</emphasis>
feature structure. This
<emphasis>affiliation</emphasis>
feature structure in turn has a feature
<emphasis>address</emphasis>
which points to an
<emphasis>address</emphasis>
feature structure. This path of references is expressed as the
feature path
<emphasis>affiliation.address</emphasis>.
A feature path consists of feature names separated by dots (".").
Note that the
<emphasis>city</emphasis>
feature is a feature of the "address" feature structure and not of
the
<emphasis>author</emphasis>
annotation object.
</para>
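The traversal a feature path performs can be illustrated with a toy model in plain Java. Here each feature structure is stood in for by a Map; the real implementation walks UIMA FeatureStructure references, and the class below is purely hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of feature-path resolution (not the Lucas implementation):
// a path such as "affiliation.address" is followed one feature name at a
// time, each step dereferencing the next feature structure.
class FeaturePathResolver {
    @SuppressWarnings("unchecked")
    static Object resolve(Map<String, Object> featureStructure, String featurePath) {
        Object current = featureStructure;
        for (String feature : featurePath.split("\\.")) {
            current = ((Map<String, Object>) current).get(feature);
        }
        return current;
    }
}
```

Applied to the example above, resolving "affiliation.address" on an author yields the address feature structure, whose "city" feature then serves as the token source.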
</section>
<section
id="sandbox.luceneCasConsumer.mapping.types.supportedFeatureTypes">
<title>Supported feature types
</title>
<para>
Currently, not all feature types are supported. Supported
feature types are the following:
</para>
<itemizedlist>
<listitem>
<para>String</para>
</listitem>
<listitem>
<para>String Array</para>
</listitem>
<listitem>
<para>Number Types: Double, Float, Long, Integer, Short
</para>
</listitem>
</itemizedlist>
<para>
Note that you need to provide a number format string if you want
to use number types.
</para>
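The number format string follows the java.text.DecimalFormat pattern syntax. A minimal sketch of the conversion, assuming a fixed locale to keep the decimal separator predictable (the class name here is illustrative, not part of Lucas):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Sketch of what a numberFormat string does: the pattern is handed to
// java.text.DecimalFormat, which renders a numeric feature value as the
// string that ends up in the index.
class NumberFeatureFormatter {
    static String format(double value, String pattern) {
        DecimalFormat fmt = new DecimalFormat(pattern,
                DecimalFormatSymbols.getInstance(Locale.ROOT));
        return fmt.format(value);
    }
}
```

Fixed-width patterns such as "00.00" are useful here because they keep the string order of indexed numbers consistent with their numeric order.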
</section>
</section>
<section id="sandbox.luceneCasConsumer.mapping.alignment">
<title>Token Stream Alignment</title>
<para>
In the examples above all defined Lucene fields contain only one
annotation based
token stream. There are a couple of reasons for the
fact that the simple mapping
of each annotation index to separate
Lucene fields is not an optimal
strategy.
One practical reason is that
the Lucene highlighting will not work for
scenarios
where more than
one annotation type is involved.
Additionally, the TF-IDF weighting
of terms does not work properly
if
annotations are separated from
their corresponding
text fragments.
Lucas is able to merge token
streams and align them according
to their
token offsets.
The resulting
merged token stream is then
stored in a
field.
The next example
demonstrates this merging feature.
<programlisting><![CDATA[<fields>
  <field name="text" index="yes" merge="true">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Token"/>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
Note the merge attribute of the field tag. It causes the alignment
of the two token streams generated from the
<emphasis>de.julielab.types.Token</emphasis>
and
<emphasis>de.julielab.types.Cell</emphasis>
annotations. If this attribute is set to false or is omitted, the
annotation token streams are concatenated.
</para>
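The effect of merge="true" can be sketched in plain Java: tokens from the participating annotation streams are interleaved by their begin offsets rather than appended stream after stream. The classes below are illustrative only, not Lucas internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal stand-in for a Lucene token: just a text and a begin offset.
class Tok {
    final String text;
    final int begin;
    Tok(String text, int begin) { this.text = text; this.begin = begin; }
}

// Sketch of offset-based alignment: both streams are combined and
// re-ordered by begin offset, which is what merging does conceptually
// (the real implementation also adjusts position increments).
class StreamMerger {
    static List<Tok> merge(List<Tok> streamA, List<Tok> streamB) {
        List<Tok> merged = new ArrayList<>(streamA);
        merged.addAll(streamB);
        merged.sort(Comparator.comparingInt(t -> t.begin)); // align by offset
        return merged;
    }
}
```

With concatenation, all tokens of the first stream would precede all tokens of the second regardless of where they occur in the text; alignment restores document order.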
</section>
<section id="sandbox.luceneCasConsumer.mapping.tokenfilters">
<title>
Token Filters
</title>
<para>
Token filters are the Lucene mechanism for operating on token
streams. In typical Lucene applications, token filters are
combined with a tokenizer to build analyzers. In a typical Lucas
application, the tokenization is already given by the annotation
indexes. Lucas allows you to apply token filters to certain
annotation token streams or to the merged or concatenated field
token stream as a whole. The following example demonstrates how
token filters are defined in the mapping file.
</para>
<programlisting><![CDATA[<fields>
  <field name="text" index="yes" merge="true">
    <filters>
      <filter name="lowercase"/>
    </filters>
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Token">
        <filters>
          <filter name="stopwords"
              filePath="resources/stopwords.txt"/>
        </filters>
      </annotation>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
<para>
The lowercase token filter is applied to the complete field
content, while the stopword filter is only applied to the
annotation token stream generated from the
de.julielab.types.Token annotation index. Both filters are
predefined filters included in the Lucas distribution. A reference
of all predefined token filters is given in
<xref linkend="sandbox.luceneCasConsumer.mapping.reference" />.
</para>
<section id="sandbox.luceneCasConsumer.mapping.tokenfilters.selfdefined">
<title>
Deploying your own Token Filters
</title>
<para>
For scenarios where the built-in token filters are not sufficient,
you can provide your own token filters. Simple token filters which
do not need any further parameterization are required to define a
public constructor which takes a token stream as its only
parameter. The next example shows how such a token filter is
referenced in the mapping file.
</para>
<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter className="org.example.MyFilter"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
<para>
The attribute
<emphasis>className</emphasis>
must reference the canonical class name of the filter. In cases
where the token filter has parameters, we need to provide a
factory for it. This factory must implement the
<emphasis>org.apache.uima.indexer.analysis.TokenFilterFactory</emphasis>
interface. This interface defines a method createTokenFilter which
takes a token stream and a java.util.Properties object as
parameters. The properties object will contain, as keys and
values, all additional attributes defined in the filter tag.
Consider the example below for a demonstration.
</para>
<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter factoryClassName="org.example.MyTokenFilterFactory"
              parameter1="value1" parameter2="value2"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
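The factory contract can be illustrated with a self-contained analogue in plain Java. The real interface is org.apache.uima.indexer.analysis.TokenFilterFactory and operates on Lucene TokenStreams; here a String-to-String operator stands in for a token filter, and all names are hypothetical.

```java
import java.util.Properties;
import java.util.function.UnaryOperator;

// Illustrative analogue of the factory contract (not the real interface):
// the factory receives the extra <filter> tag attributes as Properties
// and builds a configured filter from them.
interface SimpleFilterFactory {
    UnaryOperator<String> createTokenFilter(Properties properties);
}

// Hypothetical factory: reads a "prefix" parameter from the properties
// (as it would arrive from a prefix="..." attribute on the filter tag)
// and produces a filter that prepends it to each token text.
class PrefixFilterFactory implements SimpleFilterFactory {
    public UnaryOperator<String> createTokenFilter(Properties properties) {
        String prefix = properties.getProperty("prefix", "");
        return token -> prefix + token;
    }
}
```

The point of the indirection is that the factory, not the filter, sees the configuration attributes, so the same filter class can be instantiated with different parameters at different places in the mapping file.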
<para>
In the example above, the token filter factory is instantiated
anew for every occurrence in the mapping file. In scenarios where
token filters use large resources, this wastes memory and time. To
reuse a factory instance, we need to provide a name and a reuse
attribute. The example below demonstrates how a factory instance
can be reused.
</para>
<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter factoryClassName="org.example.MyTokenFilterFactory"
              name="myFactory" reuse="true"
              myResourceFilePath="pathToResource"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
</section>
</section>
<section id="sandbox.luceneCasConsumer.mapping.termcover">
<title>Defining term covers</title>
<para>
When defining a normal field in the ways described in the above
sections, the term set <emphasis>T</emphasis> resulting from the
processing defined by the
<emphasis>annotation</emphasis>
and
<emphasis>filter</emphasis>
elements is added in its entirety to the respective field. It is
possible to automatically distribute these terms onto multiple
dynamically created fields. Each term may be included in zero or
more fields. Which term is added to which field(s) is defined by a
<emphasis>termCoverDefinition</emphasis>
file. The idea is that the whole term set <emphasis>T</emphasis> is
<emphasis>covered</emphasis>
by several subsets <emphasis>S1,S2,...,SN</emphasis> where each
subset corresponds to a field for all terms in this subset. The
result is not necessarily a partition, that is, one term may be
accepted into multiple fields. Furthermore, to keep true to the
notion of a
<emphasis>cover</emphasis>,
terms which do not belong to any subset are not considered to
belong to the field definition at all and would be filtered out
anyway (as an assumption; this is theoretically motivated and has
no practical consequences here).
</para>
<para>
This mechanism is useful whenever the
<emphasis>token source</emphasis>
of a field emits tokens (and thus, eventually, terms) which the
user wishes to assign to different categories, expressing this
categorization by modelling the terms into one field per category.
As an example, consider a shop system. An annotation type
<emphasis>de.julielab.types.ArticleName</emphasis>
would be annotated in
<emphasis>CAS</emphasis>
objects. Among the text snippets annotated this way one would find
<emphasis>light bulb</emphasis>,
<emphasis>electric shaver</emphasis>
and
<emphasis>smartphone</emphasis>,
for example (the three terms are considered to be article names for
this example, even if they are chosen generally enough to be
categories themselves). The shop system would have, among others,
three article categories:
<emphasis>electronics</emphasis>,
<emphasis>sanitaryArticles</emphasis>
and
<emphasis>computers</emphasis>.
The goal is to assign the article names to the fields corresponding
to their category. Since this information is not given implicitly
by different annotation objects (the assumption is that there are
far too many categories, which could even change over time; this
would make maintaining the type system rather tedious), an explicit
definition must be delivered. This is achieved using a
<emphasis>termCoverDefinitionFile</emphasis>,
which is required to have the following format:
<informalequation>
<mediaobject>
<textobject>
<phrase>
&lt;term&gt;=&lt;S1&gt;|&lt;S2&gt;|...|&lt;SN&gt;
</phrase>
</textobject>
</mediaobject>
</informalequation>
That is, one term per line, with the categories of a term assigned
by an "=" sign and multiple categories separated by the "|"
character. An example file would read as follows:
<programlisting><![CDATA[light bulb=electronics
electric shaver=electronics|sanitaryArticles
smartphone=electronics|computers]]>
</programlisting>
</para>
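Reading this file format is straightforward; the sketch below shows one way it could be parsed in Java. The class is illustrative only, not part of Lucas.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of parsing the termCoverDefinitionFile format: one
// "<term>=<S1>|<S2>|...|<SN>" entry per line, mapping each term to the
// cover subsets (categories) it belongs to.
class TermCoverParser {
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> cover = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) continue; // skip malformed lines
            String term = line.substring(0, eq);
            // a term may belong to several cover subsets, separated by '|'
            cover.put(term, Arrays.asList(line.substring(eq + 1).split("\\|")));
        }
        return cover;
    }
}
```

Each resulting map entry tells the indexer which dynamically created category fields a term may enter.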
<para>
To create fields according to a cover set definition as described
above, the element
<emphasis>termSetCoverDefinition</emphasis>
is introduced into the
<emphasis>field</emphasis>
element. An example would look like this:
<programlisting><![CDATA[<fields>
<field name="articlecategory_" ...>
<termSetCoverDefinition coverDefinitionFile="pathToCoverDefinitionFile"
generateFieldNameMethod="append|prepend|replace"
ignoreCaseOfSelectedTerms="true|false" />
<annotations>
<annotation type="de.julielab.types.ArticleName" />
</annotations>
</field>
</fields>]]></programlisting>
Here,
<emphasis>pathToCoverDefinitionFile</emphasis>
points to a file as described above. The
<emphasis>generateFieldNameMethod</emphasis>
attribute takes one of
<emphasis>append</emphasis>
,
<emphasis>prepend</emphasis>
or
<emphasis>replace</emphasis>
. It defines how the dynamically created category fields are
named: the name is derived from the value of the
<emphasis>name</emphasis>
attribute of the
<emphasis>field</emphasis>
element by appending, prepending or replacing it with the
respective category name. If, in the above example,
<emphasis>append</emphasis>
would be used, the eventual field names would be
<emphasis>articlecategory_electronics</emphasis>
,
<emphasis>articlecategory_sanitaryArticles</emphasis>
and
<emphasis>articlecategory_computers</emphasis>
. Each field would only contain terms defined for it in the
<emphasis>termCoverDefinitionFile</emphasis>
. The attribute
<emphasis>ignoreCaseOfSelectedTerms</emphasis>
is used to switch on or off case normalization when checking whether
a particular term is allowed for a particular field. When switched
off, the term
<emphasis>smartphone</emphasis>
would be allowed for the fields
<emphasis>articlecategory_electronics</emphasis>
and
<emphasis>articlecategory_computers</emphasis>
while
<emphasis>SMARTPHONE</emphasis>
would not. Setting the attribute to
<emphasis>true</emphasis>
would lead to the acceptance of both variants into both fields. It
is not possible to set this parameter to different values for
different cover subset fields of the same cover.
</para>
</section>
</chapter>
<chapter id="sandbox.luceneCasConsumer.mapping.reference">
<title>Mapping File Reference</title>
<para>
After introducing the basic concepts and functions, this chapter
offers a complete reference of the mapping file elements.
</para>
<section id="sandbox.luceneCasConsumer.mapping.reference.structure">
<title>Mapping File Structure</title>
<para>
The raw mapping file structure is sketched below.
</para>
<programlisting><![CDATA[<fields>
<field ..>
<termSetCoverDefinition ../>
<filters>
<filter ../>
...
</filters>
<annotations>
<annotation ..>
<filters>
<filter ../>
...
</filters>
<features>
<feature ..>
...
</features>
</annotation>
...
</annotations>
</field>
...
</fields>]]></programlisting>
</section>
<section id="sandbox.luceneCasConsumer.mapping.reference.elements">
<title>Mapping File Elements</title>
<para>
This section describes the mapping file
elements and their
attributes.
</para>
<para>
<itemizedlist>
<listitem>
<para>
<emphasis>fields element</emphasis>
<itemizedlist>
<listitem>
<para>
fields container element
</para>
</listitem>
<listitem>
<para>
contains:
<code>field+</code>
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
<emphasis>field element</emphasis>
<itemizedlist>
<listitem>
<para>
describes a Lucene
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.html">field</ulink>
</para>
</listitem>
<listitem>
<para>
contains:
<code>termSetCoverDefinition?, filters?, annotations</code>
</para>
</listitem>
</itemizedlist>
</para>
<para>
<table>
<title>field element attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>name</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
the name of the
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.html">field</ulink>
</entry>
</row>
<row>
<entry>index</entry>
<entry>yes|no|no_norms|no_tf|no_norms_tf
</entry>
<entry>no</entry>
<entry>no</entry>
<entry>
See
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.Index.html">Field.Index</ulink>
</entry>
</row>
<row>
<entry>termVector</entry>
<entry>no|positions|offsets|positions_offsets
</entry>
<entry>no</entry>
<entry>no</entry>
<entry>
See
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.TermVector.html">Field.TermVector</ulink>
</entry>
</row>
<row>
<entry>stored</entry>
<entry>yes|no|compress</entry>
<entry>no</entry>
<entry>no</entry>
<entry>
See
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.Store">Field.Store</ulink>
</entry>
</row>
<row>
<entry>merge</entry>
<entry>boolean</entry>
<entry>false</entry>
<entry>no</entry>
<entry>If this attribute is set to true, all contained
annotation token streams are merged according to their
offsets. The token position increments are adjusted in the
case of overlapping tokens.
</entry>
</row>
<row>
<entry>unique</entry>
<entry>boolean</entry>
<entry>false</entry>
<entry>no</entry>
<entry>If this attribute is set to true, only one field
instance with this field's name will be added to the
resulting Lucene documents. This is required, e.g., by Apache
Solr for primary key fields. Do not define multiple fields
with the same name as unique; this would break the unique
property.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</listitem>
<listitem>
<para>
<emphasis>termSetCoverDefinition element</emphasis>
<itemizedlist>
<listitem>
<para>element to define the automatic distribution of terms
to multiple fields
</para>
</listitem>
<listitem>
<para>
contains:
<code>nothing</code>
</para>
</listitem>
</itemizedlist>
</para>
<para>
<table>
<title>termSetCoverDefinition element attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>coverDefinition
File</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>Path to a file defining the term to category
assignment (which term belongs to which cover subset).
</entry>
</row>
<row>
<entry>generateField
NameMethod</entry>
<entry>append|prepend|replace</entry>
<entry>append</entry>
<entry>no</entry>
<entry>Determines the name of the cover subset fields: the
subset (or category) name is appended or prepended to the
original field name, or replaces it completely.
</entry>
</row>
<row>
<entry>ignoreCaseOf
SelectedTerms</entry>
<entry>boolean</entry>
<entry>true</entry>
<entry>no</entry>
<entry>
For each subset field, there is a list of allowed term
values defined in the
<emphasis>coverDefinitionFile</emphasis>
. This parameter determines whether the case of term strings
is ignored for the membership check.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</listitem>
<listitem>
<para>
<emphasis>filters element</emphasis>
<itemizedlist>
<listitem>
<para>
container element for filters
</para>
</listitem>
<listitem>
<para>
contains:
<code>filter+</code>
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
<emphasis>filter element</emphasis>
<itemizedlist>
<listitem>
<para>
Describes a
<ulink
url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/TokenFilter.html">token filter</ulink>
instance.
Token filters can either be predefined or
self-provided.
</para>
</listitem>
</itemizedlist>
</para>
<para>
<table>
<title>filter element attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>name</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
the name to reference either a predefined filter (see
predefined filter reference)
or a reused filter
</entry>
</row>
<row>
<entry>className</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
The canonical class name of a token filter. The token
filter class must provide a single-argument constructor
which takes the token stream as its parameter.
</entry>
</row>
<row>
<entry>factoryClassName</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
The canonical class name of a token filter factory. The
token filter factory class must implement the
org.apache.uima.indexer.analysis.TokenFilterFactory
interface. See
<xref linkend="sandbox.luceneCasConsumer.mapping.tokenfilters" />
for an example.
</entry>
</row>
<row>
<entry>reuse</entry>
<entry>boolean</entry>
<entry>false</entry>
<entry>no</entry>
<entry>
Enables token filter factory reuse. This makes sense when a
token filter uses resources which should be cached. Because
token filters are referenced by their names, you also need
to provide a name.
</entry>
</row>
<row>
<entry>*</entry>
<entry>string</entry>
<entry>-</entry>
<entry>-</entry>
<entry>
Filters may have their own parameter attributes, which are
explained in
<xref linkend="sandbox.luceneCasConsumer.mapping.reference" />.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</listitem>
<listitem>
<para>
<emphasis>annotations element</emphasis>
<itemizedlist>
<listitem>
<para>
container element for annotations
</para>
</listitem>
<listitem>
<para>
contains:
<code>annotation+</code>
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
<emphasis>annotation element</emphasis>
<itemizedlist>
<listitem>
<para>
Describes a token stream which is generated from a CAS
annotation index.
</para>
</listitem>
<listitem>
<para>
contains:
<code>features?</code>
</para>
</listitem>
</itemizedlist>
</para>
<para>
<table>
<title>annotation element attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>type</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The canonical type name, e.g. "uima.tcas.Annotation".
</entry>
</row>
<row>
<entry>sofa</entry>
<entry>string</entry>
<entry>InitialView</entry>
<entry>no</entry>
<entry>
Determines from which sofa the annotation index is
taken
</entry>
</row>
<row>
<entry>featurePath</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
Allows addressing feature structures which are associated
with the annotation object. Feature names are separated by
a ".".
</entry>
</row>
<row>
<entry>tokenizer</entry>
<entry>cas|white_space|standard
</entry>
<entry>cas</entry>
<entry>no</entry>
<entry>
Determines which tokenization is used. "cas" uses the
tokenization given by the contained annotation token streams,
"white_space" uses the Lucene whitespace tokenizer, and
"standard" uses the
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html">standard tokenizer</ulink>.
</entry>
</row>
<row>
<entry>featureValueDelimiterString
</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
If this parameter is provided, all feature values of the
targeted feature structure are concatenated, delimited by
this string.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</listitem>
<listitem>
<para>
<emphasis>features element</emphasis>
<itemizedlist>
<listitem>
<para>
Container element for features.
</para>
</listitem>
<listitem>
<para>
contains:
<code>feature+</code>
</para>
</listitem>
</itemizedlist>
</para>
</listitem>
<listitem>
<para>
<emphasis>feature element</emphasis>
<itemizedlist>
<listitem>
<para>
Describes a certain feature of the addressed feature
structure. Values of this feature serve as the token
source.
</para>
</listitem>
</itemizedlist>
</para>
<para>
<table>
<title>feature element attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>name</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The feature name.
</entry>
</row>
<row>
<entry>numberFormat</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
Allows converting number features to strings. See
<ulink
url="http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html">DecimalFormat</ulink>.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</listitem>
</itemizedlist>
</para>
</section>
<section id="sandbox.luceneCasConsumer.mapping.reference.filters">
<title>
Filters Reference
</title>
<para>Lucas comes with a couple of predefined token filters.
This section provides a complete reference for these filters.
</para>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.addition">
<title>
Addition Filter
</title>
<para>Adds suffixes or prefixes to tokens.</para>
<programlisting><![CDATA[<filter name="addition" prefix="PRE_"/>]]></programlisting>
<para>
<table>
<title>addition filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>prefix</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
A prefix which is added to the front of each token.
</entry>
</row>
<row>
<entry>postfix</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
A postfix which is added to the end of each token.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.hypernyms">
<title>
Hypernyms Filter
</title>
<para>Adds hypernyms of a token with the same offset and
position
increment 0.
</para>
<programlisting><![CDATA[<filter name="hypernyms" filePath="/path/to/myFile.txt"/>]]></programlisting>
<para>
<table>
<title>hypernym filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>filePath</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The hypernym file path. Each line of the file contains one
token with its hypernyms. The file must have the following
format:
<code>TOKEN_TEXT=HYPERNYM1|HYPERNYM2|..</code>.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.position">
<title>
Position Filter
</title>
<para>Allows selecting only the first or the last token of a
token stream; all other tokens are discarded.
</para>
<programlisting><![CDATA[<filter name="position" position="last"/>]]></programlisting>
<para>
<table>
<title>position filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>position</entry>
<entry>first|last</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
If position is set to first, only the first token of the
underlying token stream is returned and all other tokens are
discarded. If position is set to last, only the last token is
returned.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.replace">
<title>
Replace Filter
</title>
<para>Replaces token texts.</para>
<programlisting><![CDATA[<filter name="replace" filePath="/path/to/myFile.txt"/>]]></programlisting>
<para>
<table>
<title>replace filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>filePath</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The token text replacement file path. Each line consists of the
original token text and its replacement and must have the
following format:
<code>TOKEN_TEXT=REPLACEMENT_TEXT</code>
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
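<para>
A replacement file in this format might look like the following
(the entries are illustrative):
</para>
<programlisting><![CDATA[colour=color
favourite=favorite
organisation=organization]]></programlisting>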
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.snowball">
<title>
Snowball Filter
</title>
<para>
Integration of the
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/snowball/SnowballFilter.html">Lucene snowball filter</ulink>
.
</para>
<programlisting><![CDATA[<filter name="snowball" stemmerName="German"/>]]></programlisting>
<para>
<table>
<title>snowball filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>stemmerName</entry>
<entry>snowball stemmer names</entry>
<entry>English</entry>
<entry>no</entry>
<entry>
See
<ulink
url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/snowball/SnowballFilter.html">snowball filter documentation</ulink>
.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.splitter">
<title>
Splitter Filter
</title>
<para>Splits tokens at a certain string.</para>
<programlisting><![CDATA[<filter name="splitter" splitString=","/>]]></programlisting>
<para>
<table>
<title>splitter filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>splitString</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The string on which tokens are split.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section id="sandbox.luceneCasConsumer.mapping.reference.filters.concat">
<title>
Concatenate Filter
</title>
<para>Concatenates token texts with a certain delimiter
string.
</para>
<programlisting><![CDATA[<filter name="concatenate" concatString=";"/>]]></programlisting>
<para>
<table>
<title>concatenate filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>concatString</entry>
<entry>string</entry>
<entry>-</entry>
<entry>yes</entry>
<entry>
The string with which token texts are concatenated.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.stopwords">
<title>
Stopword Filter
</title>
<para>
Integration of the
<ulink
url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/StopFilter.html">Lucene stop filter</ulink>
.
</para>
<programlisting><![CDATA[<filter name="stopwords" filePath="/path/to/myStopwords.txt"/>]]></programlisting>
<para>
<table>
<title>stopword filter attributes</title>
<tgroup cols="5">
<thead>
<row>
<entry>name</entry>
<entry>allowed values</entry>
<entry>default value</entry>
<entry>mandatory</entry>
<entry>description</entry>
</row>
</thead>
<tbody>
<row>
<entry>filePath</entry>
<entry>string</entry>
<entry>-</entry>
<entry>no</entry>
<entry>
The stopword file path. Each line of the file contains a
single stopword.
</entry>
</row>
<row>
<entry>ignoreCase</entry>
<entry>boolean</entry>
<entry>false</entry>
<entry>no</entry>
<entry>
Defines whether the stop filter ignores the case of
stopwords.
</entry>
</row>
</tbody>
</tgroup>
</table>
</para>
</section>
<section id="sandbox.luceneCasConsumer.mapping.reference.filters.unique">
<title>
Unique Filter
</title>
<para>Filters out tokens that share the same token text. The
resulting token stream contains only tokens with unique texts.
</para>
<programlisting><![CDATA[<filter name="unique"/>]]></programlisting>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.uppercase">
<title>
Upper Case Filter
</title>
<para>Turns the text of each token into upper case.</para>
<programlisting><![CDATA[<filter name="uppercase"/>]]></programlisting>
</section>
<section
id="sandbox.luceneCasConsumer.mapping.reference.filters.lowercase">
<title>
Lower Case Filter
</title>
<para>Turns the text of each token into lower case.</para>
<programlisting><![CDATA[<filter name="lowercase"/>]]></programlisting>
</section>
</section>
</chapter>
<chapter id="sandbox.luceneCasConsumer.indexwriter">
<title>Index Writer Configuration</title>
<para>
The index writer used by Lucas can be configured separately. To
allow Lucas to run in multiple deployment scenarios, different
Lucas instances can share one index writer instance. This is
handled by the resource manager. To configure the resource manager
and the index writer properly, the Lucas descriptor contains a
resource binding named
<code>IndexWriterProvider</code>
. An
<code>IndexWriterProvider</code>
creates an index writer from a properties file. The path and name
of this properties file must be set in the
<code>LucasIndexWriterProvider</code>
resource section of the descriptor.
</para>
<para>
The properties file can contain the following properties.
<itemizedlist>
<listitem>
<para>
<code>indexPath</code>
- the path to the index directory
</para>
</listitem>
<listitem>
<para>
<code>RAMBufferSize</code>
- (number value), see
<ulink
url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setRAMBufferSizeMB(double)">IndexWriter.ramBufferSize</ulink>
</para>
</listitem>
<listitem>
<para>
<code>useCompoundFileFormat</code>
- (boolean value), see
<ulink
url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)">IndexWriter.useCompoundFormat</ulink>
</para>
</listitem>
<listitem>
<para>
<code>maxFieldLength</code>
- (number value), see
<ulink
url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)">IndexWriter.maxFieldLength</ulink>
</para>
</listitem>
<listitem>
<para>
<code>uniqueIndex</code>
- (boolean value), if set to
<code>true</code>
, the host name and the process identifier are added to the index
name (only tested on Linux systems)
</para>
</listitem>
</itemizedlist>
</para>
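<para>
A minimal properties file using these settings might look like
this (the values shown are illustrative, not defaults):
</para>
<programlisting><![CDATA[indexPath=/path/to/index
RAMBufferSize=48
useCompoundFileFormat=true
maxFieldLength=10000
uniqueIndex=false]]></programlisting>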
</chapter>
<chapter id="sandbox.luceneCasConsumer.descriptor">
<title>Descriptor Parameters
</title>
<para>
Because Lucas is configured by the mapping file, the descriptor has
only one parameter:
<itemizedlist>
<listitem>
<para>
<code>mappingFile</code>
- the file path to the mapping file.
</para>
</listitem>
</itemizedlist>
</para>
</chapter>
<chapter id="sandbox.luceneCasConsumer.prospectiveSearch">
<title>Prospective Search</title>
<para>
Prospective search is a search method where a set of search
queries is given first and then matched against a stream of
documents. Each search query divides the document stream into a
sub-stream which contains only those documents that match the
query. Users usually define a number of search queries and then
subscribe to the resulting sub-streams. An example of prospective
search is a news feed which is monitored for certain terms; each
time a term occurs, a mail notification is sent.
</para>
<para>
The user must provide a set of search queries via a
<code>SearchQueryProvider</code>
. These search queries are then searched against the processed CAS
as defined in the mapping file; if a match occurs, a feature
structure is inserted into the CAS. Optionally, highlighting is
supported: annotations for the matching text areas are created and
linked to the feature structure.
</para>
<para>
The implementation uses the Lucene
<ulink
url="http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html">MemoryIndex</ulink>
, which is a fast, one-document, in-memory index. For performance
notes, please consult the javadoc of the
<code>MemoryIndex</code>
class.
</para>
<section
id="sandbox.luceneCasConsumer.prospectiveSearch.searchQueryProvider">
<title>Search Query Provider</title>
<para>
The Search Query Provider provides the Prospective Search Analysis
Engine with a set of search queries which should be monitored. A
search query is a combination of a Lucene query and an id. The id
is later needed to map a match back to a specific search query.
Since there is no standardized way to access the user's search
queries, the user must implement the
<code>SearchQueryProvider</code>
interface and configure the thread-safe implementation as a shared
resource object. An example of such an implementation could be a
search query provider which reads search queries from a database
or a web service.
</para>
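<para>
As an illustration, a provider implementation might read one query
per line from a file in the form <code>id=queryString</code>. The
sketch below is hypothetical: the class shown is not part of
Lucas, the actual methods required by the
<code>SearchQueryProvider</code>
interface are defined by its javadoc, and the parsing shown here
is only one possible design.
</para>
<programlisting><![CDATA[import java.io.*;
import java.util.*;

// Hypothetical sketch of a file-based query source; a real provider
// would additionally implement the SearchQueryProvider interface.
public class FileSearchQueryProvider {

  private final Map<Long, String> queries = new HashMap<Long, String>();

  public FileSearchQueryProvider(File queryFile) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(queryFile));
    String line;
    while ((line = reader.readLine()) != null) {
      // Each line pairs a numeric id with a Lucene query string.
      int sep = line.indexOf('=');
      if (sep > 0) {
        queries.put(Long.valueOf(line.substring(0, sep).trim()),
            line.substring(sep + 1).trim());
      }
    }
    reader.close();
  }

  // Returns the monitored queries keyed by their ids.
  public Map<Long, String> getQueries() {
    return Collections.unmodifiableMap(queries);
  }
}]]></programlisting>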
</section>
<section id="sandbox.luceneCasConsumer.prospectiveSearch.searchResults">
<title>Search Results</title>
<para>
The search results are written to the CAS: for each match, one
Search Result feature structure is inserted. The Search Result
feature structure contains the id and, optionally, links to an
array of annotations which mark the matching text in the CAS.
</para>
<para>
The Search Result type must be mapped to a defined type
in the
analysis engine descriptor with the following configuration
parameters:
<itemizedlist>
<listitem>
<para>
<code>String org.apache.uima.lucas.SearchResult</code>
- Maps the search result type to an actual type in the type
system.
</para>
</listitem>
<listitem>
<para>
<code>String org.apache.uima.lucas.SearchResultIdFeature</code>
- A long-valued id feature which identifies the matching search query.
</para>
</listitem>
<listitem>
<para>
<code>String org.apache.uima.lucas.SearchResulMatchingTextFeature
</code>
- An optional
<code>ArrayFS</code>
feature which links to annotations marking the matching text.
</para>
</listitem>
</itemizedlist>
</para>
</section>
</chapter>
</book>