| <?xml version="1.0" encoding="UTF-8"?> |
| |
| <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor |
| license agreements. See the NOTICE file distributed with this work for additional |
| information regarding copyright ownership. The ASF licenses this file to |
| you under the Apache License, Version 2.0 (the "License"); you may not use |
| this file except in compliance with the License. You may obtain a copy of |
| the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required |
| by applicable law or agreed to in writing, software distributed under the |
| License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS |
| OF ANY KIND, either express or implied. See the License for the specific |
| language governing permissions and limitations under the License. --> |
| |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" |
| "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [ |
| <!ENTITY imgroot "images/" > |
| ]> |
| <book lang="en"> |
| |
| <title> |
| Apache UIMA Lucene CAS Indexer Documentation |
| </title> |
| |
| <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" |
| href="../../target/docbook-shared/common_book_info.xml" /> |
| |
| <preface> |
| <title>Introduction</title> |
| <para> |
		The Lucene CAS Indexer (Lucas) is a UIMA CAS consumer that stores CAS
		data in a Lucene index. Lucas makes it possible to exploit the results
		of collection processing for information retrieval purposes in a fast
		and flexible way.
	</para>
	<para>
		The consumer transforms annotation objects from annotation indexes
		into Lucene token objects and creates token streams from them. Token
		streams can be further processed by token filters before they are
		stored in a certain field of an index document.
	</para>
	<para>
		The mapping between UIMA annotations and Lucene tokens, as well as
		the token filtering, is configured by an XML file, whereas the index
		writer is configured by a properties file.
| </para> |
| <para> |
		To use Lucas, a mapping file must first be created. You have to
		decide which annotation types should be present in the index and
		what your index layout should look like, or more precisely, which
		fields the index should contain. Optionally, you can add token
		filters for further processing. It is also possible to deploy your
		own token filters.
| </para> |
| <para> |
		Lucas can run in multiple deployment scenarios where different
		instances share one index writer. This shared index writer instance
		is configured via a properties file and managed by the resource
		manager.
| </para> |
| </preface> |
| |
| <chapter id="sandbox.luceneCasConsumer.mapping"> |
| <title>Mapping Configuration</title> |
| <para> |
| This chapter discusses the mapping between UIMA annotations and |
| Lucene tokens in detail. |
| </para> |
| <section id="sandbox.luceneCasConsumer.mapping.tokenSources"> |
| <title>Token Sources</title> |
| <para> |
			The mapping file describes the structure and contents of the
			generated Lucene index. Each CAS in a collection is mapped to a
			Lucene document. A Lucene document consists of fields, whereas a
			CAS contains multiple annotation indexes on different sofas. An
			annotation object can mark a text passage, hold feature values, or
			reference other feature structures. For instance, an annotation
			created by an entity mapper marks a text area and may additionally
			contain an identifier for the mapped entity. For this reason,
			Lucas knows three different sources of Lucene token values:
| </para> |
| <itemizedlist> |
| <listitem> |
| <para> |
					The covered text of an annotation object.
| </para> |
| </listitem> |
| <listitem> |
| <para> |
					One or more feature values of an annotation object.
| </para> |
| </listitem> |
| <listitem> |
| <para> |
					One or more feature values of a feature structure directly or
					indirectly referenced by an annotation object.
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
			If a feature has multiple values, that is, it references an FSArray
			instance, one token is generated for each value. In the same
			manner, tokens are generated from each feature value if more than
			one feature is provided. Alternatively, you can provide a
			<emphasis>featureValueDelimiterString</emphasis>
			which is used to concatenate the different feature values of one
			annotation object so that only one token is generated. Each
			generated Lucene token has the same offset as the source
			annotation feature structure.
| </para> |
| <section id="sandbox.luceneCasConsumer.mapping.types.coveredText"> |
| <title>Covered Text</title> |
| <para> |
				As mentioned above, the text covered by annotation objects
				represents one possible source for Lucene token values. The
				following example creates an index with one
				<emphasis>title</emphasis>
				field which contains the covered texts of all token annotations
				stored in the
				<emphasis>title</emphasis>
				sofa.
				<programlisting><![CDATA[<fields>
  <field name="title" index="yes">
    <annotations>
      <annotation sofa="title" type="de.julielab.types.Token"/>
    </annotations>
  </field>
</fields>]]></programlisting>
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.types.feature"> |
| <title>Feature Values</title> |
| <para> |
				The feature values of annotation objects are another source of
				token values. Consider the example below.
				<programlisting><![CDATA[<fields>
  <field name="cells" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
				The field
				<emphasis>cells</emphasis>
				contains a token stream generated from the annotation index of
				type
				<emphasis>de.julielab.types.Cell</emphasis>.
				Each generated token will contain the value of the feature
				<emphasis>specificType</emphasis>
				of the enclosing annotation object.
| </para> |
| <para> |
				The next example illustrates how multiple feature values can be
				combined by using a
				<emphasis>featureValueDelimiterString</emphasis>.
				If no
				<emphasis>featureValueDelimiterString</emphasis>
				is provided, a separate token is generated from each feature
				value.
				<programlisting><![CDATA[<fields>
  <field name="authors" index="no" stored="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Author"
                  featureValueDelimiterString=", ">
        <features>
          <feature name="firstname"/>
          <feature name="lastname"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.types.featureStructures"> |
			<title>Feature Values of Referenced Feature Structures</title>
| <para> |
				Since annotation objects may reference other feature structures,
				it may be desirable to use these feature structures as a source
				for Lucene token values. To achieve this, we use feature paths
				to address these feature structures. Consider the example below.
| </para> |
| |
| <beginpage /> |
| <para> |
				<programlisting><![CDATA[<fields>
  <field name="cities" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Author"
                  featurePath="affiliation.address">
        <features>
          <feature name="city"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
| </para> |
| <para> |
				The type
				<emphasis>de.julielab.types.Author</emphasis>
				has a feature
				<emphasis>affiliation</emphasis>
				which points to an
				<emphasis>affiliation</emphasis>
				feature structure. This
				<emphasis>affiliation</emphasis>
				feature structure in turn has a feature
				<emphasis>address</emphasis>
				which points to an
				<emphasis>address</emphasis>
				feature structure. This path of references is expressed as the
				feature path
				<emphasis>affiliation.address</emphasis>.
				A feature path consists of feature names separated by dots
				("."). Please note that the
				<emphasis>city</emphasis>
				feature is a feature of the "address" feature structure and not
				of the
				<emphasis>author</emphasis>
				annotation object.
| </para> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.types.supportedFeatureTypes"> |
			<title>Supported Feature Types</title>
| <para> |
| Currently, not all feature types are supported. Supported |
| feature types are the following: |
| </para> |
| <itemizedlist> |
| <listitem> |
| <para>String</para> |
| </listitem> |
| <listitem> |
| <para>String Array</para> |
| </listitem> |
| <listitem> |
| <para>Number Types: Double, Float, Long, Integer, Short |
| </para> |
| </listitem> |
| </itemizedlist> |
| <para> |
				Note that you need to provide a number format string if you
				want to use number types.
| </para> |
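			<para>
				As an illustration, a feature mapping for a numeric feature
				might look like the sketch below. Note that the attribute name
				<emphasis>numberFormat</emphasis>
				and the pattern syntax (a java.text.DecimalFormat pattern) are
				assumptions for this example; please check the feature element
				reference for the exact attribute supported by your Lucas
				version.
				<programlisting><![CDATA[<features>
  <!-- "confidence" is assumed to be a Float feature; the format
       string controls how its values are rendered as tokens -->
  <feature name="confidence" numberFormat="0.00"/>
</features>]]></programlisting>
			</para>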
| </section> |
| |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.alignment"> |
| <title>Token Stream Alignment</title> |
| <para> |
			In the examples above, each defined Lucene field contains only one
			annotation-based token stream. There are a couple of reasons why
			simply mapping each annotation index to a separate Lucene field is
			not an optimal strategy. One practical reason is that Lucene
			highlighting will not work in scenarios where more than one
			annotation type is involved. Additionally, the TF-IDF weighting of
			terms does not work properly if annotations are separated from
			their corresponding text fragments. Lucas is able to merge token
			streams and align them according to their token offsets. The
			resulting merged token stream is then stored in a field. The next
			example demonstrates this merging feature.
			<programlisting><![CDATA[<fields>
  <field name="text" index="yes" merge="true">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Token"/>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
			Note the merge attribute of the field tag. It causes the alignment
			of the two token streams generated from the
			<emphasis>de.julielab.types.Token</emphasis>
			and
			<emphasis>de.julielab.types.Cell</emphasis>
			annotations. If this attribute is set to false or left out, the
			annotation token streams are concatenated.
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.tokenfilters"> |
| <title> |
| Token Filters |
| </title> |
| <para> |
			Token filters are the Lucene approach to enable operations on
			token streams. In typical Lucene applications, token filters are
			combined with a tokenizer to build analyzers. In a typical Lucas
			application, the tokenization is already given by the annotation
			indexes. Lucas allows you to apply token filters to certain
			annotation token streams or to the merged or concatenated field
			token stream as a whole. The following example demonstrates how
			token filters are defined in the mapping file.
| </para> |
		<programlisting><![CDATA[<fields>
  <field name="text" index="yes" merge="true">
    <filters>
      <filter name="lowercase"/>
    </filters>
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Token">
        <filters>
          <filter name="stopwords"
                  filePath="resources/stopwords.txt"/>
        </filters>
      </annotation>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
| <para> |
			The lowercase token filter is applied to the complete field
			content, and the stopword filter is only applied to the annotation
			token stream generated from the de.julielab.types.Token annotation
			index. Both filters are predefined filters which are included in
			the Lucas distribution. A reference of all predefined token
			filters is given in
			<xref linkend="sandbox.luceneCasConsumer.mapping.reference" />.
| </para> |
| <section id="sandbox.luceneCasConsumer.mapping.tokenfilters.selfdefined"> |
| <title> |
| Deploying your own Token Filters |
| </title> |
| <para> |
				For scenarios where the built-in token filters are not
				sufficient, you can provide your own token filters. Simple
				token filters which do not need any further parameterization
				are required to define a public constructor which takes a
				token stream as its only parameter. The next example shows how
				such a token filter is referenced in the mapping file.
| </para> |
			<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter className="org.example.MyFilter"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
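			<para>
				A minimal sketch of such a filter class is shown below. It is
				based on the Lucene 2.4 TokenStream API; the class name,
				package, and the filtering behavior (dropping very short
				tokens) are purely illustrative.
				<programlisting><![CDATA[package org.example;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Illustrative token filter which drops all tokens shorter
 * than three characters.
 */
public class MyFilter extends TokenFilter {

  // The public single-argument constructor required by Lucas.
  public MyFilter(TokenStream input) {
    super(input);
  }

  public Token next(final Token reusableToken) throws IOException {
    for (Token token = input.next(reusableToken); token != null;
        token = input.next(reusableToken)) {
      if (token.termLength() >= 3) {
        return token;
      }
    }
    return null; // end of the token stream
  }
}]]></programlisting>
			</para>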
| <para> |
				The attribute
				<emphasis>className</emphasis>
				must reference the canonical class name of the filter. In cases
				where the token filter has parameters, we need to provide a
				factory for it. This factory must implement the
				<emphasis>org.apache.uima.indexer.analysis.TokenFilterFactory</emphasis>
				interface. This interface defines a method createTokenFilter
				which takes a token stream and a java.util.Properties object as
				parameters. The properties object will contain, as keys and
				values, all attributes which are additionally defined in the
				filter tag. Consider the example below for a demonstration.
| </para> |
			<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter factoryClassName="org.example.MyTokenFilterFactory"
                  parameter1="value1" parameter2="value2"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
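			<para>
				A corresponding factory might be sketched as follows. The
				return type of createTokenFilter is assumed to be TokenFilter
				here, and the parameter handling is purely illustrative;
				MyFilter stands for a filter class such as the
				org.example.MyFilter referenced above.
				<programlisting><![CDATA[package org.example;

import java.util.Properties;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.uima.indexer.analysis.TokenFilterFactory;

/**
 * Illustrative factory which reads the additional attributes of
 * the filter tag from the passed Properties object.
 */
public class MyTokenFilterFactory implements TokenFilterFactory {

  public TokenFilter createTokenFilter(TokenStream in,
      Properties properties) {
    // "parameter1" and "parameter2" from the filter tag arrive
    // as keys of the properties object.
    String parameter1 = properties.getProperty("parameter1");
    String parameter2 = properties.getProperty("parameter2");
    // Construct and configure the actual filter here.
    return new MyFilter(in);
  }
}]]></programlisting>
			</para>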
| <para> |
				In the example above, a new token filter factory instance is
				created for every occurrence in the mapping file. In scenarios
				where token filters use large resources, this wastes memory
				and time. To reuse a factory instance, we need to provide a
				name and a reuse attribute. The example below demonstrates how
				a factory instance can be reused.
| </para> |
			<programlisting><![CDATA[<fields>
  <field name="text" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Cell">
        <filters>
          <filter factoryClassName="org.example.MyTokenFilterFactory"
                  name="myFactory" reuse="true"
                  myResourceFilePath="pathToResource"/>
        </filters>
        <features>
          <feature name="specificType"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>]]></programlisting>
| </section> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.termcover"> |
		<title>Defining Term Covers</title>
| <para> |
			When defining a normal field in the ways described in the above
			sections, all terms of the term set <emphasis>T</emphasis>
			resulting from the processing defined by the
			<emphasis>annotation</emphasis>
			and
			<emphasis>filter</emphasis>
			elements are added to the respective field. It is also possible to
			automatically distribute these terms onto multiple dynamically
			created fields. Each term may be included in zero or more fields.
			Which term is to be added to which field(s) is defined by a
			<emphasis>termCoverDefinition</emphasis>
			file. The idea is that the whole term set <emphasis>T</emphasis>
			is
			<emphasis>covered</emphasis>
			by several subsets <emphasis>S1,S2,...,SN</emphasis>, where each
			subset corresponds to a field for all terms in this subset. The
			result is not necessarily a partition, that is, one term may be
			accepted into multiple fields. Furthermore, to keep true to the
			notion of a
			<emphasis>cover</emphasis>,
			terms which don't belong to any subset are not considered to
			belong to the field definition at all and would be filtered out
			anyway (as an assumption; this is theoretically motivated and has
			no practical consequences here).
| </para> |
| <para> |
			This mechanism is useful whenever the
			<emphasis>token source</emphasis>
			of a field emits tokens (and thus, eventually, terms) which the
			user wishes to assign to different categories, expressing this
			categorization by mapping the terms into one field per category.
			As an example, consider a shop system. An annotation type
			<emphasis>de.julielab.types.ArticleName</emphasis>
			would be annotated in
			<emphasis>CAS</emphasis>
			objects. Among the text snippets annotated this way one would find
			<emphasis>light bulb</emphasis>,
			<emphasis>electric shaver</emphasis>
			and
			<emphasis>smartphone</emphasis>,
			for example (the three terms are considered to be article names
			for this example, even if they are chosen generally enough to be
			categories themselves). The shop system would have, among others,
			three article categories:
			<emphasis>electronics</emphasis>,
			<emphasis>sanitaryArticles</emphasis>
			and
			<emphasis>computers</emphasis>.
			The goal is to assign the article names to the fields
			corresponding to their categories. Since this information is not
			given implicitly by different annotation objects (the assumption
			is that there are far too many categories, which could even change
			over time; this would make maintaining the type system rather
			tedious), an explicit definition must be delivered. This is
			achieved using a
			<emphasis>termCoverDefinitionFile</emphasis>.
			It must have the following format:
			<informalequation>
				<mediaobject>
					<textobject>
						<phrase>
							&lt;term&gt;=&lt;S1&gt;|&lt;S2&gt;|...|&lt;SN&gt;
						</phrase>
					</textobject>
				</mediaobject>
			</informalequation>
			That is, one term per line, the categories of a term assigned by a
			"=" sign and multiple categories separated by the "|" character.
			An example file would read as follows:
| <programlisting><![CDATA[light bulb=electronics |
| electric shaver=electronics|sanitaryArticles |
| smartphone=electronics|computers]]> |
| </programlisting> |
| </para> |
| <para> |
			To create fields according to a cover set definition as described
			above, the
			<emphasis>termSetCoverDefinition</emphasis>
			element is introduced into the
			<emphasis>field</emphasis>
			element. An example would look like this:
| <programlisting><![CDATA[<fields> |
| <field name="articlecategory_" ...> |
| <termSetCoverDefinition coverDefinitionFile="pathToCoverDefinitionFile" |
| generateFieldNameMethod="append|prepend|replace" |
| ignoreCaseOfSelectedTerms="true|false" /> |
| <annotations> |
| <annotation type="de.julielab.types.ArticleName" /> |
| </annotations> |
| </field> |
| </fields>]]></programlisting> |
			Here,
			<emphasis>pathToCoverDefinitionFile</emphasis>
			points to a file as described above. The
			<emphasis>generateFieldNameMethod</emphasis>
			attribute takes one of
			<emphasis>append</emphasis>,
			<emphasis>prepend</emphasis>
			or
			<emphasis>replace</emphasis>.
			It defines how the dynamically created category fields are named.
			The name is derived from the value of the
			<emphasis>name</emphasis>
			attribute of the
			<emphasis>field</emphasis>
			element by appending, prepending or replacing it with the
			respective category name. If, in the above example,
			<emphasis>append</emphasis>
			were used, the resulting field names would be
			<emphasis>articlecategory_electronics</emphasis>,
			<emphasis>articlecategory_sanitaryArticles</emphasis>
			and
			<emphasis>articlecategory_computers</emphasis>.
			Each field would only contain the terms defined for it in the
			<emphasis>termCoverDefinitionFile</emphasis>.
			The attribute
			<emphasis>ignoreCaseOfSelectedTerms</emphasis>
			switches case normalization on or off when checking whether a
			particular term is allowed for a particular field. When switched
			off, the term
			<emphasis>smartphone</emphasis>
			would be allowed for the fields
			<emphasis>articlecategory_electronics</emphasis>
			and
			<emphasis>articlecategory_computers</emphasis>
			while
			<emphasis>SMARTPHONE</emphasis>
			would not. Setting the attribute to
			<emphasis>true</emphasis>
			would lead to the acceptance of both variants into both fields. It
			is not possible to set this parameter to different values for
			different cover subset fields of the same cover.
| </para> |
| </section> |
| </chapter> |
| <chapter id="sandbox.luceneCasConsumer.mapping.reference"> |
| <title>Mapping File Reference</title> |
| <para> |
		After introducing the basic concepts and functions, this chapter
		offers a complete reference of the mapping file elements.
| </para> |
| <section id="sandbox.luceneCasConsumer.mapping.reference.structure"> |
| <title>Mapping File Structure</title> |
| <para> |
| The raw mapping file structure is sketched below. |
| </para> |
| <programlisting><![CDATA[<fields> |
| <field ..> |
| <termSetCoverDefinition ../> |
| <filters> |
| <filter ../> |
| ... |
| </filters> |
| |
| <annotations> |
| <annotation ..> |
| <filters> |
| <filter ../> |
| ... |
| </filters> |
| <features> |
| <feature ..> |
| ... |
| </features> |
| </annotation> |
| ... |
| </annotations> |
| </field> |
| ... |
| </fields>]]></programlisting> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.reference.elements"> |
| <title>Mapping File Elements</title> |
| <para> |
			This section describes the mapping file elements and their
			attributes.
| </para> |
| <para> |
| <itemizedlist> |
| <listitem> |
| <para> |
| <emphasis>fields element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| fields container element |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>field+</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>field element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| describes a Lucene |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.html">field</ulink> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>termSetCoverDefinition?, filters?, annotations</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <table> |
| <title>field element attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>name</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| the name of the |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.html">field</ulink> |
| </entry> |
| </row> |
| <row> |
| <entry>index</entry> |
| <entry>yes|no|no_norms|no_tf|no_norms_tf |
| </entry> |
| <entry>no</entry> |
| <entry>no</entry> |
| <entry> |
| See |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.Index.html">Field.Index</ulink> |
| </entry> |
| </row> |
| <row> |
| <entry>termVector</entry> |
| <entry>no|positions|offsets|positions_offsets |
| </entry> |
| <entry>no</entry> |
| <entry>no</entry> |
| <entry> |
| See |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.TermVector.html">Field.TermVector</ulink> |
| </entry> |
| </row> |
| <row> |
| <entry>stored</entry> |
| <entry>yes|no|compress</entry> |
| <entry>no</entry> |
| <entry>no</entry> |
| <entry> |
| See |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/document/Field.Store">Field.Store</ulink> |
| </entry> |
| </row> |
| <row> |
| <entry>merge</entry> |
| <entry>boolean</entry> |
| <entry>false</entry> |
| <entry>no</entry> |
											<entry>If this attribute is set to true, all contained
												annotation token streams are merged according to their
												offsets. The tokens' position increments are adjusted
												in the case of overlapping tokens.
											</entry>
| </row> |
| <row> |
| <entry>unique</entry> |
| <entry>boolean</entry> |
| <entry>false</entry> |
| <entry>no</entry> |
											<entry>If this attribute is set to true, only one field
												instance with this field's name will be added to the
												resulting Lucene documents. This is required, e.g., by
												Apache Solr for primary key fields. You must not define
												multiple fields with the same name as unique; this
												would break the unique property.
											</entry>
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>termSetCoverDefinition element</emphasis> |
| <itemizedlist> |
| <listitem> |
									<para>element to define the automatic distribution of terms
										onto multiple fields
									</para>
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>nothing</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <table> |
| <title>termSetCoverDefinition element attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
											<entry>coverDefinitionFile</entry>
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry>Path to a file defining the term to category |
| assignment (which term belongs to which cover subset). |
| </entry> |
| </row> |
| <row> |
											<entry>generateFieldNameMethod</entry>
| <entry>append|prepend|replace</entry> |
| <entry>append</entry> |
| <entry>no</entry> |
											<entry>Determines the name of the cover subset fields.
												The subset (or category) name is appended or prepended
												to the original field name, or replaces it completely.
											</entry>
| </row> |
| <row> |
											<entry>ignoreCaseOfSelectedTerms</entry>
| <entry>boolean</entry> |
| <entry>true</entry> |
| <entry>no</entry> |
											<entry>
												For each subset field, there is a list of allowed term
												values defined in the
												<emphasis>coverDefinitionFile</emphasis>.
												This parameter determines whether the case of term
												strings is ignored for the membership check or not.
											</entry>
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>filters element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| container element for filters |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>filter+</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>filter element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Describes a |
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/TokenFilter.html">token filter</ulink> |
| instance. |
| Token filters can either be predefined or |
| self-provided. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <table> |
| <title>filter element attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>name</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| the name to reference either a predefined filter (see |
| predefined filter reference) |
| or a reused filter |
| </entry> |
| </row> |
| <row> |
| <entry>className</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
											<entry>
												The canonical class name of a token filter. The token
												filter class must provide a single-argument constructor
												which takes the token stream as its parameter.
											</entry>
| </row> |
| <row> |
| <entry>factoryClassName</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
											<entry>
												The canonical class name of a token filter factory. The
												token filter factory class must implement
												org.apache.uima.indexer.analysis.TokenFilterFactory. See
												<xref linkend="sandbox.luceneCasConsumer.mapping.tokenfilters" />
												for an example.
											</entry>
| </row> |
| <row> |
| <entry>reuse</entry> |
| <entry>boolean</entry> |
											<entry>false</entry>
											<entry>no</entry>
											<entry>
												Enables token filter factory reuse. This makes sense
												when a token filter uses resources which should be
												cached. Because token filters are referenced by their
												names, you also need to provide a name.
											</entry>
| </row> |
| <row> |
| <entry>*</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>-</entry> |
											<entry>
												Filters may have their own parameter attributes, which
												are explained in
												<xref linkend="sandbox.luceneCasConsumer.mapping.reference" />.
											</entry>
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>annotations element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| container element for annotations |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>annotation+</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>annotation element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Describes a token stream which is generated from a CAS |
| annotation index. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>features?</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <table> |
| <title>annotation element attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>type</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The canonical type name. E.g. "uima.cas.Annotation" |
| </entry> |
| </row> |
| <row> |
| <entry>sofa</entry> |
| <entry>string</entry> |
| <entry>InitialView</entry> |
											<entry>no</entry>
| <entry> |
| Determines from which sofa the annotation index is |
| taken |
| </entry> |
| </row> |
| <row> |
| <entry>featurePath</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| Allows addressing feature structures which are
| associated with the annotation object. Features in the
| path are separated by a ".".
| </entry> |
| </row> |
| <row> |
| <entry>tokenizer</entry> |
| <entry>cas|white_space|standard |
| </entry> |
| <entry>cas</entry> |
| <entry>no</entry> |
| <entry> |
| Determines which tokenization is used. "cas" uses the
| tokenization given by the contained annotation token streams,
| "white_space" splits the text at whitespace, and "standard" uses the
| <ulink
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html">standard tokenizer</ulink>
| .
| </entry> |
| </row> |
| <row> |
| <entry>featureValueDelimiterString |
| </entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| If this parameter is provided, all feature values of
| the targeted feature structure are concatenated,
| delimited by this string.
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
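| <para>
| A minimal annotation element sketch, assuming a hypothetical
| type "org.example.Gene" with a feature "specificType" in the
| type system (the type and feature names are illustrative only):
| </para>
| <programlisting><![CDATA[<annotation type="org.example.Gene" sofa="InitialView" tokenizer="cas">
|   <features>
|     <feature name="specificType"/>
|   </features>
| </annotation>]]></programlisting>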
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>features element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Container element for features. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| contains: |
| <code>feature+</code> |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <emphasis>feature element</emphasis> |
| <itemizedlist> |
| <listitem> |
| <para> |
| Describes a certain feature of the addressed feature
| structure. Values of this feature serve as the token
| source.
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
| <para> |
| <table> |
| <title>feature element attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>name</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The feature name. |
| </entry> |
| </row> |
| <row> |
| <entry>numberFormat</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| Allows converting number features to strings. See
| <ulink
| url="http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html">DecimalFormat</ulink>
| .
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
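| <para>
| A feature element sketch which formats a hypothetical numeric
| "score" feature with two decimal places (the feature name and
| pattern are illustrative only):
| </para>
| <programlisting><![CDATA[<feature name="score" numberFormat="#0.00"/>]]></programlisting>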
| </listitem> |
| </itemizedlist> |
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.reference.filters"> |
| <title> |
| Filters Reference |
| </title> |
| <para>Lucas comes with a number of predefined token filters.
| This |
| section provides a complete |
| reference for these filters. |
| </para> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.addition"> |
| <title> |
| Addition Filter |
| </title> |
| <para>Adds suffixes or prefixes to tokens.</para> |
| <programlisting><![CDATA[<filter name="addition" prefix="PRE_"/>]]></programlisting> |
| <para> |
| <table> |
| <title>addition filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>prefix</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| A prefix which is added to the front of each token.
| </entry> |
| </row> |
| <row> |
| <entry>postfix</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| A postfix which is added to the end of each token.
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.hypernyms"> |
| <title> |
| Hypernyms Filter |
| </title> |
| <para>Adds the hypernyms of a token with the same offsets
| and a position
| increment of 0.
| </para> |
| <programlisting><![CDATA[<filter name="hypernyms" filePath="/path/to/myFile.txt"/>]]></programlisting> |
| <para> |
| <table> |
| <title>hypernym filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>filePath</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The hypernym file path. Each line of the file contains one |
| token |
| with its hypernyms. |
| The file must have the following |
| format: |
| <code>TOKEN_TEXT=HYPERNYM1|HYPERNYM2|.. |
| </code> |
| . |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
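| <para>
| An example hypernym file in this format might look as follows
| (the entries are illustrative only):
| </para>
| <programlisting><![CDATA[car=vehicle|machine
| rose=flower|plant]]></programlisting>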
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.position"> |
| <title> |
| Position Filter |
| </title> |
| <para>Allows selecting only the first or the last token of a
| token
| stream; all other tokens are discarded.
| </para> |
| <programlisting><![CDATA[<filter name="position" position="last"/>]]></programlisting> |
| <para> |
| <table> |
| <title>position filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>position</entry> |
| <entry>first|last</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| If position is set to "first", only the first token
| of the underlying token stream is returned and
| all other tokens
| are
| discarded. If position is set to "last", only the
| last
| token is returned.
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.replace"> |
| <title> |
| Replace Filter |
| </title> |
| <para>Allows replacing token texts.</para>
| <programlisting><![CDATA[<filter name="replace" filePath="/path/to/myFile.txt"/>]]></programlisting> |
| <para> |
| <table> |
| <title>replace filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>filePath</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The token text replacement file path. Each line consists of |
| the |
| original token text and |
| the replacement and must have the |
| following format: |
| <code> |
| TOKEN_TEXT=REPLACEMENT_TEXT |
| </code> |
| . |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
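| <para>
| An example replacement file in this format might look as follows
| (the entries are illustrative only):
| </para>
| <programlisting><![CDATA[colour=color
| behaviour=behavior]]></programlisting>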
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.snowball"> |
| <title> |
| Snowball Filter |
| </title> |
| <para> |
| Integration of the |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/snowball/SnowballFilter.html">Lucene snowball filter</ulink> |
| .
| </para>
| <programlisting><![CDATA[<filter name="snowball" stemmerName="German"/>]]></programlisting> |
| <para> |
| <table> |
| <title>snowball filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>stemmerName</entry> |
| <entry>snowball stemmer names</entry>
| <entry>English</entry> |
| <entry>no</entry> |
| <entry> |
| See |
| <ulink |
| url="http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/snowball/SnowballFilter.html">snowball filter documentation</ulink> |
| . |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.splitter"> |
| <title> |
| Splitter Filter |
| </title> |
| <para>Splits tokens at a certain string.</para> |
| <programlisting><![CDATA[<filter name="splitter" splitString=","/>]]></programlisting> |
| <para> |
| <table> |
| <title>splitter filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>splitString</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The string on which tokens are split. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.reference.filters.concat"> |
| <title> |
| Concatenate Filter |
| </title> |
| <para>Concatenates token texts with a certain delimiter |
| string. |
| </para> |
| <programlisting><![CDATA[<filter name="concatenate" concatString=";"/>]]></programlisting> |
| <para> |
| <table> |
| <title>concatenate filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>concatString</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>yes</entry> |
| <entry> |
| The string with which token texts are concatenated. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.stopwords"> |
| <title> |
| Stopword Filter |
| </title> |
| <para> |
| Integration of the |
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/analysis/StopFilter.html">Lucene stop filter</ulink> |
| .
| </para>
| <programlisting><![CDATA[<filter name="stopwords" filePath="/path/to/myStopwords.txt"/>]]></programlisting> |
| <para> |
| <table> |
| <title>stopword filter attributes</title> |
| <tgroup cols="5"> |
| <thead> |
| <row> |
| <entry>name</entry> |
| <entry>allowed values</entry> |
| <entry>default value</entry> |
| <entry>mandatory</entry> |
| <entry>description</entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry>filePath</entry> |
| <entry>string</entry> |
| <entry>-</entry> |
| <entry>no</entry> |
| <entry> |
| The stopword file path. Each line of the file contains a |
| single stopword. |
| </entry> |
| </row> |
| <row> |
| <entry>ignoreCase</entry> |
| <entry>boolean</entry> |
| <entry>false</entry> |
| <entry>no</entry> |
| <entry> |
| Defines if the stop filter ignores the case of stop |
| words. |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.mapping.reference.filters.unique"> |
| <title> |
| Unique Filter |
| </title> |
| <para>Filters out tokens with duplicate texts. The resulting
| token
| stream contains only tokens with unique texts.
| </para> |
| <programlisting><![CDATA[<filter name="unique"/>]]></programlisting> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.uppercase"> |
| <title> |
| Upper Case Filter |
| </title> |
| <para>Turns the text of each token into upper case.</para> |
| <programlisting><![CDATA[<filter name="uppercase"/>]]></programlisting> |
| </section> |
| <section |
| id="sandbox.luceneCasConsumer.mapping.reference.filters.lowercase"> |
| <title> |
| Lower Case Filter |
| </title> |
| <para>Turns the text of each token into lower case.</para> |
| <programlisting><![CDATA[<filter name="lowercase"/>]]></programlisting> |
| </section> |
| </section> |
| </chapter> |
| <chapter id="sandbox.luceneCasConsumer.indexwriter"> |
| <title>Index Writer Configuration</title> |
| <para> |
| The index writer used by Lucas can be configured separately. To allow
| Lucas to run in
| multiple deployment scenarios, different Lucas
| instances can share one index writer
| instance. This is handled by the
| resource manager. To configure the resource manager and
| the index
| writer properly, the Lucas descriptor contains a resource binding for an
| <code>IndexWriterProvider</code>
| .
| An
| <code>IndexWriterProvider</code>
| creates an index writer from a properties
| file. The path of this properties file must be set in the
| <code>LucasIndexWriterProvider</code>
| resource
| section of the descriptor.
| </para> |
| <para> |
| The properties file can contain the following properties. |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>indexPath</code> |
| - the path to the index directory |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>RAMBufferSize</code> |
| - (number value), see |
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setRAMBufferSizeMB(double)">IndexWriter.ramBufferSize</ulink> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>useCompoundFileFormat</code> |
| - (boolean value), see |
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setUseCompoundFile(boolean)">IndexWriter.useCompoundFormat</ulink> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>maxFieldLength</code> |
| - (integer value), see
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexWriter.html#setMaxFieldLength(int)">IndexWriter.maxFieldLength</ulink> |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>uniqueIndex</code> |
| - (boolean value), if set to |
| <code>true</code> |
| , host name and process identifier are added to the index name. |
| (Only tested on Linux systems.)
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
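| <para>
| A properties file using these settings might look as follows
| (the path and values are illustrative only):
| </para>
| <programlisting><![CDATA[indexPath=/path/to/index
| RAMBufferSize=16
| useCompoundFileFormat=true
| maxFieldLength=10000
| uniqueIndex=false]]></programlisting>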
| </chapter> |
| <chapter id="sandbox.luceneCasConsumer.descriptor"> |
| <title>Descriptor Parameters |
| </title> |
| <para> |
| Because Lucas is configured by the mapping file, the descriptor has |
| only one parameter: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>mappingFile</code> |
| - the file path to the mapping file. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
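| <para>
| In the descriptor, this parameter is set like any other UIMA
| configuration parameter (the file path is illustrative only):
| </para>
| <programlisting><![CDATA[<nameValuePair>
|   <name>mappingFile</name>
|   <value>
|     <string>/path/to/mapping.xml</string>
|   </value>
| </nameValuePair>]]></programlisting>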
| </chapter> |
| <chapter id="sandbox.luceneCasConsumer.prospectiveSearch"> |
| <title>Prospective Search</title> |
| <para> |
| Prospective search is a search method where a set of search
| queries is given first and then searched against a stream of
| documents. A search query divides the document stream into a
| sub-stream which contains only those documents that match the
| query. Users usually define a number of search queries and then
| subscribe to the resulting sub-streams. An example of prospective
| search is a news feed which is monitored for certain terms; each
| time a term occurs, a mail notification is sent.
| </para> |
| <para> |
| The user must provide a set of search queries via a
| <code>SearchQueryProvider</code>
| . These search queries are then searched against the processed CAS
| as defined in the mapping file; if a match occurs, a feature
| structure is inserted into the CAS. Optionally, highlighting is
| supported: annotations for the matching text areas are created and
| linked to the feature structure.
| </para> |
| <para> |
| The implementation uses the Lucene |
| <ulink |
| url="http://lucene.apache.org/java/2_4_1/api/contrib-memory/org/apache/lucene/index/memory/MemoryIndex.html">MemoryIndex</ulink> |
| which is a fast in-memory index over a single document. For performance notes
| please consult the javadoc |
| of the |
| <code>MemoryIndex</code> |
| class. |
| </para> |
| <section |
| id="sandbox.luceneCasConsumer.prospectiveSearch.searchQueryProvider"> |
| <title>Search Query Provider</title> |
| <para> |
| The Search Query Provider supplies the Prospective Search Analysis
| Engine with the set of search queries which should be monitored. A
| search query is a combination of a Lucene query and an id; the id
| is later needed to map a match to a specific search query. Since
| there is no standardized way to access the search queries, the
| user must implement the
| <code>SearchQueryProvider</code>
| interface and configure the thread-safe implementation as a shared
| resource object. Such an implementation could, for example, read
| the search queries from a database or a web service.
| </para> |
| </section> |
| <section id="sandbox.luceneCasConsumer.prospectiveSearch.searchResults"> |
| <title>Search Results</title> |
| <para> |
| The search results are written to the CAS: for each match, one
| Search Result feature structure is inserted. The Search Result
| feature structure contains
| the id and optionally links to an
| array of annotations which mark the
| matching
| text in the CAS.
| </para> |
| <para> |
| The Search Result type must be mapped to a defined type |
| in the |
| analysis engine descriptor with the following configuration |
| parameters: |
| <itemizedlist> |
| <listitem> |
| <para> |
| <code>String org.apache.uima.lucas.SearchResult</code> |
| - Maps the search result type to an actual type in the type |
| system. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>String org.apache.uima.lucas.SearchResultIdFeature</code> |
| - A long id feature, identifies the matching search query. |
| </para> |
| </listitem> |
| <listitem> |
| <para> |
| <code>String org.apache.uima.lucas.SearchResulMatchingTextFeature |
| </code> |
| - An optional |
| <code>ArrayFS</code> |
| feature, links to annotations which mark the matching text. |
| </para> |
| </listitem> |
| </itemizedlist> |
| </para> |
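| <para>
| A sketch of the corresponding parameter settings, assuming a
| hypothetical type "org.example.Match" with a long feature
| "queryId" in the type system:
| </para>
| <programlisting><![CDATA[<nameValuePair>
|   <name>org.apache.uima.lucas.SearchResult</name>
|   <value>
|     <string>org.example.Match</string>
|   </value>
| </nameValuePair>
| <nameValuePair>
|   <name>org.apache.uima.lucas.SearchResultIdFeature</name>
|   <value>
|     <string>queryId</string>
|   </value>
| </nameValuePair>]]></programlisting>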
| </section> |
| </chapter> |
| </book> |