 Apache Solr UIMA Metadata Extraction Library

 Introduction
 ------------
 This module can be used both as an UpdateRequestProcessor while indexing documents and as a set of tokenizers/filters
 configured in schema.xml for use during the analysis phase.
 The purpose of UIMAUpdateRequestProcessor is to add automatically generated fields to the Solr index on the fly.
 Such fields could be language, concepts, keywords, sentences, named entities, etc.
 UIMA-based tokenizers/filters can be used either in plain Lucene or as index/query analyzers defined
 in the schema.xml of a Solr core to create/filter tokens using specific UIMA annotations.


 Getting Started
 ---------------
 To start using the Solr UIMA Metadata Extraction Library, go through the following configuration steps:

 1. copy the generated solr-uima jar and its libs (under contrib/uima/lib) into a Solr libraries directory,
    or set <lib/> tags in solrconfig.xml to point to those jar files:

    <lib dir="../../contrib/uima/lib" />
    <lib dir="../../contrib/uima/lucene-libs" />
    <lib dir="../../dist/" regex="solr-uima-\d.*\.jar" />

 2. modify your schema.xml, adding the fields that will hold the metadata and specifying proper values for the type, indexed, stored and multiValued options.

    For example, you could specify the following:

   <field name="language" type="string" indexed="true" stored="true" required="false"/>
   <field name="concept" type="string" indexed="true" stored="true" multiValued="true" required="false"/>
   <field name="sentence" type="text" indexed="true" stored="true" multiValued="true" required="false" />

 3. modify your solrconfig.xml adding the following snippet:

   <updateRequestProcessorChain name="uima">
     <processor class="org.apache.solr.uima.processor.UIMAUpdateRequestProcessorFactory">
       <lst name="uimaConfig">
         <lst name="runtimeParameters">
           <str name="keyword_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="concept_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="lang_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="cat_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="entities_apikey">VALID_ALCHEMYAPI_KEY</str>
           <str name="oc_licenseID">VALID_OPENCALAIS_KEY</str>
         </lst>
         <str name="analysisEngine">/org/apache/uima/desc/OverridingParamsExtServicesAE.xml</str>
         <!-- Set to true if you want to continue indexing even if text processing fails.
              Default is false, in which case Solr throws a RuntimeException and
              none of the documents in your session are indexed. -->
         <bool name="ignoreErrors">true</bool>
         <!-- This is optional. It is used for logging when text processing fails.
              If logField is not specified, uniqueKey will be used as logField.
         <str name="logField">id</str>
         -->
         <lst name="analyzeFields">
           <bool name="merge">false</bool>
           <arr name="fields">
             <str>text</str>
           </arr>
         </lst>
         <lst name="fieldMappings">
           <lst name="type">
             <str name="name">org.apache.uima.alchemy.ts.concept.ConceptFS</str>
             <lst name="mapping">
               <str name="feature">text</str>
               <str name="field">concept</str>
             </lst>
           </lst>
           <lst name="type">
             <str name="name">org.apache.uima.alchemy.ts.language.LanguageFS</str>
             <lst name="mapping">
               <str name="feature">language</str>
               <str name="field">language</str>
             </lst>
           </lst>
           <lst name="type">
             <str name="name">org.apache.uima.SentenceAnnotation</str>
             <lst name="mapping">
               <str name="feature">coveredText</str>
               <str name="field">sentence</str>
             </lst>
           </lst>
         </lst>
       </lst>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory" />
     <processor class="solr.RunUpdateProcessorFactory" />
   </updateRequestProcessorChain>

    where VALID_ALCHEMYAPI_KEY is your AlchemyAPI Access Key. You need to register for an AlchemyAPI Access
    Key to use the AlchemyAPI services: http://www.alchemyapi.com/api/register.html

    where VALID_OPENCALAIS_KEY is your Calais Service Key. You need to register for a Calais Service
    Key to use the Calais services: http://www.opencalais.com/apikey

    analysisEngine must be the path of an AE (Analysis Engine) descriptor available on the classpath

    analyzeFields must list the input fields that need to be analyzed by UIMA;
    if merge=true, their content is merged and analyzed only once

    fieldMappings describes which features of which UIMA types should go into which Solr field
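
    As an illustration of the merge option, the following sketch analyzes two input fields in a single pass.
    The field names "title" and "text" are assumptions made for this example; use fields defined in your own schema.xml:

   <lst name="analyzeFields">
     <!-- merge=true: the content of the listed fields is merged and analyzed only once -->
     <bool name="merge">true</bool>
     <arr name="fields">
       <!-- hypothetical field names, replace with your own -->
       <str>title</str>
       <str>text</str>
     </arr>
   </lst>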

 4. in your solrconfig.xml, replace the existing default (<requestHandler name="/update"...) or create a new UpdateRequestHandler with the following:
   <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.processor">uima</str>
     </lst>
   </requestHandler>

 Once you're done with the configuration, you can index documents, which will be automatically enriched with the specified fields.
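
 For instance, a document like the following could be posted to the /update handler from step 4. This is a minimal sketch:
 it assumes the schema's uniqueKey field is named "id" and that "text" is the field listed under analyzeFields; the values
 are made up for illustration.

   <add>
     <doc>
       <!-- "id" is assumed to be the uniqueKey of your schema -->
       <field name="id">doc-1</field>
       <!-- "text" is the input field analyzed by UIMA in the step 3 snippet -->
       <field name="text">Apache Solr is an open source enterprise search platform. It is built on Apache Lucene.</field>
     </doc>
   </add>

 With the configuration from step 3 and valid API keys, the indexed document should additionally carry values in the
 language, concept and sentence fields.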