 Apache UIMA TikaAnnotator README file

 INTRODUCTION

 Apache Tika is a toolkit for detecting and extracting metadata
 and structured text content from various documents
 using existing parser libraries.

 TikaAnnotator uses Tika to extract the text and metadata of a document
 and to generate annotations representing its original markup.
 It consists of three resources (see /desc):

 - FileSystemCollectionReader : similar to the one in the UIMA examples, but
   uses Tika to extract the text from binary documents and generates
   annotations to represent the markup
 - MarkupAnnotator : takes the original content from a view and generates
   a new view containing the extracted text with markup annotations
 - TikaWrapper : utility class that populates a CAS from a binary
   document; used by the FileSystemCollectionReader
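
 The idea behind the markup annotations can be sketched without Tika or
 UIMA on the classpath. The example below is only an illustration, not the
 TikaAnnotator code: it assumes the markup arrives as SAX events (Tika
 reports parsed content as XHTML SAX events), and the names MarkupSketch,
 Handler, Span and extract are hypothetical. It collects the character
 data as plain text and records, for each element, the begin and end
 offsets it covers in that text, which is roughly the information a
 markup annotation would carry.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupSketch {

    /** Hypothetical stand-in for a markup annotation: element name plus text offsets. */
    public static final class Span {
        public final String name;
        public final int begin;
        public final int end;

        Span(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }
    }

    /** Accumulates character data and records one Span per closed element. */
    public static final class Handler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final List<Span> spans = new ArrayList<>();
        private final Deque<Integer> starts = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Remember where in the extracted text this element begins.
            starts.push(text.length());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // The element covers everything appended since its start offset.
            spans.add(new Span(qName, starts.pop(), text.length()));
        }

        public String text() {
            return text.toString();
        }

        public List<Span> spans() {
            return spans;
        }
    }

    /** Parses well-formed markup and returns the handler holding text and spans. */
    public static Handler extract(String markup) throws Exception {
        Handler handler = new Handler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(markup)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        Handler h = extract("<html><body><p>Hello <b>Tika</b></p></body></html>");
        System.out.println(h.text());
        for (Span s : h.spans()) {
            System.out.println(s.name + " [" + s.begin + "," + s.end + ")");
        }
    }
}
```

 Spans are recorded when an element closes, so inner elements appear in
 the list before the elements that enclose them. In a real annotator the
 text would go into a CAS view and each span would become an annotation
 over that view.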

 VERSION

 This version wraps Tika 0.4, in which the Tika packaging was split
 into several parts.

 The tika-core jar contains only the core client-visible classes and
 interfaces and has no dependencies beyond Java 5. All the actual
 parser implementations and external parser dependencies are in the
 tika-parsers jar.

 See http://lucene.apache.org/tika/gettingstarted.html for the full
 details.

 COMPILATION

 Use the Ant script to compile the sources.
 Before compiling, add the Tika jars to the /lib directory;
 the Tika-*-standalone.jar is recommended because it
 contains all the libraries used internally by Tika.

 For more information on UIMA, see:
   http://incubator.apache.org/uima

 For more information on Tika, see:
   http://incubator.apache.org/tika/