Apache UIMA TikaAnnotator README file
INTRODUCTION
Apache Tika is a toolkit for detecting and extracting metadata
and structured text content from various documents
using existing parser libraries.
TikaAnnotator uses Tika to extract the text and metadata of a document
and to generate annotations representing its original markup.
It consists of three resources (see /desc):
- FileSystemCollectionReader : similar to the one in the UIMA examples, but uses
Tika to extract the text from binary documents and generates annotations
to represent the markup
- MarkupAnnotator : takes the original content from a view and generates
a new view containing the extracted text with markup annotations
- TikaWrapper : utility class that populates a CAS
from a binary document; used by the FileSystemCollectionReader
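The idea behind the markup annotations can be sketched as follows. This is
not the TikaAnnotator code: it is a minimal illustration that assumes the
parser reports content as XHTML SAX events (as Tika parsers do) and, to stay
self-contained, feeds a small XHTML string through the JDK SAX parser instead
of Tika. The names MarkupSketch, Markup and Collector are hypothetical, and
Markup only stands in for a UIMA annotation with begin/end offsets.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class MarkupSketch {

    /** Hypothetical stand-in for a UIMA markup annotation: element name plus text offsets. */
    static final class Markup {
        final String name;
        final int begin;
        int end;

        Markup(String name, int begin) {
            this.name = name;
            this.begin = begin;
        }

        @Override
        public String toString() {
            return name + "[" + begin + "," + end + ")";
        }
    }

    /** Accumulates character data as plain text and records one Markup per element. */
    static final class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<Markup> markups = new ArrayList<>();
        private final Deque<Markup> open = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            // The element starts at the current end of the extracted text.
            Markup m = new Markup(qName, text.length());
            markups.add(m); // kept in document order
            open.push(m);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            // Close the innermost open element at the current text offset.
            open.pop().end = text.length();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses an XHTML string (assumed to be what the parser emits) into text plus markups. */
    static Collector extract(String xhtml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return collector;
    }

    public static void main(String[] args) {
        Collector c = extract("<html><body><p>Hello <b>Tika</b></p></body></html>");
        System.out.println(c.text);
        System.out.println(c.markups);
    }
}
```

Running main prints the extracted text "Hello Tika" followed by
[html[0,10), body[0,10), p[0,10), b[6,10)]. In the real components the
offsets would be stored as annotations in a CAS view rather than in a list.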
VERSION
This version wraps Tika 0.4, in which the Tika packaging
was split into several parts.
The tika-core jar contains only the core client-visible classes and
interfaces and has zero dependencies beyond Java 5. All the actual
parser implementations and external parser dependencies are in the
tika-parsers jar.
See http://lucene.apache.org/tika/gettingstarted.html for the full
details.
COMPILATION
You can use the Ant script to compile the sources.
Note that you need to add the Tika jars to the /lib directory;
it is recommended to use the Tika-*-standalone.jar,
which contains all the libraries used internally by Tika.
For more information on UIMA, see:
http://incubator.apache.org/uima
For more information on Tika, see:
http://incubator.apache.org/tika/