| Apache UIMA TikaAnnotator README file |
| |
| INTRODUCTION |
| |
| Apache Tika is a toolkit for detecting and extracting metadata |
| and structured text content from various documents |
| using existing parser libraries. |
| |
| TikaAnnotator uses Tika to extract the text and metadata of a document |
| and to generate annotations representing its original markup. |
| It consists of three resources (see /desc): |
| |
| - FileSystemCollectionReader : similar to the one in the UIMA examples, but uses |
| Tika to extract the text from binary documents and generates annotations |
| that represent the markup |
| - MarkupAnnotator : takes the original content from a view and generates |
| a new view containing the extracted text with markup annotations |
| - TikaWrapper : utility class that populates a CAS |
| from a binary document; used by the FileSystemCollectionReader |
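| |
| The idea of "extracted text with markup annotations" can be illustrated |
| with a small stand-alone sketch. It is not TikaAnnotator code and does not |
| use Tika or UIMA: it assumes the markup arrives as SAX events (Tika reports |
| document structure as XHTML SAX events) and uses the JDK's own SAX parser on |
| a hard-coded string as a stand-in. The names MarkupSpanSketch, Span, Result |
| and Collector are hypothetical. |

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, NOT the TikaAnnotator implementation: it collects the
// plain text of a markup document and records each element as a span of
// character offsets into that text, similar in spirit to a markup annotation.
class MarkupSpanSketch {

    /** One markup element, located by offsets in the extracted text. */
    static final class Span {
        final String name;
        final int begin;
        final int end; // exclusive

        Span(String name, int begin, int end) {
            this.name = name;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return name + "[" + begin + "," + end + ")";
        }
    }

    /** Extracted text plus the spans found in it. */
    static final class Result {
        final String text;
        final List<Span> spans;

        Result(String text, List<Span> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** SAX handler that accumulates text and remembers where elements open. */
    static final class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        final List<Span> spans = new ArrayList<>();
        private final Deque<Integer> begins = new ArrayDeque<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            begins.push(text.length()); // element starts at the current text offset
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            spans.add(new Span(qName, begins.pop(), text.length()));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses well-formed markup and returns its text and element spans. */
    static Result extract(String xhtml) {
        Collector collector = new Collector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
        return new Result(collector.text.toString(), collector.spans);
    }

    public static void main(String[] args) {
        Result r = extract("<html><body><p>Hello</p><p>World</p></body></html>");
        System.out.println(r.text);
        for (Span s : r.spans) {
            System.out.println(s);
        }
    }
}
```

| In this sketch the spans are listed in the order the elements close, so |
| inner elements appear before the elements that contain them. |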
| |
| VERSION |
| |
| This version wraps Tika 0.4, in which the Tika packaging |
| was split into several parts. |
| |
| The tika-core jar contains only the core client-visible classes and |
| interfaces and has zero dependencies beyond Java 5. All the actual |
| parser implementations and external parser dependencies are in the |
| tika-parsers jar. |
| |
| See http://lucene.apache.org/tika/gettingstarted.html for the full |
| details. |
| |
| COMPILATION |
| |
| Use the Ant script to compile the sources. |
| You first need to add the Tika jars to the /lib directory; |
| the Tika-*-standalone.jar is recommended because it |
| contains all the libraries used internally by Tika. |
| |
| For more information on UIMA, see: |
| http://incubator.apache.org/uima |
| |
| For more information on Tika, see: |
| http://incubator.apache.org/tika/ |