--------------------------------
Getting Started with Apache Tika
--------------------------------
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
Getting Started with Apache Tika
This document describes how to build Apache Tika from sources and
how to start using Tika in an application.
Getting and building the sources
To build Tika from sources you first need to either
{{{../download.html}download}} a source release or
{{{../source-repository.html}check out}} the latest sources from
version control.
Once you have the sources, you can build them using the
{{{http://maven.apache.org/}Maven 2}} build system. Executing the
following command in the base directory will build the sources
and install the resulting artifacts in your local Maven repository.
---
mvn install
---
See the Maven documentation for more information about the available
build options.
Note that you need Java 5 or higher to build Tika.
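Before starting a build, it can help to confirm that the required tools are on the PATH. The following is a minimal shell sketch and not part of Tika; the helper name missing_tools is made up for this example, and it only checks that the commands exist, not which Java version is installed.

```shell
# Print the name of each given command that is not available on PATH.
# (missing_tools is a hypothetical helper, not something Tika or Maven provides.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: run the build only if both prerequisites are present.
#   missing=$(missing_tools java mvn)
#   [ -z "$missing" ] && mvn install
```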
Build artifacts
The Tika 0.8 build consists of a number of components and produces
the following main binaries:
[tika-core/target/tika-core-0.8.jar]
Tika core library. Contains the core interfaces and classes of Tika,
but none of the parser implementations. Depends only on Java 5.
[tika-parsers/target/tika-parsers-0.8.jar]
Tika parsers. Collection of classes that implement the Tika Parser
interface based on various external parser libraries.
[tika-app/target/tika-app-0.8.jar]
Tika application. Combines the above libraries and all the external
parser libraries into a single runnable jar with a GUI and a command
line interface.
[tika-bundle/target/tika-bundle-0.8.jar]
Tika bundle. An OSGi bundle that includes everything you need to use all
Tika functionality in an OSGi environment.
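As a quick sanity check after a build, the artifact paths listed above can be tested for existence. This is only a sketch under the assumption that the build layout matches the list above; missing_artifacts is a hypothetical helper name, and the source root and version are passed in as arguments.

```shell
# Print the path of each expected main artifact that is missing.
# $1 = root of the Tika source tree, $2 = Tika version (for example 0.8).
# (missing_artifacts is a hypothetical helper used for illustration.)
missing_artifacts() {
    root=$1
    version=$2
    for module in tika-core tika-parsers tika-app tika-bundle; do
        jar="$root/$module/target/$module-$version.jar"
        [ -f "$jar" ] || printf '%s\n' "$jar"
    done
}

# Usage, from the base directory of the sources:
#   missing_artifacts . 0.8
```

An empty result means all four jars were found.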
Using Tika as a Maven dependency
The core library, tika-core, contains the key interfaces and classes of Tika
and can be used by itself if you don't need the full set of parsers from
the tika-parsers component. The tika-core dependency looks like this:
---
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>0.8</version>
</dependency>
---
If you want to use Tika to parse documents (instead of simply detecting
document types, etc.), you'll want to depend on tika-parsers instead:
---
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>0.8</version>
</dependency>
---
Note that adding this dependency will introduce a number of
transitive dependencies to your project, including one on tika-core.
You need to make sure that these dependencies won't conflict with your
existing project dependencies. The listing below shows all the
compile-scope dependencies of tika-parsers in the Tika 0.8 release.
---
org.apache.tika:tika-parsers:bundle:0.8
+- org.apache.tika:tika-core:jar:0.8:compile
+- org.apache.commons:commons-compress:jar:1.0:compile
+- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
|  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
|  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
+- org.apache.poi:poi:jar:3.6:compile
+- org.apache.poi:poi-scratchpad:jar:3.6:compile
+- org.apache.poi:poi-ooxml:jar:3.6:compile
|  +- org.apache.poi:poi-ooxml-schemas:jar:3.6:compile
|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  \- dom4j:dom4j:jar:1.6.1:compile
|     \- xml-apis:xml-apis:jar:1.0.b2:compile
+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
+- asm:asm:jar:3.1:compile
+- log4j:log4j:jar:1.2.14:compile
\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
---
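To compare a listing like the one above against your own project's dependencies, it can be flattened into plain coordinates. The sketch below is an illustration rather than part of Tika: the flatten_tree helper is hypothetical, and it assumes the input has the same shape as the listing above, optionally with Maven's [INFO] prefix on each line.

```shell
# Flatten a dependency tree read from standard input into sorted
# "groupId:artifactId version" pairs, one per line.
# (flatten_tree is a hypothetical helper used for illustration.)
flatten_tree() {
    # Drop an optional "[INFO] " prefix, then the tree-drawing characters.
    sed -e 's/^\[INFO\] *//' -e 's/^[-+|\\ ]*//' \
    | awk -F: 'NF >= 4 && $0 !~ / / { print $1 ":" $2 " " $4 }' \
    | sort -u
}

# Usage, in a Maven project directory:
#   mvn dependency:tree -Dscope=compile | flatten_tree
```

Running this for both tika-parsers and your own project gives two lists in which the same groupId:artifactId with different versions points to a potential conflict.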
Using Tika in an Ant project
Unless you use a dependency management tool like
{{{http://ant.apache.org/ivy/}Apache Ivy}}, you need to include the Tika
jar files and their dependencies individually to use Tika in your application.
---
<classpath>
  ... <!-- your other classpath entries -->
  <pathelement location="path/to/tika-core-0.8.jar"/>
  <pathelement location="path/to/tika-parsers-0.8.jar"/>
  <pathelement location="path/to/commons-logging-1.1.1.jar"/>
  <pathelement location="path/to/commons-compress-1.0.jar"/>
  <pathelement location="path/to/pdfbox-0.8.0-incubating.jar"/>
  <pathelement location="path/to/fontbox-0.8.0-incubator.jar"/>
  <pathelement location="path/to/jempbox-0.8.0-incubator.jar"/>
  <pathelement location="path/to/poi-3.6.jar"/>
  <pathelement location="path/to/poi-scratchpad-3.6.jar"/>
  <pathelement location="path/to/poi-ooxml-3.6.jar"/>
  <pathelement location="path/to/poi-ooxml-schemas-3.6.jar"/>
  <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
  <pathelement location="path/to/dom4j-1.6.1.jar"/>
  <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
  <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.1.jar"/>
  <pathelement location="path/to/tagsoup-1.2.jar"/>
  <pathelement location="path/to/asm-3.1.jar"/>
  <pathelement location="path/to/log4j-1.2.14.jar"/>
  <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
</classpath>
---
An easy way to gather all these libraries is to run
<<<mvn dependency:copy-dependencies>>> in the tika-parsers source directory.
This copies all Tika dependencies to the <<<target/dependency>>>
directory.
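Once the jars are collected in one directory, they can also be joined into a classpath string for use outside Ant. The following is a minimal sketch, assuming a Unix-style ":" path separator; build_classpath is a hypothetical helper name, and the directory is whatever location the jars were copied to.

```shell
# Join every jar in the given directory into a ":"-separated classpath.
# (build_classpath is a hypothetical helper used for illustration.)
build_classpath() {
    dir=$1
    cp=
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue      # the directory contains no jars
        cp=${cp:+$cp:}$jar
    done
    printf '%s\n' "$cp"
}

# Usage (YourMainClass stands in for your own application class):
#   java -cp "$(build_classpath path/to/jars)" YourMainClass
```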
Alternatively, you can simply add the entire tika-app jar to your
classpath to get all of the above dependencies in a single archive.
Using Tika as a command line utility
The Tika application jar (tika-app-0.8.jar) can be used as a command
line utility for extracting text content and metadata from all sorts of
files. This runnable jar contains all the dependencies it needs, so
you don't need to worry about classpath settings to run it.
The usage instructions are shown below.
---
usage: java -jar tika-app-0.8.jar [option] [file]

Options:
    -? or --help       Print this usage message
    -v or --verbose    Print debug level messages
    -g or --gui        Start the Apache Tika GUI
    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

Description:
    Apache Tika will parse the file(s) specified on the
    command line and output the extracted text content
    or metadata to standard output.

    Instead of a file name you can also specify the URL
    of a document to be parsed.

    If no file name or URL is specified (or the special
    name "-" is used), then the standard input stream
    is parsed.

    Use the "--gui" (or "-g") option to start
    the Apache Tika GUI. You can drag and drop files
    from a normal file explorer to the GUI window to
    extract text content and metadata from the files.
---
You can also use the jar as a component in a Unix pipeline or
as an external tool in many scripting languages.
---
# Check if an Internet resource contains a specific keyword
curl http://.../document.doc \
| java -jar tika-app-0.8.jar --text \
| grep -q keyword
---
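The same pattern extends to batch jobs. As a rough sketch, the script below extracts plain text from every file in a directory using the --text option from the usage message. The tika and extract_all function names are hypothetical, and the jar is assumed to be in the current directory; adjust the path to match your setup.

```shell
# Wrapper around the Tika application jar.
# Assumption: tika-app-0.8.jar is in the current directory.
tika() {
    java -jar tika-app-0.8.jar "$@"
}

# Extract plain text from every regular file in $1 and write the results
# to $2 as <original name>.txt. Failures are reported on standard error.
extract_all() {
    src=$1
    out=$2
    mkdir -p "$out" || return 1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        tika --text "$f" > "$out/$(basename "$f").txt" || echo "failed: $f" >&2
    done
}

# Usage:
#   extract_all path/to/documents path/to/text-output
```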