⚠️ PRE-RELEASE STATUS: Apache Tika gRPC Server is currently in development and has not been officially released yet. It will first be available in Tika 4.0.0. Until then, Docker images must be built from source code (see “Building from Development Branches” below).
This repo is used to create convenience Docker images for Apache Tika Grpc Server published as apache/tika-grpc on DockerHub by the Apache Tika Dev team.
Once Tika 4.0.0 is released, the images will provide a functional Apache Tika Grpc Server instance, based on the latest Ubuntu and Java 17 LTS, running the appropriate server version on port 50052.
There is a minimal version, which contains only Apache Tika and its core dependencies, and a full version, which also includes dependencies for the GDAL and Tesseract OCR parsers. To balance showing functionality versus the size of the full image, this file currently installs the language packs for the following languages:
To install more languages, update the apt-get command to include the package containing the language you require, or include your own custom packs using an ADD command.
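As a rough sketch of the apt-get change described above: Ubuntu packages Tesseract language data as tesseract-ocr-<code> (with underscores in the code written as hyphens), so the package list can be derived from language codes. The helper name tess_packages is hypothetical and the languages shown are only examples; check the package names against your base image before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: map Tesseract language codes to Ubuntu package names.
# Assumes the usual tesseract-ocr-<code> naming, with "_" written as "-".
tess_packages() {
    pkgs=""
    for code in "$@"; do
        pkg="tesseract-ocr-$(printf '%s' "$code" | tr '_' '-')"
        pkgs="${pkgs:+$pkgs }$pkg"
    done
    printf '%s\n' "$pkgs"
}

# Print the command you would place in the full Dockerfile's RUN step.
# Nothing is installed here; the command is only echoed.
echo "apt-get install -y --no-install-recommends $(tess_packages deu chi_sim)"
```

Printing the command rather than running it keeps the sketch safe to try outside a container.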
Below are the most recent 4.x series tags:
latest, 4.0.0: Apache Tika Server 4.0.0 (Minimal)
latest-full, 4.0.0-full: Apache Tika Server 4.0.0 (Full)
You can see a full set of tags for historical versions here.
You can pull down the version you would like using:
docker pull apache/tika-grpc:<tag>
Then to run the container, execute the following command:
docker run -d -p 127.0.0.1:50052:50052 apache/tika-grpc:<tag>
Where <tag> is the DockerHub tag corresponding to the Apache Tika Server version, e.g. 4.0.0 or 4.0.0-full.
NOTE: The latest and latest-full tags are explicitly set to the latest released version when they are published.
NOTE: In the example above, we recommend binding the server to localhost because Docker alters iptables and may expose your tika-server to the internet. If you are confident that your tika-server is on an isolated network you can simply run:
docker run -d -p 50052:50052 apache/tika-grpc:<tag>
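The difference between the two run commands above is only the host address in the -p argument. A minimal sketch that makes the default explicit follows; tika_run_cmd is a hypothetical helper that prints the command instead of executing it.

```shell
#!/bin/sh
# Print (do not execute) a docker run command for a given image tag.
# The optional second argument is the host address to bind. It defaults to
# 127.0.0.1 so the gRPC port is not reachable from other machines.
tika_run_cmd() {
    tag="$1"
    bind="${2:-127.0.0.1}"
    printf 'docker run -d -p %s:50052:50052 apache/tika-grpc:%s\n' "$bind" "$tag"
}

tika_run_cmd 4.0.0-full            # localhost only
tika_run_cmd 4.0.0-full 0.0.0.0    # all interfaces; isolated networks only
```

Once a container is running, docker port <container> lists the host address that was actually bound.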
From image versions 4.0.0 and 4.0.0-full onwards, it is easier to override the defaults and pass parameters to the running instance.
So for example if you wish to disable the OCR parser in the full image you could write a custom configuration:
cat <<EOT > tika-config.xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
EOT
Then, by mounting this custom configuration as a volume, you can pass the command line parameter to load it:
docker run -d -p 127.0.0.1:50052:50052 -v "$(pwd)/tika-config.xml:/tika-config.xml" apache/tika-grpc:4.0.0-full -c /tika-config.xml
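One thing to watch with the mount above: when the host path given to -v does not exist, Docker creates a directory there, and the server then finds a directory where it expected a file. The sketch below checks for the file first. The helper name config_mount_arg is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the bind-mount spec for a config file, or fail
# if the file is missing (Docker would otherwise create a directory there).
config_mount_arg() {
    file="$1"
    if [ ! -f "$file" ]; then
        echo "config file not found: $file" >&2
        return 1
    fi
    # Resolve to an absolute path, which -v expects for bind mounts.
    dir=$(cd "$(dirname "$file")" && pwd)
    printf '%s/%s:/tika-config.xml\n' "$dir" "$(basename "$file")"
}

# Usage (not executed here): only start the container if the check passes.
#   mount=$(config_mount_arg tika-config.xml) && \
#     docker run -d -p 127.0.0.1:50052:50052 -v "$mount" \
#       apache/tika-grpc:4.0.0-full -c /tika-config.xml
```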
You can see more configuration examples here.
You may want to do this to add optional components, such as the tika-eval metadata filter, or optional dependencies such as jai-imageio-jpeg2000 (check license compatibility first!).
There are a number of sample Docker Compose files included in the repo to allow you to test different scenarios.
These files use docker-compose 3.x series and include:
The Docker Compose files and configurations (sourced from sample-configs directory) all have comments in them so you can try different options, or use them as a base to create your own custom configuration.
N.B. You will want to create an environment variable (used in some bash scripts) matching the version of tika-docker you want to work with in the Docker Compose files, e.g. export TAG=4.0.0. Similarly, you should also consult .env, which is used in the docker-compose .yml files.
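To keep the exported TAG in step with .env, one option is to read the value from the file instead of typing it twice. This sketch assumes .env contains plain KEY=value lines and that it defines a TAG key; neither is confirmed by this README, and env_value is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY from a simple KEY=value file.
# Comment and blank lines never match the "KEY=" prefix. Quoting and
# variable expansion inside the file are not handled.
env_value() {
    key="$1"
    file="${2:-.env}"
    [ -f "$file" ] || return 1
    # If the key appears more than once, keep the last occurrence.
    sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Fall back to 4.0.0 (the example version used in this README) when .env
# is missing or does not define TAG.
TAG="$(env_value TAG 2>/dev/null || true)"
export TAG="${TAG:-4.0.0}"
echo "Using TAG=$TAG"
```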
You can install docker-compose from here.
Since tika-grpc has not been officially released yet, you must build from source code using the build-from-branch.sh script:
# Build from main branch (recommended for latest development)
./build-from-branch.sh -b main
# Build from a specific feature branch
./build-from-branch.sh -b TIKA-4578
# Build from your local tika directory (for rapid development)
./build-from-branch.sh -l /home/user/tika -t my-local-build
This will build the image from the selected source and, by default, tag it as apache/tika-grpc:<branch-name>.
Running your built image:
docker run -d -p 127.0.0.1:50052:50052 apache/tika-grpc:main
See the “Building from Development Branches” section below for complete documentation and options.
Once Tika 4.0.0 is officially released, you'll be able to build Docker images from GPG-signed Apache release artifacts using docker-tool.sh:
# Build from signed release (future - requires Tika 4.0.0+)
./docker-tool.sh build 4.0.0 4.0.0
./docker-tool.sh test 4.0.0
./docker-tool.sh publish 4.0.0 4.0.0
This will download tika-grpc-4.0.0.jar from Apache distribution mirrors and verify it against its GPG signature (.asc file).
Manual build from release (future):
docker build -t apache/tika-grpc:4.0.0 --build-arg TIKA_VERSION=4.0.0 - < minimal/Dockerfile
docker build -t apache/tika-grpc:4.0.0-full --build-arg TIKA_VERSION=4.0.0 - < full/Dockerfile
Note: The minimal/ and full/ Dockerfiles are prepared for future releases and will NOT work until tika-grpc-4.0.0.jar is published to Apache distribution mirrors.
For more information on Apache Tika Grpc Server, go to the Apache Tika Grpc Server documentation.
For more information on Apache Tika, go to the official Apache Tika project website.
To meet up with others using Apache Tika, consider coming to one of the Apache Tika Virtual Meetups.
For more information on the Apache Software Foundation, go to the Apache Software Foundation website.
For a full list of changes as of 4.0.0, visit CHANGES.md.
For our current release process, visit tika-docker Release Process.
Apache Tika Dev Team (dev@tika.apache.org)
For testing unreleased features or development branches, you can build Docker images directly from source:
# Build from main branch
./build-from-branch.sh -b main
# Build from a specific feature branch
./build-from-branch.sh -b TIKA-4578
# Build from your local tika checkout (for rapid development)
./build-from-branch.sh -l /home/user/tika -t my-local-build
./build-from-branch.sh [OPTIONS]

Options:
  -b BRANCH      Git branch or tag to build from (default: main)
  -r REPO        Git repository URL (default: https://github.com/apache/tika.git)
  -l LOCAL_DIR   Build from local tika directory instead of cloning
  -t TAG         Docker image tag (default: branch-name or 'local')
  -p             Push to Docker registry after building
  -h             Display this help message
Build from main branch:
./build-from-branch.sh -b main
Build from your local tika repository:
./build-from-branch.sh -l /home/user/source/tika -t my-test
Build from a fork and push to registry:
./build-from-branch.sh \
  -r https://github.com/yourusername/tika.git \
  -b my-feature \
  -t myregistry/tika-grpc:my-feature \
  -p
Note: Development builds compile from source and do NOT use GPG-signed releases. They are intended for development and testing only, not production use.
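Git branch names can contain characters, such as "/", that are not valid in a Docker tag. Tags are limited to letters, digits, "_", "." and "-", may not start with "." or "-", and may be at most 128 characters long. Whether build-from-branch.sh sanitizes its default tag is not stated here, so passing -t explicitly is the safe choice for such branches. Below is a minimal sketch of deriving a valid tag; docker_tag_from_branch is a hypothetical helper and feature/TIKA-4578 is an illustrative branch name.

```shell
#!/bin/sh
# Hypothetical helper: turn a Git branch name into a valid Docker tag.
docker_tag_from_branch() {
    # 1. Replace every character outside the allowed set with "-".
    # 2. Strip leading "." and "-" characters, which a tag may not start with.
    # 3. Truncate to the 128-character limit.
    printf '%s' "$1" \
        | tr -c 'A-Za-z0-9_.-' '-' \
        | sed 's/^[.-]*//' \
        | cut -c1-128
}

# Print the build command with an explicit, sanitized tag (not executed).
branch="feature/TIKA-4578"
echo "./build-from-branch.sh -b $branch -t $(docker_tag_from_branch "$branch")"
```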
The Apache Tika gRPC Docker project is designed to produce reproducible builds, ensuring transparency and security in the software supply chain.
Reproducible builds are a set of software development practices that create a verifiable path from source code to binary. When a build is reproducible, anyone can verify that the resulting Docker image was built from the exact source code claimed, without any tampering.
For Official Releases (Post Tika 4.0.0):
GPG Signature Verification
Multi-Stage Builds
Declarative Configuration
For Development Builds:
Git-based Source Control
Version Pinning
Build Transparency
To verify that an image was built from a specific source:
For release builds:
# The build process logs the GPG verification:
docker build --build-arg TIKA_VERSION=4.0.0 -f full/Dockerfile .
# Look for output like:
# gpg: Signature made ...
# gpg: Good signature from "Tim Allison (ASF signing key) <tallison@apache.org>"
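The same kind of check can be run by hand on a downloaded artifact. The sketch below is not taken from the Dockerfiles; it assumes the artifact and its detached .asc signature are already on disk and that the signing keys have been imported (for example with gpg --import on the project's KEYS file). The helper name verify_release_artifact is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify an artifact against its detached signature.
# The signature path defaults to "<artifact>.asc".
verify_release_artifact() {
    artifact="$1"
    signature="${2:-$artifact.asc}"
    if [ ! -f "$artifact" ] || [ ! -f "$signature" ]; then
        echo "missing artifact or signature for: $artifact" >&2
        return 2
    fi
    # gpg exits non-zero when the signature does not match the file.
    gpg --verify "$signature" "$artifact"
}

# Usage (not executed here):
#   verify_release_artifact tika-grpc-4.0.0.jar && echo "signature OK"
```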
For development builds:
# The build logs the exact Git commit:
./build-from-branch.sh -b TIKA-4578
# Look for output showing the Git commit SHA and message
For more information on reproducible builds, visit reproducible-builds.org.
There have been a range of contributors on GitHub and via suggestions, including:
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Official release images are built using GPG-signed Apache release artifacts. The Dockerfiles in this repository download tika-grpc-${VERSION}.jar from Apache distribution mirrors and verify it against the .asc GPG signature file.
This ensures that the Docker images contain only verified, officially released Apache Tika artifacts.
The build-from-branch.sh script allows building Docker images from source code for testing purposes. These builds:
For production use, always build from official Apache releases using the standard Dockerfiles and docker-tool.sh.
It is worth noting that whilst these Docker images download the binary JARs published by the Apache Tika Team on the Apache Software Foundation distribution sites, only the source release of an Apache Software Foundation project is an official release artefact. See Release Distribution Policy for more details.