Nutch Dockerfile

Get up and running quickly with Nutch on Docker.

What is Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

Docker Image

This image currently consists of the following components:

Nutch 1.x (branch “master”)

Base Image

alpine:3.19

Tips

You may need to alias docker to “docker --tls” if you see errors such as:

2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"

The easiest way to do this:

alias docker="docker --tls"

Installation

Install Docker.
Build from files in this directory:

$(boot2docker shellinit | grep export) #may not be necessary
docker build -t apache/nutch .

Security and plugin directories

Nutch loads executable code from the directories configured as plugin.folders (see nutch-default.xml). For production and shared images, treat those paths as trusted: mount them read-only where possible, rebuild images to change plugins, and run the crawl process under a dedicated low-privilege user so the filesystem cannot be abused to drop unexpected JARs or plugin.xml files into that tree.

User-defined JEXL in configuration (for example index.jexl.filter, generator expressions, and hostdb.filter.expression) is evaluated in a sandboxed engine by default. The property nutch.jexl.disable.sandbox disables that protection and must not be set in untrusted environments.
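As a rough sketch of the plugin-directory advice above, the following script assembles a docker run command that mounts a plugin directory read-only and runs under an unprivileged user. The host path, the container path /opt/nutch/plugins, and the uid:gid 1000:1000 are assumptions for illustration only; use the path your plugin.folders setting actually resolves to. Whether the image works under a non-root user depends on how it was built. The script prints the command by default and only executes it when DRY_RUN=0.

```shell
#!/bin/sh
# All values below are hypothetical; adjust them to your image and layout.
IMAGE="apache/nutch"
HOST_PLUGINS="${HOST_PLUGINS:-$PWD/plugins}"                  # assumed host directory with vetted plugins
CONTAINER_PLUGINS="${CONTAINER_PLUGINS:-/opt/nutch/plugins}"  # assumed plugin.folders location in the container
RUN_AS="${RUN_AS:-1000:1000}"                                 # assumed unprivileged uid:gid

# Assemble the docker arguments: non-root user, plugins mounted read-only (:ro).
set -- run -t -i --name nutchcontainer \
  --user "$RUN_AS" \
  -v "$HOST_PLUGINS:$CONTAINER_PLUGINS:ro" \
  "$IMAGE"

NUTCH_DOCKER_CMD="docker $*"

# Print the command by default; set DRY_RUN=0 to actually start the container.
if [ "${DRY_RUN:-1}" = "0" ]; then
  docker "$@"
else
  echo "$NUTCH_DOCKER_CMD"
fi
```

With the :ro mount, a process inside the container cannot add or replace files under the plugin path, so changing plugins means updating the host directory or rebuilding the image.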

Usage

If Docker is not already running, start it:

boot2docker up
$(boot2docker shellinit | grep export)

Run a container interactively (nutch and crawl are on PATH; default command is bash):

docker run -t -i --name nutchcontainer apache/nutch

To open a second shell in the running container, run the following from another terminal:

docker exec -it nutchcontainer /bin/bash

Nutch is located in $NUTCH_HOME and is almost ready to run. Before crawling, you will need to provide seed URLs and set the http.agent.name configuration property in $NUTCH_HOME/conf/nutch-site.xml to your crawler's agent name. For additional “getting started” information, check out the Nutch Tutorial.
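The two preparation steps above could look roughly like this inside the container. This is a minimal sketch: the agent name MyTestCrawler, the seed directory $NUTCH_HOME/urls, and the seed URL are placeholders, not values defined by this image. The script replaces nutch-site.xml with a file containing only http.agent.name, so it keeps a .bak copy of any existing file first.

```shell
#!/bin/sh
# Falls back to a scratch directory when NUTCH_HOME is unset,
# so the sketch can also be tried outside the container.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
AGENT_NAME="${AGENT_NAME:-MyTestCrawler}"   # hypothetical agent name
SEED_DIR="${SEED_DIR:-$NUTCH_HOME/urls}"    # hypothetical seed directory

mkdir -p "$SEED_DIR" "$NUTCH_HOME/conf"

# Seed list: one URL per line (placeholder URL).
printf '%s\n' 'https://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Keep a copy of any existing site configuration before overwriting it.
if [ -f "$NUTCH_HOME/conf/nutch-site.xml" ]; then
  cp "$NUTCH_HOME/conf/nutch-site.xml" "$NUTCH_HOME/conf/nutch-site.xml.bak"
fi

# Write a site configuration that sets only http.agent.name.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$AGENT_NAME</value>
  </property>
</configuration>
EOF
```

From there, a crawl can be started with the crawl script, for example something along the lines of crawl -s "$SEED_DIR" crawl/ 2; the exact options vary between Nutch versions, so check the usage text printed by running crawl without arguments.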