Nutch Dockerfile

Get up and running quickly with Nutch on Docker.

What is Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

Docker Image

This image currently consists of the following components:

  • Nutch 1.x (branch “master”)

Base Image

Tips

You may need to alias docker to “docker --tls” if you see errors such as:

2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"

The easiest way to do this:

  1. alias docker="docker --tls"

Installation

  1. Install Docker.

  2. Build from files in this directory:

$(boot2docker shellinit | grep export) # may not be necessary
docker build -t apache/nutch .

Security and plugin directories

Nutch loads executable code from the directories configured as plugin.folders (see nutch-default.xml). For production and shared images, treat those paths as trusted: mount them read-only where possible, rebuild images to change plugins, and run the crawl process under a dedicated low-privilege user so the filesystem cannot be abused to drop unexpected JARs or plugin.xml files into that tree.
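As a rough sketch of the advice above, the following assembles a docker run command that mounts a plugin directory read-only and runs as a non-root user. The host path, the container path /opt/nutch/plugins, and the UID/GID 1000:1000 are hypothetical placeholders, not values taken from this image; check plugin.folders and the image layout before using them. The image may also need adjusting before it runs cleanly as a non-root user.

```shell
# Hypothetical paths: adjust to where your plugins live on the host and
# to the directory that plugin.folders resolves to inside the container.
PLUGIN_SRC="${PLUGIN_SRC:-$PWD/plugins}"
PLUGIN_DST="${PLUGIN_DST:-/opt/nutch/plugins}"

# Assemble the command as positional parameters so it can be reviewed first.
# --user runs the container process as an unprivileged UID:GID (assumed 1000:1000).
# The :ro suffix makes the bind mount read-only inside the container.
set -- docker run -t -i --name nutchcontainer \
  --user 1000:1000 \
  -v "$PLUGIN_SRC:$PLUGIN_DST:ro" \
  apache/nutch

# Dry run: print the command instead of executing it.
echo "$*"

# To actually start the container, run: "$@"
```

With a read-only mount, changing plugins means rebuilding the image or replacing the host directory, which matches the "rebuild images to change plugins" recommendation.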

User-defined JEXL in configuration (for example index.jexl.filter, generator expressions, and hostdb.filter.expression) is evaluated in a sandboxed engine by default. The property nutch.jexl.disable.sandbox disables that protection and must not be set in untrusted environments.

Usage

If Docker is not already running, start it:

boot2docker up
$(boot2docker shellinit | grep export)

Run a container interactively (nutch and crawl are on PATH; default command is bash):

docker run -t -i --name nutchcontainer apache/nutch

In another terminal, attach to a running container if needed:

docker exec -it nutchcontainer /bin/bash

Nutch is located in $NUTCH_HOME and is almost ready to run. You will need to set seed URLs and update the http.agent.name configuration property in $NUTCH_HOME/conf/nutch-site.xml with your crawler's agent name. For additional “getting started” information, check out the Nutch Tutorial.
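The two remaining setup steps could look roughly like the sketch below, run inside the container. The urls/seed.txt location, the seed URL, and the agent name MyNutchCrawler are illustrative assumptions; replace them with your own values. Note that the sketch overwrites any existing nutch-site.xml, so merge by hand if you already have local settings.

```shell
# Assumption: NUTCH_HOME is set inside the container.
# Outside the container, fall back to a scratch directory for a dry run.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
mkdir -p "$NUTCH_HOME/conf" "$NUTCH_HOME/urls"

# Seed list: one URL per line. The urls/seed.txt path is a convention
# assumed here, not something this image requires.
printf '%s\n' 'https://nutch.apache.org/' > "$NUTCH_HOME/urls/seed.txt"

# Minimal nutch-site.xml that sets only http.agent.name.
# "MyNutchCrawler" is a placeholder agent name.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
EOF
```

Values in nutch-site.xml override the defaults in nutch-default.xml, so only the properties you want to change need to appear there.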