Get up and running quickly with Nutch on Docker.

Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster.
Current configuration of this image consists of components:
You may need to alias docker to “docker --tls” if you see errors such as:
2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"
The easiest way to do this:
alias docker="docker --tls"
Install Docker.
Build from files in this directory:
$(boot2docker shellinit | grep export) #may not be necessary
docker build -t apache/nutch .
Nutch loads executable code from the directories configured as plugin.folders (see nutch-default.xml). For production and shared images, treat those paths as trusted: mount them read-only where possible, rebuild images to change plugins, and run the crawl process under a dedicated low-privilege user so the filesystem cannot be abused to drop unexpected JARs or plugin.xml files into that tree.
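As a rough illustration of the read-only mount and low-privilege user advice, the sketch below assembles a `docker run` command and prints it instead of executing it. The host directory, the in-container plugin path, and the UID:GID are hypothetical placeholders, not values taken from this image; check `plugin.folders` in your own configuration for the real path.

```shell
# Sketch only: every value below is an assumption to adapt.
PLUGINS_HOST_DIR="${PLUGINS_HOST_DIR:-$PWD/plugins}"                   # hypothetical host copy of the full plugin set
PLUGINS_CONTAINER_DIR="${PLUGINS_CONTAINER_DIR:-/path/to/nutch/plugins}" # placeholder: the directory plugin.folders resolves to
RUN_AS="${RUN_AS:-1000:1000}"                                           # hypothetical unprivileged UID:GID

# Build the command as positional parameters so paths with spaces survive.
set -- docker run -t -i --name nutchcontainer \
  --user "$RUN_AS" \
  -v "$PLUGINS_HOST_DIR:$PLUGINS_CONTAINER_DIR:ro" \
  apache/nutch

# Dry run: print the command for review. Replace this line with "$@" to execute it.
printf '%s\n' "$*"
```

The `:ro` suffix makes the bind mount read-only inside the container, and `--user` runs the container process as the given UID:GID rather than root. A bind mount hides whatever the image ships at that path, so the host directory needs to hold every plugin the crawl uses. The unprivileged user also needs write access to wherever crawl data is stored.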
User-defined JEXL in configuration (for example index.jexl.filter, generator expressions, and hostdb.filter.expression) is evaluated in a sandboxed engine by default. The property nutch.jexl.disable.sandbox disables that protection and must not be set in untrusted environments.
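One way to catch an accidental override is a small pre-flight check like the sketch below. It assumes that site-level overrides live in `$NUTCH_HOME/conf/nutch-site.xml` and that extra JVM options are passed through the `NUTCH_OPTS` environment variable. It only searches for the property name and does not interpret its value, so a hit means "review this", not "the sandbox is off".

```shell
# Sketch only: flags any mention of the sandbox switch in the places assumed above.
check_jexl_sandbox() {
  site_xml="${NUTCH_HOME:-.}/conf/nutch-site.xml"

  # -s keeps grep quiet if the file does not exist.
  if grep -qs 'nutch.jexl.disable.sandbox' "$site_xml"; then
    echo "review: $site_xml mentions nutch.jexl.disable.sandbox"
    return 1
  fi

  case "${NUTCH_OPTS:-}" in
    *nutch.jexl.disable.sandbox*)
      echo "review: NUTCH_OPTS mentions nutch.jexl.disable.sandbox"
      return 1
      ;;
  esac

  echo "ok: no sandbox override found"
}

check_jexl_sandbox
```

The function returns a non-zero status when it finds a mention, so it can gate a container entrypoint or a CI step.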
If Docker is not already running, start it:
boot2docker up
$(boot2docker shellinit | grep export)
Run a container interactively (nutch and crawl are on PATH; default command is bash):
docker run -t -i --name nutchcontainer apache/nutch
In another terminal, attach to a running container if needed:
docker exec -it nutchcontainer /bin/bash
Nutch is located in $NUTCH_HOME and is almost ready to run. You will need to set seed URLs and update the http.agent.name configuration property in $NUTCH_HOME/conf/nutch-site.xml with your crawler's Agent Name. For additional “getting started” information, check out the Nutch Tutorial.
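The two setup steps above could look roughly like the following sketch. The seed URL, the `urls` directory name, and the agent name `MyTestCrawler` are placeholders chosen for illustration. The script replaces nutch-site.xml wholesale after saving a `.bak` copy, so merge by hand if the file already holds other properties.

```shell
# Sketch only: seed URL, directory layout, and agent name are assumptions.
# Outside the container NUTCH_HOME is usually unset, so fall back to a scratch directory.
NUTCH_HOME="${NUTCH_HOME:-$(mktemp -d)}"
SEED_DIR="$NUTCH_HOME/urls"                  # hypothetical location for seed lists
SITE_XML="$NUTCH_HOME/conf/nutch-site.xml"

mkdir -p "$SEED_DIR" "$NUTCH_HOME/conf"

# Seed list: one URL per line.
printf '%s\n' 'https://nutch.apache.org/' > "$SEED_DIR/seed.txt"

# Keep a copy of any existing site configuration before replacing it.
if [ -f "$SITE_XML" ]; then
  cp "$SITE_XML" "$SITE_XML.bak"
fi

# Minimal site configuration that sets only the crawler's agent name.
cat > "$SITE_XML" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
  </property>
</configuration>
EOF

echo "seeds:  $SEED_DIR/seed.txt"
echo "config: $SITE_XML"
```

The arguments accepted by the `crawl` script differ between Nutch releases. Running `crawl` with no arguments inside the container should print the usage for the installed version.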