Nutch Dockerfile

Get up and running quickly with Nutch on Docker.

What is Nutch?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Nutch can run on a single machine, but it gains much of its strength from running on a Hadoop cluster.

Docker Image

This image currently consists of the following components:

  • Nutch 1.x (branch “master”)

Base Image

Tips

You may need to alias docker to “docker --tls” if you see errors such as:

2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"

The easiest way to do this:

  1. alias docker="docker --tls"

Installation

  1. Install Docker.

  2. Build from files in this directory:

There are three build modes, selected with the --build-arg BUILD_MODE=<mode> flag. All build-arg values shown here are the defaults.

  • 0 == Nutch master branch source install with the crawl and nutch scripts on $PATH
  • 1 == Same as mode 0 with the addition of the Nutch REST Server; additional build args --build-arg SERVER_PORT=8081 and --build-arg SERVER_HOST=0.0.0.0
  • 2 == Same as mode 1 with the addition of the Nutch WebApp; additional build arg --build-arg WEBAPP_PORT=8080

For example, to install the Nutch master branch and run both the Nutch REST server and the webapp, run the following:

$(boot2docker shellinit | grep export) #may not be necessary
docker build -t apache/nutch . --build-arg BUILD_MODE=2 --build-arg SERVER_PORT=8081 --build-arg SERVER_HOST=0.0.0.0 --build-arg WEBAPP_PORT=8080
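
The mapping from build mode to build args described above can be sketched as a small shell helper. This is only an illustrative sketch and not part of the image: the function name nutch_build_cmd is hypothetical, and the port and host values are the defaults listed above. It prints the command rather than running it, so it can be inspected first.

```shell
# Hypothetical helper: print the docker build command for a given BUILD_MODE.
# Port and host defaults mirror the values documented in this README.
nutch_build_cmd() {
  mode="${1:-0}"
  cmd="docker build -t apache/nutch . --build-arg BUILD_MODE=${mode}"
  case "$mode" in
    0)
      # Mode 0 needs no additional build args.
      ;;
    1|2)
      # Modes 1 and 2 both include the REST server settings.
      cmd="${cmd} --build-arg SERVER_PORT=${SERVER_PORT:-8081}"
      cmd="${cmd} --build-arg SERVER_HOST=${SERVER_HOST:-0.0.0.0}"
      if [ "$mode" = "2" ]; then
        # Mode 2 additionally includes the webapp port.
        cmd="${cmd} --build-arg WEBAPP_PORT=${WEBAPP_PORT:-8080}"
      fi
      ;;
    *)
      echo "unknown BUILD_MODE: ${mode}" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$cmd"
}
```

Calling nutch_build_cmd 2 prints the same command as the example above; pipe the output to sh, or copy it, to actually run the build.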

Usage

If Docker is not already running, start it:

boot2docker up
$(boot2docker shellinit | grep export)

Run a container

docker run -t -i -d -p 8080:8080 -p 8081:8081 --name nutchcontainer apache/nutch
c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc

docker logs c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc
2021-06-29 19:14:32,922 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2021-06-29 19:14:32,925 INFO supervisord started with pid 1
2021-06-29 19:14:33,929 INFO spawned: 'nutchserver' with pid 8
2021-06-29 19:14:33,932 INFO spawned: 'nutchwebapp' with pid 9
2021-06-29 19:14:36,012 INFO success: nutchserver entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2021-06-29 19:14:36,012 INFO success: nutchwebapp entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)

You can now access the webapp at http://localhost:8080 and interact with the REST API on port 8081, e.g.

curl http://localhost:8081/admin
{"startDate":1625118207995,"configuration":["default"],"jobs":[],"runningJobs":[]}
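
The supervisord log shows that the services take a couple of seconds to reach the RUNNING state, so a request sent immediately after docker run may fail. A minimal sketch of a retry loop follows; the function name retry_until is hypothetical and the attempt count and delay are arbitrary choices, not values from the image.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, retry_until 10 2 curl -fsS http://localhost:8081/admin would poll the REST server port published in the docker run command above, up to ten times with two seconds between attempts.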

Attach to the container

docker exec -it c5401810e50a606f43256b4b24602443508bd9badcf2b7493bd97839834571fc /bin/bash

View supervisord logs

cat /tmp/supervisord.log
2021-06-29 19:14:32,922 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2021-06-29 19:14:32,925 INFO supervisord started with pid 1
2021-06-29 19:14:33,929 INFO spawned: 'nutchserver' with pid 8
2021-06-29 19:14:33,932 INFO spawned: 'nutchwebapp' with pid 9
2021-06-29 19:14:36,012 INFO success: nutchserver entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)
2021-06-29 19:14:36,012 INFO success: nutchwebapp entered RUNNING state, process has stayed up for > than 2 seconds (startsecs)

View supervisord subprocess logs

ls /var/log/supervisord/
nutchserver_stderr.log  nutchserver_stdout.log  nutchwebapp_stderr.log  nutchwebapp_stdout.log
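
To skim all of the subprocess logs at once from inside the container, a loop over the directory is enough. The sketch below assumes the log directory shown above; the function name tail_logs and the default of 20 lines are illustrative choices.

```shell
# Hypothetical helper: print the last N lines of every *.log file in a directory.
# Usage: tail_logs [log-dir] [lines]
tail_logs() {
  log_dir="${1:-/var/log/supervisord}"
  lines="${2:-20}"
  for f in "$log_dir"/*.log; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    printf '==> %s <==\n' "$f"
    tail -n "$lines" "$f"
  done
}
```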

Nutch is located in $NUTCH_HOME and is almost ready to run. You will need to set seed URLs and update the http.agent.name configuration property in $NUTCH_HOME/conf/nutch-site.xml with your crawler's agent name. For additional “getting started” information, check out the Nutch Tutorial.
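
The two remaining setup steps can be sketched as shell helpers. This is a minimal sketch under stated assumptions: the function names write_nutch_site and write_seeds are hypothetical, the agent name and seed URL you pass in are placeholders of your choosing, and the seed file name seed.txt is an assumption rather than something this image requires. Note that write_nutch_site overwrites any existing nutch-site.xml in the target directory, so back up the original first if it contains other settings.

```shell
# Hypothetical helper: write a minimal nutch-site.xml that sets http.agent.name.
# WARNING: overwrites any existing nutch-site.xml in the target directory.
# Usage: write_nutch_site <conf-dir> <agent-name>
write_nutch_site() {
  conf_dir="$1"
  agent_name="$2"
  mkdir -p "$conf_dir"
  cat > "${conf_dir}/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>${agent_name}</value>
  </property>
</configuration>
EOF
}

# Hypothetical helper: write one seed URL per line into <seed-dir>/seed.txt.
# Usage: write_seeds <seed-dir> <url> [url...]
write_seeds() {
  seed_dir="$1"
  shift
  mkdir -p "$seed_dir"
  printf '%s\n' "$@" > "${seed_dir}/seed.txt"
}
```

Inside the container this might be used as write_nutch_site "$NUTCH_HOME/conf" "MyCrawler" followed by write_seeds urls https://nutch.apache.org/, where MyCrawler and the urls directory are placeholder choices.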