docker/hbase/README.md

Nutch Dockerfile

This directory contains a Dockerfile for building a Docker image of Apache Nutch 2.X.

What is Nutch?

Apache Nutch is a highly extensible and scalable open-source web crawler.

Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.

Docker Image

The current configuration of this image consists of the following components:

  • Apache Hadoop 2.5.1
  • Apache HBase 0.98.8-hadoop2
  • Apache Nutch 2.X HEAD (so the image is always built from the latest development code)

Base Image

Installation

  1. Install Docker.

2a. Download the automated build from the public Docker Hub registry:

docker pull nutch/nutch_with_hbase_hadoop

2b. Alternatively, build the image from the files in this directory:

$(boot2docker shellinit)
docker build -t <new name for image> .

Usage

Start Docker (when using boot2docker):

boot2docker up
$(boot2docker shellinit)

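Start a container from the image and open a shell in it. The first command starts the container in the background (-d) and stores its ID in IMAGE_PID; the second opens a bash shell inside that container. If you pulled the image in step 2a, use nutch/nutch_with_hbase_hadoop as the image name; if you built it in step 2b, use the name you chose. To see the startup logs, run docker logs $IMAGE_PID.

IMAGE_PID=$(docker run -d -i -t nutch_with_hbase_hadoop)
docker exec -i -t $IMAGE_PID bash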

Nutch is installed in /opt/nutch/ and is almost ready to run. Review the configuration in /opt/nutch/conf/, then start crawling:

echo 'http://nutch.apache.org' > seed.txt
/opt/nutch/bin/nutch inject seed.txt
/opt/nutch/bin/nutch generate -topN 10   # prints the batchId used in the following steps
/opt/nutch/bin/nutch fetch <batchId>
/opt/nutch/bin/nutch parse <batchId>
/opt/nutch/bin/nutch updatedb <batchId>
[...]
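
The three per-batch steps above (fetch, parse, updatedb) always run in the same order with the same batchId, so they can be wrapped in a small helper. The following is only a sketch: the function name crawl_round and the NUTCH variable are illustrative and not part of Nutch, and the batchId is assumed to be copied by hand from the output of the generate step.

```shell
#!/bin/sh
# Illustrative helper, not part of Nutch. NUTCH defaults to the path used in
# this image and can be overridden for other installations.
NUTCH="${NUTCH:-/opt/nutch/bin/nutch}"

# Run fetch, parse and updatedb for one batch, stopping at the first failure.
crawl_round() {
    batch_id="$1"
    if [ -z "$batch_id" ]; then
        echo "usage: crawl_round <batchId>" >&2
        return 2
    fi
    for step in fetch parse updatedb; do
        echo "nutch $step $batch_id"
        "$NUTCH" "$step" "$batch_id" || return 1
    done
}
```

After sourcing this file inside the container, crawl_round <batchId> runs one round for the batch that generate reported. Stopping at the first failing step avoids updating the database from a batch that was not fully fetched and parsed.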

Resources

For more information on Nutch 2.X please see the tutorials and Nutch 2.X wiki space.