Apache Nutch is a highly extensible and scalable open source web crawler software project.
Nutch can run on a single machine, but it gains much of its strength from running in a Hadoop cluster.
Current configuration of this image consists of components:
You may need to alias docker to “docker --tls” if you see errors such as:
2015/04/07 09:19:56 Post http://192.168.59.103:2376/v1.14/containers/create?name=NutchContainer: malformed HTTP response "\x15\x03\x01\x00\x02\x02\x16"
The easiest way to do this:
alias docker="docker --tls"
Build from files in this directory:
$(boot2docker shellinit | grep export)
docker build -t apache/nutch .
boot2docker up
$(boot2docker shellinit | grep export)
Start up an image and attach to it:
docker run -t -i -d --name nutchcontainer apache/nutch /bin/bash
docker attach --sig-proxy=false nutchcontainer
Nutch is located in ~/nutch and is almost ready to run. You will need to set seed URLs and update the configuration with your crawler's Agent Name. For additional “getting started” information, check out the Nutch Tutorial.
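As a rough sketch of those two steps, the commands below create a seed list and set the agent name through the `http.agent.name` property. The `urls/seed.txt` and `conf/nutch-site.xml` locations under ~/nutch, the seed URL, and the agent name `MyNutchCrawler` are assumptions for illustration; the layout inside this image may differ, so adjust the paths before running.

```shell
# Assumption: Nutch lives in ~/nutch with conf/ and urls/ directly beneath it.
NUTCH_HOME="${NUTCH_HOME:-$HOME/nutch}"
mkdir -p "$NUTCH_HOME/urls" "$NUTCH_HOME/conf"

# Seed list: one URL per line. The URL here is only a placeholder.
printf '%s\n' 'http://nutch.apache.org/' > "$NUTCH_HOME/urls/seed.txt"

# Agent name: "MyNutchCrawler" is a placeholder value.
# Caution: this overwrites any existing nutch-site.xml.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
EOF
```

If nutch-site.xml already contains other properties, it is safer to add the `http.agent.name` property by hand than to overwrite the file.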