
Apache Nutch 2.x with Cassandra on Docker

This project contains 3 Docker containers running Apache Nutch 2.x configured with Apache Cassandra storage.

This project is fully operational but still experimental. Feedback and suggestions are welcome, and contributions will be highly appreciated!


  1. Build the images and start the containers. NOTE: on Mac OS with boot2docker, please read the notes section below first.
# Build the images ( this will build the application )

# Start all containers with data folders from scripts

# stop all containers 

# restart containers 

  2. Start crawling with Nutch 2.x.
# Run the crawler. You can use the docker exec command, docker attach to the container and run the commands there, or use docker-enter if you are on Mac OS.

docker exec NUTCH01 /opt/nutch/bin/crawl /opt/nutch/testUrls test_crawl 3
# OR

docker-enter NUTCH01
root@9ec43c388769:/# cd opt/nutch
root@9ec43c388769:/opt/nutch# ./bin/crawl
Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
root@9ec43c388769:/opt/nutch# ./bin/crawl testUrls test_crawl 3
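
The `docker exec` invocation above can be wrapped in a small helper that checks its arguments before touching the container. This is only a sketch: `run_crawl`, `NUTCH_CONTAINER`, and `DRY_RUN` are hypothetical names introduced for illustration and are not part of this project. It assumes the three-argument form of the crawl script (no Solr URL) and the container name `NUTCH01` listed in this README.

```shell
#!/bin/sh
# Sketch only: run_crawl, NUTCH_CONTAINER and DRY_RUN are hypothetical names.
# The container name defaults to NUTCH01, as stated in this README.
NUTCH_CONTAINER="${NUTCH_CONTAINER:-NUTCH01}"

run_crawl() {
    # Expect exactly <seedDir> <crawlID> <numberOfRounds>; the optional solrUrl is left out here.
    if [ "$#" -ne 3 ]; then
        echo "usage: run_crawl <seedDir> <crawlID> <numberOfRounds>" >&2
        return 2
    fi
    # Reject an empty or non-numeric round count.
    case "$3" in
        ''|*[!0-9]*)
            echo "numberOfRounds must be a number: $3" >&2
            return 2
            ;;
    esac
    # Rebuild the positional parameters as the full docker command.
    set -- docker exec "$NUTCH_CONTAINER" /opt/nutch/bin/crawl "$1" "$2" "$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # print the command instead of running it
    else
        "$@"
    fi
}
```

With `DRY_RUN=1` set, `run_crawl /opt/nutch/testUrls test_crawl 3` only prints the command it would run, which is a cheap way to check the arguments before starting a long crawl.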


Nutch 2.x container name: NUTCH01

Cassandra container name: CASS01

Cassandra is installed with OpsCenter.

## Mac OS X notes

  • You need to mount the data folders into your virtual machine so that data persists every time you run this application.
  • You might need to install docker-enter for easier access to the containers.
mkdir ~/docker-data
mkdir ~/docker-data/cassandra
mkdir ~/docker-data/nutch

chmod -R 777  ~/docker-data/

VBoxManage sharedfolder add boot2docker-vm -name home -hostpath ~/

boot2docker up
boot2docker ssh

#mkdir /data
#mount -t vboxsf -o uid=1000,gid=50 data /data
#vi /etc/fstab
#data            /data           vboxsf   rw,nodev,relatime    0 0
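
The host-side folder creation at the start of these notes can be made repeatable, so that running it a second time does not fail on existing directories. The following is a minimal sketch, assuming the same `~/docker-data` layout; `make_data_dirs` is a hypothetical helper name, not something shipped with this project.

```shell
#!/bin/sh
# Sketch only: make_data_dirs is a hypothetical helper.
# It takes an optional base directory and defaults to ~/docker-data, as used in the notes.
make_data_dirs() {
    base="${1:-$HOME/docker-data}"
    for name in cassandra nutch; do
        # -p creates missing parents and is a no-op if the directory already exists.
        mkdir -p "$base/$name" || return 1
    done
    # Mode 777 mirrors the chmod in the original instructions.
    chmod -R 777 "$base"
}
```

Calling `make_data_dirs` with no argument uses `~/docker-data`; passing a path creates the same `cassandra` and `nutch` subfolders under that path instead.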