This project contains 3 Docker containers running Apache Nutch 2.x configured with Apache Cassandra storage.
This project is fully operational but still experimental. Any feedback, suggestions, and contributions will be highly appreciated!

    # Build the images (this will build the application)
    ./bin/

    # Start all containers with data folders from scripts
    ./bin/

    # Stop all containers
    ./bin/

    # Restart containers
    ./bin/

Run the crawler. You can use the `docker exec` command, `docker attach` to the container and run the commands there, or use `docker-enter` if you are using Mac OS.

    docker exec NUTCH01 /opt/nutch/bin/crawl /opt/nutch/testUrls test_crawl 3

    # OR
    docker-enter NUTCH01
    root@9ec43c388769:/# cd opt/nutch
    root@9ec43c388769:/opt/nutch# ./bin/crawl
    Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
    root@9ec43c388769:/opt/nutch# ./bin/crawl testUrls test_crawl 3

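
If you run crawls often, it may help to wrap the `docker exec` invocation in a small function. The sketch below is only an illustration: `build_crawl_cmd` is a hypothetical helper that is not part of this project. It prints the command instead of running it, so you can inspect it first, and it assumes the crawl script lives at `/opt/nutch/bin/crawl` as in the example above.

```shell
# Hypothetical helper (not shipped with this project): print the docker exec
# command for a crawl, following the usage
#   crawl <seedDir> <crawlID> <numberOfRounds>
build_crawl_cmd() {
  container="$1"
  seed_dir="$2"
  crawl_id="$3"
  rounds="$4"

  # Reject an empty or non-numeric number of rounds.
  case "$rounds" in
    ''|*[!0-9]*)
      echo "numberOfRounds must be a non-negative integer" >&2
      return 1
      ;;
  esac

  printf 'docker exec %s /opt/nutch/bin/crawl %s %s %s\n' \
    "$container" "$seed_dir" "$crawl_id" "$rounds"
}

# Usage (prints the command; pipe it to sh to actually run it):
#   build_crawl_cmd NUTCH01 /opt/nutch/testUrls test_crawl 3
```

The optional `<solrUrl>` argument is left out of this sketch on purpose, to keep it in line with the three-argument call shown above.
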
Nutch 2.x Container name : NUTCH01
Cassandra Container name : CASS01
Cassandra installed with OpsCenter
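
Before starting a crawl you may want to confirm that the named containers are actually up. The following is a minimal sketch, assuming the container names listed above; `missing_containers` is a hypothetical helper, not something this project provides. It reads the running container names from standard input and prints any expected name that is absent.

```shell
# Hypothetical helper: read running container names (one per line) on stdin
# and print every expected name, given as arguments, that is not among them.
missing_containers() {
  running="$(cat)"
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}

# Usage (requires a running Docker daemon):
#   docker ps --format '{{.Names}}' | missing_containers NUTCH01 CASS01
```

An empty output means every expected container was found in the list.
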
## Mac OS X notes

    mkdir ~/docker-data
    mkdir ~/docker-data/cassandra
    mkdir ~/docker-data/nutch
    chmod -R 777 ~/docker-data/
    VBoxManage sharedfolder add boot2docker-vm -name home -hostpath ~/
    boot2docker up
    boot2docker ssh
    #mkdir /data
    #mount -t vboxsf -o uid=1000,gid=50 data /data
    #vi /etc/fstab
    #data /data vboxsf rw,nodev,relatime 0 0
    #docker-enter
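
The directory setup at the top of these notes can also be written so that it is safe to run more than once. This is a sketch only: `make_data_dirs` is a hypothetical function name, and the default root of `~/docker-data` is taken from the commands above. It uses `mkdir -p`, which does not fail when the folders already exist.

```shell
# Hypothetical helper: create the cassandra and nutch data folders under a
# root directory (defaults to ~/docker-data) and open up their permissions.
make_data_dirs() {
  root="${1:-$HOME/docker-data}"

  # -p creates missing parents and is a no-op for existing directories.
  mkdir -p "$root/cassandra" "$root/nutch" || return 1

  # Same permissive mode as in the notes above.
  chmod -R 777 "$root"
}

# Usage:
#   make_data_dirs                 # uses ~/docker-data
#   make_data_dirs /some/other/dir # custom root (illustrative path)
```
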