WebIndex

An example Fluo application that creates a web index using CommonCrawl data.

Requirements

In order to run this application, you need the following installed and running on your machine:

  • Hadoop (HDFS & YARN)
  • Accumulo
  • Fluo

Consider using fluo-dev to run these requirements.
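
A quick way to sanity-check that these services are up is to look at the running Java processes. The process names below are typical for a default install; treat this as a rough check rather than part of the setup:

# A working setup typically shows HDFS (NameNode, DataNode) and YARN
# (ResourceManager, NodeManager) processes; Accumulo processes may
# appear simply as "Main"
jps
# These confirm the commands are on your PATH and print version info
hadoop version
accumulo version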

Configure your environment

First, create data.yml and dropwizard.yml from the example files and edit them for your environment:

cd conf
cp data.yml.example data.yml
cp dropwizard.yml.example dropwizard.yml

Download CommonCrawl data

Next, run the following command to download CommonCrawl data files. The fileType can be wat, wet, or warc. The command first downloads a file containing the URL paths of thousands of data files; numFiles specifies how many of those files to download from AWS and load into your HDFS instance.

# Command structure
./bin/download.sh <fileType> <numFiles>
# Use the command below for this example
./bin/download.sh wat 1
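
To confirm the files actually landed in HDFS, list the directory they were loaded into. The path below is a placeholder for illustration only; check your data.yml for the location your setup actually uses:

# /webindex/data is a hypothetical path; substitute your own
hdfs dfs -ls -R /webindex/data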

Initialize Fluo & Accumulo with data

Next, run the following command to start Fluo and initialize it and Accumulo with the downloaded data:

./bin/init.sh
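
Once init.sh finishes, you can check from the Accumulo shell that tables were created. The credentials below are placeholders from a typical dev install, not values defined by this project:

# 'root'/'secret' are placeholder credentials; use your instance's own
accumulo shell -u root -p secret -e "tables"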

Run the web application

Finally, run the following command to start the web app:

./bin/webapp.sh

Open your browser to http://localhost:8080/
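
If you prefer the command line, a plain HTTP request is enough to confirm the web app is serving:

# expect an HTTP 200 response once the app has started
curl -i http://localhost:8080/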