commit c482e7759e7dd08bb3c64b8ed50cdaca9d0abaeb
author: Keith Turner <kturner@apache.org> Fri Nov 13 23:01:09 2015 -0500
committer: Keith Turner <kturner@apache.org> Tue Nov 17 18:01:28 2015 -0500
tree: d251ba40f0de67e71a25525e6d40dea5076b8889
parent: a7ab6d13258b5d4312ba41de680d53e84bfceada
#26 log error and continue when error getting domain
An example Fluo application that indexes links to web pages in multiple ways. The example uses CommonCrawl data as input. See the tables and code documentation for more information about how this example works.
In order to run this application, you need the following installed and running on your machine:
Consider using fluo-dev to set up these requirements.
First, you must create the configuration file `data.yml` in the `conf/` directory and edit it for your environment:

```
cp conf/data.yml.example conf/data.yml
```
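The copied file is plain YAML. As a minimal sketch only — the key names below are illustrative assumptions, not the project's actual schema; consult the comments in `conf/data.yml.example` for the real property names:

```yaml
# Hypothetical sketch of conf/data.yml.
# Key names here are assumptions for illustration; the real keys
# are documented in conf/data.yml.example.
numInitFiles: 2    # assumed: number of CommonCrawl files copied into init/
numLoadFiles: 4    # assumed: number of CommonCrawl files copied into load/
```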
There are a few environment variables that need to be set to run these scripts (see `conf/webindex-env.sh.example` for a list). If you don't want to set them in your `~/.bashrc`, create `webindex-env.sh` in `conf/` and set them there:

```
cp conf/webindex-env.sh.example conf/webindex-env.sh
```
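The file is an ordinary shell script of exports. A sketch of what it might contain — the variable names and paths below are assumptions for illustration; the authoritative list is in `conf/webindex-env.sh.example`:

```shell
# Hypothetical sketch of conf/webindex-env.sh.
# Variable names and install paths are assumptions; check
# conf/webindex-env.sh.example for the variables the scripts actually read.
export HADOOP_PREFIX=/opt/hadoop   # assumed: Hadoop install location
export FLUO_HOME=/opt/fluo         # assumed: Fluo install location
export SPARK_HOME=/opt/spark       # assumed: Spark install location
```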
CommonCrawl data sets are hosted in S3. The following command runs a Spark job that copies data files from AWS into the HDFS `init/` and `load/` directories. This script is configured by `conf/data.yml`, where you can specify the number of files to copy into each directory:

```
./bin/webindex copy
```
After you have copied data into HDFS, run the following command to start a Spark job that initializes Fluo and Accumulo with the data in your HDFS `init/` directory:

```
./bin/webindex init
```
The `init` command can only be run on an empty cluster. To add more data, run the `load` command, which starts a Spark job that loads Fluo with the data in your HDFS `load/` directory. Fluo will incrementally process this data and update Accumulo:

```
./bin/webindex load
```
Run the following command to start the webindex UI:

```
./bin/webindex ui
```
The UI is implemented using Dropwizard. While the default Dropwizard configuration works well, you can modify it by creating and editing `dropwizard.yml` in `conf/`:

```
cp conf/dropwizard.yml.example conf/dropwizard.yml
```
Open your browser to http://localhost:8080/ to view the UI.