commit	69ec547a308d6b764b89bbd2b1714a9697777261	[log] [tgz]
author	Corey J. Nolet <cjnolet@gmail.com>	Mon Sep 29 14:01:24 2014 -0400
committer	Corey J. Nolet <cjnolet@gmail.com>	Mon Sep 29 14:01:24 2014 -0400
tree	f02b916b9bc6f55f77838e0521bb377c451a06cd
parent	f69da4a7e4e54f989d4c6ebe007f4775a07d3ceb [diff]

tree: f02b916b9bc6f55f77838e0521bb377c451a06cd

README.md

Fluo

A Percolator prototype for Accumulo. This prototype relies on Accumulo 1.6.0 which has ACCUMULO-1000 and ACCUMULO-112. ACCUMULO-1000 makes cross row transactions possible and ACCUMULO-112 makes it possible to efficiently find notifications. Theoretically, this prototype is to a point where it could run in a distributed manner. But this has not been tested. The pieces are in place, CAS is done on the tablet server and the Oracle is a service.

There is a lot that needs to be done. If you are interested in contributing send me an email or check out the issues.

If you want to experiment with Fluo, check out the phrasecount example. This example is a simple end-to-end application that is super easy to run. It also has handling for real world problems like high cardinality phrases.

Building Fluo

Using Maven, you can build Fluo with the following steps.

git clone https://github.com/fluo-io/fluo.git
cd fluo
mvn package

Running mvn package will build a tar ball that contains all of the dependencies needed at runtime. Currently the tar ball is targeted towards Hadoop 2.3.0 and Accumulo 1.6.0. If you have different versions installed on your cluster, you can try passing the following options to maven package. Other versions of Hadoop and Accumulo may not work, please open a bug if you run into problems.

mvn package -Daccumulo.version=1.6.1-SNAPSHOT -Dhadoop.version=2.4.0

Once the tarball is built, deploy it using the command below which can be modified to a directory of your choice (rather than /opt).

tar -C /opt -xvzf modules/distribution/target/fluo-1.0.0-alpha-1-SNAPSHOT-bin.tar.gz

Configuring Fluo

To configure and run Fluo, you will need a running Accumulo instance as well as an observer jar to run with Fluo. If you have not created your own observer jar, you can build an observer jar using the phrasecount example.

First, copy the example configuration files and modify them for you environment.

cd fluo-1.0.0-alpha-1-SNAPSHOT/conf
cp examples/* .
vim fluo-env.sh
vim fluo.properties

Copy your observer jar to Fluo and set up notifications to your observer in fluo.properties. Check out phrasecount to build an example observer jar and find instructions for configuring fluo.properties.

OBSERVER_JAR=<location of observer jar>
cd fluo-1.0.0-alpha-1-SNAPSHOT/
cp $OBSERVER_JAR lib/observers
vim conf/fluo.properties

Finally, initialize your instance which only needs to be called once and stores configuration in zookeeper.

./bin/initialize.sh

Running Fluo

A Fluo instance consists of 1 oracle process and multiple worker processes. These processes can either be run on a YARN cluster or started locally on each machine.

The preferred method to run Fluo applications is using YARN which will start up multiple workers as configured in fluo.properties. To start a Fluo cluster in YARN, run following commands:

./bin/oracle.sh start-yarn
./bin/worker.sh start-yarn

The start-yarn commands above submit your Fluo applications to YARN. Therefore, they work for a single-node or a large cluster. By using YARN, you no longer need to deploy the Fluo binaries to every node on your cluster or start processes on every node.

You can use yarn application -list to check the status of the applications. Logs are viewable within YARN. When finished, you can kill the applications using yarn application -kill <Application ID>. The application ID can be found using the list command.

If you do not have YARN set up, you can start a local Fluo process using the following commands:

./bin/oracle.sh start-local
./bin/worker.sh start-local

In a distributed environment, you will need to deploy the Fluo binary to every node and start each process individually.

To stop Fluo processes, run the following commands:

./bin/worker.sh stop-local
./bin/oracle.sh stop-local

Tuning Accumulo

Fluo will reread the same data frequently when it checks conditions on mutations. When Fluo initializes a table it enables data caching to make this more efficient. However you may need to increase the amount of memory available for caching in the tserver by increasing tserver.cache.data.size. Increasing this may require increasing the maximum tserver java heap size in accumulo-env.sh.

Fluo will run many client threads, will want to ensure the tablet server has enough threads. Should probably increase the tserver.server.threads.minimum Accumulo setting.

Using at least Accumulo 1.6.1 is recommended because multiple performance bugs were fixed.

Additional Documentation

Stress Testing