Continuous Query and Ingest

This directory contains a suite of scripts for placing continuous query and ingest load on Accumulo. The purpose of these scripts is two-fold. First, they place continuous load on Accumulo to see if it breaks. Second, they collect statistics in order to understand how Accumulo behaves under that load.

To run these scripts, copy all of the .example files and modify them. You can put these scripts in the current directory or define a CONTINUOUS_CONF_DIR where the files will be read from. These scripts rely on pssh. Before running any script you may need to use pssh to create the log directory on each machine (if you want it local). Also, create the table "ci" before running. You can run org.apache.accumulo.test.continuous.GenSplits to generate split points for a continuous ingest table.
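
A minimal setup sketch follows. The hosts file, paths, credentials, and split count are placeholders for your environment, and the GenSplits argument shown (a tablet count) is an assumption, so check the tool's usage message:

$ for f in *.example; do cp "$f" "${f%.example}"; done   # copy each config file, then edit it
$ export CONTINUOUS_CONF_DIR=/path/to/conf               # optional; otherwise files are read from the current directory
$ pssh -h hosts.txt "mkdir -p /path/to/logs"             # create the log directory on each machine
$ accumulo shell -u root -p secret -e "createtable ci"
$ accumulo org.apache.accumulo.test.continuous.GenSplits 100 > /tmp/splits.txt
$ accumulo shell -u root -p secret -e "addsplits -t ci -sf /tmp/splits.txt"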

The following ingest scripts insert data into Accumulo that will form a random graph.

$ start-ingest.sh
$ stop-ingest.sh
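
To confirm an ingester is running on every host, a pssh check like the following may help; the ingesters.txt hosts file and the ContinuousIngest class-name match are assumptions about how start-ingest.sh launches the clients:

$ pssh -h ingesters.txt "pgrep -lf continuous.ContinuousIngest"   # expect one process per host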

The following query scripts randomly walk the graph created by the ingesters. Each walker produces detailed statistics on query/scan times.

$ start-walkers.sh
$ stop-walkers.sh
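
The walker statistics end up in the per-machine log directory mentioned above. A quick way to sample them, assuming a walkers.txt hosts file and an illustrative log path and file pattern:

$ pssh -h walkers.txt "tail -n 20 /path/to/logs/*walk*.out"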

The following scripts start and stop batch walkers.

$ start-batchwalkers.sh
$ stop-batchwalkers.sh

And the following scripts start and stop scanners.

$ start-scanners.sh
$ stop-scanners.sh

In addition to placing continuous load, the following scripts start and stop a service that continually collects statistics about Accumulo and HDFS.

$ start-stats.sh
$ stop-stats.sh

Optionally, start the agitator to periodically kill tablet server and/or datanode processes on random nodes. You can run this script as root, and it will properly start processes as the users you configured in continuous-env.sh (HDFS_USER for the datanode and ACCUMULO_USER for Accumulo processes). If you run it as yourself and the HDFS_USER and ACCUMULO_USER values are the same as your user, the agitator will not change users. If you run the agitator as a non-privileged user that is not the same as HDFS_USER or ACCUMULO_USER, the agitator will attempt to sudo to these users, which relies on correct configuration of sudo. Also, be sure that your HDFS_USER has password-less ssh configured.

$ start-agitator.sh
$ stop-agitator.sh
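
A sketch of the user-related settings in continuous-env.sh; the values are illustrative, and only the HDFS_USER and ACCUMULO_USER names come from the description above:

# continuous-env.sh (excerpt)
HDFS_USER=hdfs          # user to start the datanode as
ACCUMULO_USER=accumulo  # user to start Accumulo processes as

$ sudo start-agitator.sh   # run as root so the agitator can switch to the configured users itself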

Start the ingest, walker, and stats services and let them run for a few hours. Then run report.pl to generate a simple HTML report containing plots and histograms showing what has transpired.

A MapReduce job to verify all data created by continuous ingest can be run with the following command. Before running the command, modify the VERIFY_* variables in continuous-env.sh if needed. Do not run ingest while running this command; doing so will cause erroneous reporting of UNDEFINED nodes. The MapReduce job will scan a reference after it has scanned the definition.

$ run-verify.sh
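
The VERIFY_* variables control the job's output location and parallelism. The specific names and values below are illustrative, so check continuous-env.sh.example for the authoritative list:

# continuous-env.sh (excerpt)
VERIFY_OUT=/tmp/continuous_verify   # output directory for the verification job
VERIFY_MAX_MAPS=64                  # mapper parallelism
VERIFY_REDUCERS=32                  # reducer parallelism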

Each entry inserted by continuous ingest, except for the first batch of entries, references a previously flushed entry. Since flushed entries are referenced, they should always exist. The MapReduce job checks that all referenced entries exist. If it finds any that do not exist, it increments the UNDEFINED counter and emits the referenced but undefined node. The MapReduce job produces two other counts: REFERENCED and UNREFERENCED. It is expected that both counts are non-zero. REFERENCED counts nodes that are defined and referenced. UNREFERENCED counts nodes that are defined but unreferenced; these are the latest nodes inserted.

To stress Accumulo, run the following script, which starts a MapReduce job that reads and writes to your continuous ingest table. This MapReduce job will write out an entry for every entry in the table (except for ones created by the MapReduce job itself). Stop ingest before running this MapReduce job. Do not run more than one instance of this MapReduce job concurrently against a table.

$ run-moru.sh
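
Per the warnings above, a typical stress cycle might look like the following; running the verification job afterward is optional but a reasonable sanity check:

$ stop-ingest.sh   # make sure no ingesters are writing
$ run-moru.sh      # rewrite every entry in the table
$ run-verify.sh    # confirm the graph is still intact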