Apache Accumulo Spark Example

Requirements

  • Accumulo 2.0+
  • Hadoop YARN installed & HADOOP_CONF_DIR set in environment
  • Spark installed & SPARK_HOME set in environment

Spark example

The CopyPlus5K example creates an Accumulo table called spark_example_input and writes 100 key/value entries to it, with the values 0..99. It then launches a Spark application that does the following:

  • Reads data from the spark_example_input table using AccumuloInputFormat
  • Adds 5000 to each value
  • Writes the data to a new Accumulo table (called spark_example_output) using one of two methods:
    1. Bulk import - Writes the data to an RFile in HDFS using AccumuloFileOutputFormat, then bulk imports the RFile into the Accumulo table
    2. BatchWriter - Creates a BatchWriter in the Spark code to write to the table
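
The transformation step can be illustrated without a cluster. The following is a minimal in-memory sketch, not the CopyPlus5K source: it uses no Accumulo or Spark APIs, and the class name CopyPlus5KSketch, the helpers buildInput and plus5K, the zero-padded row IDs, and the choice to store values as strings are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory stand-in for the input and output tables; no Accumulo or Spark APIs are used.
class CopyPlus5KSketch {

  // Hypothetical helper: models the input table as sorted row -> value pairs
  // holding the values 0..(count-1) as text. Zero-padded row IDs are an assumption.
  static SortedMap<String, String> buildInput(int count) {
    SortedMap<String, String> table = new TreeMap<>();
    for (int i = 0; i < count; i++) {
      table.put(String.format("%03d", i), Integer.toString(i));
    }
    return table;
  }

  // Models the transformation step: parse each value, add 5000, store it back as text.
  static SortedMap<String, String> plus5K(SortedMap<String, String> input) {
    SortedMap<String, String> output = new TreeMap<>();
    for (Map.Entry<String, String> e : input.entrySet()) {
      int updated = Integer.parseInt(e.getValue()) + 5000;
      output.put(e.getKey(), Integer.toString(updated));
    }
    return output;
  }

  public static void main(String[] args) {
    SortedMap<String, String> output = plus5K(buildInput(100));
    // Print the first and last entries of the simulated output table.
    System.out.println(output.firstKey() + " -> " + output.get(output.firstKey()));
    System.out.println(output.lastKey() + " -> " + output.get(output.lastKey()));
  }
}
```

With 100 input entries, the simulated output holds the values 5000..5099. In the real application the same per-value change is applied to the entries read through AccumuloInputFormat before they are written to spark_example_output.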

This application can be run using the command:

./run.sh batch /path/to/accumulo-client.properties

To use the bulk import method instead, change batch to bulk.