This example uses MapReduce and Accumulo to compute word counts for a set of documents. This is accomplished using a map-only MapReduce job and an Accumulo table with combiners.
To run this example you will need a directory in HDFS containing text files. The Accumulo README will be used to show how to run this example.

    $ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
    $ hadoop fs -ls /user/username/wc
    Found 1 items
    -rw-r--r--   2 username supergroup       9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

The first part of running this example is to create a table with a combiner for the column family count.

    $ ./bin/accumulo shell -u username -p password

    Shell - Apache Accumulo Interactive Shell
    -
    - version: 1.5.0
    - instance name: instance
    - instance id: 00000000-0000-0000-0000-000000000000
    -
    - type 'help' for a list of available commands
    -
    username@instance> createtable wordCount
    username@instance wordCount> setiter -class org.apache.accumulo.core.iterators.user.SummingCombiner -p 10 -t wordCount -majc -minc -scan
    SummingCombiner interprets Values as Longs and adds them together. A variety of encodings (variable length, fixed length, or string) are available
    ----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: false
    ----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: count
    ----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>: false
    ----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING
    username@instance wordCount> quit

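To make the effect of the STRING-typed combiner concrete, here is a minimal sketch in plain Java of the kind of merge it performs: values for the same key are read as decimal strings, added as longs, and written back as a decimal string. This is an illustration only and does not use the Accumulo API; the class name StringSumSketch and the method combine are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a summing combiner with STRING encoding.
// It is NOT the Accumulo SummingCombiner class, only a model of its behavior.
class StringSumSketch {

    // Parses each value as a decimal string, sums the values as longs,
    // and returns the total re-encoded as a decimal string.
    static byte[] combine(List<byte[]> values) {
        long sum = 0;
        for (byte[] v : values) {
            sum += Long.parseLong(new String(v, StandardCharsets.UTF_8));
        }
        return Long.toString(sum).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three entries for the same word, each carrying the value "1".
        byte[] one = "1".getBytes(StandardCharsets.UTF_8);
        byte[] total = combine(Arrays.asList(one, one, one));
        System.out.println(new String(total, StandardCharsets.UTF_8)); // prints 3
    }
}
```

Because the combiner is set for scan, minor compaction, and major compaction (-scan -minc -majc), the summed value is what a reader sees regardless of whether the individual entries have been compacted together yet.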
After creating the table, run the word count MapReduce job.

    $ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers --input /user/username/wc -t wordCount -u username -p password

    11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
    11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
    11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
    11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
    11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
    11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
    11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters
    11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
    11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
    11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
    11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
    11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
    11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452

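The counters above show 255 input records (lines) producing 1452 output records with no reduce phase, which is consistent with the mapper writing one entry per word occurrence and leaving the summing to the table's combiner. The following plain-Java sketch models that map step under stated assumptions: lines are split on whitespace (so punctuation stays attached to words), each entry uses the column family count, and each entry carries the string value "1". The class WordCountMapSketch, the Entry type, and the map method are hypothetical and do not use the Hadoop or Accumulo APIs; the qualifier 20080906 is taken from the scan output of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a map-only word count step; not the actual WordCount class.
class WordCountMapSketch {

    // One table entry: row = word, column = family:qualifier, value = count as a string.
    static final class Entry {
        final String row;
        final String family;
        final String qualifier;
        final String value;

        Entry(String row, String family, String qualifier, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    // Emits one entry with value "1" for every whitespace-separated token in the line.
    static List<Entry> map(String line) {
        List<Entry> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the line starts with whitespace
            }
            out.add(new Entry(word, "count", "20080906", "1"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Entry e : map("  the time. the")) {
            System.out.println(e.row + " " + e.family + ":" + e.qualifier + " " + e.value);
        }
    }
}
```

In this model the word "the" is emitted twice with the value "1"; a summing combiner on the count column family would then present those two entries as a single entry with the value 2.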
After the MapReduce job completes, query the Accumulo table to see the word counts.

    $ ./bin/accumulo shell -u username -p password
    username@instance> table wordCount
    username@instance wordCount> scan -b the
    the count:20080906 []    75
    their count:20080906 []    2
    them count:20080906 []    1
    then count:20080906 []    1
    there count:20080906 []    1
    these count:20080906 []    3
    this count:20080906 []    6
    through count:20080906 []    1
    time count:20080906 []    3
    time. count:20080906 []    1
    to count:20080906 []    27
    total count:20080906 []    1
    tserver, count:20080906 []    1
    tserver.compaction.major.concurrent.max count:20080906 []    1
    ...

Another example to look at is org.apache.accumulo.examples.simple.mapreduce.UniqueColumns. This example computes the unique set of columns in a table and shows how a MapReduce job can directly read a table's files from HDFS.