The WordCount example (WordCount.java) uses MapReduce and Accumulo to compute word counts for a set of documents. This is accomplished using a map-only MapReduce job and a Accumulo table with combiners.
To run this example, create a directory in HDFS containing text files. You can use the Accumulo README for data:
$ hdfs dfs -mkdir /wc $ hdfs dfs -copyFromLocal /path/to/accumulo/README.md /wc/README.md
Verify that the file was created:
$ hdfs dfs -ls /wc
After creating the table, run the WordCount MapReduce job with your HDFS input directory:
$ ./bin/runmr mapreduce.WordCount -i /wc
WordCount.java creates an Accumulo table (named with a SummingCombiner iterator attached to it. It runs a map-only M/R job that reads the specified HDFS directory containing text files and writes word counts to Accumulo table.
After the MapReduce job completes, query the Accumulo table to see word counts.
$ accumulo shell username@instance> table wordCount username@instance wordCount> scan -b the the count:20080906 [] 75 their count:20080906 [] 2 them count:20080906 [] 1 then count:20080906 [] 1 ...
When the WordCount MapReduce job was run above, the client properties were serialized into the MapReduce configuration. This is insecure if the properties contain sensitive information like passwords. A more secure option is store accumulo-client.properties in HDFS and run th job with the -D
options. This will configure the MapReduce job to obtain the client properties from HDFS:
$ hdfs dfs -copyFromLocal ./conf/accumulo-client.properties /user/myuser/ $ ./bin/runmr mapreduce.WordCount -i /wc -t wordCount2 -d /user/myuser/accumulo-client.properties
After the MapReduce job completes, query the wordCount2
table. The results should be the same as before:
$ accumulo shell username@instance> table wordCount username@instance wordCount> scan -b the the count:20080906 [] 75 their count:20080906 [] 2 ...