---
title: MapReduce Example
---
This example uses MapReduce and Accumulo to compute word counts for a set of
documents. This is accomplished with a map-only MapReduce job and an Accumulo
table with aggregators.

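The heart of the job is a mapper that tokenizes each line of input and writes
one Mutation per word: the word becomes the row, `count` the column family,
and the value is `1`. The table's aggregator then sums those values
server-side, so no reduce phase is needed. A minimal sketch of such a mapper
(class and member names here are illustrative, not the exact example source):

```java
import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable,Text,Text,Mutation> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line on whitespace; each token becomes one row in the table.
    for (String word : value.toString().split("\\s+")) {
      if (word.isEmpty())
        continue;
      Mutation mutation = new Mutation(new Text(word));
      // Column family "count"; the qualifier is an arbitrary date, matching
      // the scan output shown below. Every occurrence writes "1", and the
      // table's aggregator adds the values up.
      mutation.put(new Text("count"), new Text("20080906"), new Value("1".getBytes()));
      // A null key tells AccumuloOutputFormat to use the default output table.
      context.write(null, mutation);
    }
  }
}
```
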
To run this example you will need a directory in HDFS containing text files.
This walkthrough uses the Accumulo README as the input:

```
$ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
$ hadoop fs -ls /user/username/wc
Found 1 items
-rw-r--r-- 2 username supergroup 9359 2009-07-15 17:54 /user/username/wc/Accumulo.README
```

The first step is to create a table with aggregation configured for the
`count` column family:

```
$ ./bin/accumulo shell -u username -p password
Shell - Apache Accumulo Interactive Shell
- version: 1.3.x-incubating
- instance name: instance
- instance id: 00000000-0000-0000-0000-000000000000
-
- type 'help' for a list of available commands
-
username@instance> createtable wordCount -a count=org.apache.accumulo.core.iterators.aggregation.StringSummation
username@instance wordCount> quit
```

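The `-a count=...StringSummation` option attaches an aggregator to the
`count` column family. In this version of Accumulo an aggregator implements
the `org.apache.accumulo.core.iterators.aggregation.Aggregator` interface,
and `StringSummation` treats each value as a string-encoded long and sums
them as entries are combined. A hedged sketch of that behavior (illustrative
only, not the actual library source):

```java
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.aggregation.Aggregator;

// Illustrative: a summing aggregator equivalent in spirit to StringSummation.
// Values are longs encoded as decimal strings.
public class SummingAggregator implements Aggregator {
  private long sum = 0;

  public void reset() {
    sum = 0;
  }

  public void collect(Value value) {
    // Each collected value is one partial count, e.g. the "1"s the mapper wrote.
    sum += Long.parseLong(new String(value.get()));
  }

  public Value aggregate() {
    return new Value(Long.toString(sum).getBytes());
  }
}
```

Because aggregation happens at scan and compaction time, many entries with
the same key collapse into a single summed entry without any reduce step.
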
After creating the table, run the word count MapReduce job:

```
[user1@instance accumulo]$ bin/tool.sh lib/accumulo-examples-*[^c].jar org.apache.accumulo.examples.mapreduce.WordCount instance zookeepers /user/user1/wc wordCount -u username -p password
11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters
11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452
```

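The arguments above are the instance name, the zookeepers, the HDFS input
directory, and the output table. A driver for a job like this might be
configured roughly as follows; the `AccumuloOutputFormat` setter signatures
shown are from the 1.3/1.4 era and changed in later releases, so treat this
as a sketch rather than the exact example source:

```java
import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    String instance = args[0], zookeepers = args[1], input = args[2], table = args[3];

    Job job = new Job();
    job.setJobName("wordCount");
    job.setJarByClass(WordCountDriver.class);

    // Read plain text files from HDFS.
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(input));

    job.setMapperClass(WordCountMapper.class);
    // Map-only: mutations are written directly to Accumulo, and the table's
    // aggregator does the summing, so no reduce phase is needed.
    job.setNumReduceTasks(0);

    job.setOutputFormatClass(AccumuloOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Mutation.class);
    // Credentials hardcoded for brevity; the real example takes -u and -p flags.
    AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(), instance, zookeepers);
    AccumuloOutputFormat.setOutputInfo(job.getConfiguration(), "username",
        "password".getBytes(), true, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```
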
After the MapReduce job completes, query the Accumulo table to see the word
counts:

```
$ ./bin/accumulo shell -u username -p password
username@instance> table wordCount
username@instance wordCount> scan -b the
the count:20080906 [] 75
their count:20080906 [] 2
them count:20080906 [] 1
then count:20080906 [] 1
there count:20080906 [] 1
these count:20080906 [] 3
this count:20080906 [] 6
through count:20080906 [] 1
time count:20080906 [] 3
time. count:20080906 [] 1
to count:20080906 [] 27
total count:20080906 [] 1
tserver, count:20080906 [] 1
tserver.compaction.major.concurrent.max count:20080906 [] 1
...
```
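
The same data can also be read programmatically. A minimal client sketch
using the classic (pre-2.0) connector API, with the same placeholder
instance, user, and password used throughout this example:

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class ReadWordCounts {
  public static void main(String[] args) throws Exception {
    Connector connector = new ZooKeeperInstance("instance", "zookeepers")
        .getConnector("username", "password".getBytes());

    // Scan the wordCount table starting at row "the", mirroring `scan -b the`.
    Scanner scanner = connector.createScanner("wordCount", new Authorizations());
    scanner.setRange(new Range(new Text("the"), null));

    for (Entry<Key,Value> entry : scanner) {
      // The row is the word; the value is the aggregated count.
      System.out.println(entry.getKey().getRow() + " = " + entry.getValue());
    }
  }
}
```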