Title: Apache Accumulo MapReduce Example
Notice:    Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
           regarding copyright ownership.  The ASF licenses this file
           to you under the Apache License, Version 2.0 (the
           "License"); you may not use this file except in compliance
           with the License.  You may obtain a copy of the License at
           .
             http://www.apache.org/licenses/LICENSE-2.0
           .
           Unless required by applicable law or agreed to in writing,
           software distributed under the License is distributed on an
           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           KIND, either express or implied.  See the License for the
           specific language governing permissions and limitations
           under the License.

This example uses MapReduce and Accumulo to compute word counts for a set
of documents. This is accomplished using a map-only MapReduce job and an
Accumulo table with combiners.
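
The core of the word count is the mapper. Below is a minimal sketch of that
map-only approach against the 1.4-era API (the package and class names are
illustrative, not necessarily the shipped example's): each word becomes a
Mutation that writes a "1" into the count column family, and the table's
combiner performs the summing at scan and compaction time.

    package example; // illustrative package name

    import java.io.IOException;

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable,Text,Text,Mutation> {
      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty())
            continue;
          // One mutation per word: row = word, column = count:<qualifier>,
          // value = "1". The SummingCombiner on the table adds these up.
          Mutation m = new Mutation(new Text(word));
          m.put(new Text("count"), new Text("20080906"), new Value("1".getBytes()));
          context.write(null, m); // null table name = the job's default table
        }
      }
    }

Note that the mapper emits no counts itself; because a combiner is attached
to the table, inserting many "1" values yields a single summed value.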

To run this example you will need a directory in HDFS containing text
files. The Accumulo README will be used to show how to run this example.

    $ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
    $ hadoop fs -ls /user/username/wc
    Found 1 items
    -rw-r--r-- 2 username supergroup 9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

The first part of running this example is to create a table with a
combiner for the column family "count".

    $ ./bin/accumulo shell -u username -p password
    Shell - Apache Accumulo Interactive Shell
    - version: 1.4.x
    - instance name: instance
    - instance id: 00000000-0000-0000-0000-000000000000
    -
    - type 'help' for a list of available commands
    -
    username@instance> createtable wordCount
    username@instance wordCount> setiter -class org.apache.accumulo.core.iterators.user.SummingCombiner -p 10 -t wordCount -majc -minc -scan
    SummingCombiner interprets Values as Longs and adds them together. A variety of encodings (variable length, fixed length, or string) are available
    ----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: false
    ----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: count
    ----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>: false
    ----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING
    username@instance wordCount> quit
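
For reference, the same table setup can be done with the Java client API
instead of the shell. This is a hedged sketch against the 1.4-era API; the
instance name, zookeepers, and credentials are placeholders for your own.

    import java.util.Collections;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class SetupWordCountTable {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("instance", "zookeepers")
            .getConnector("username", "password".getBytes());
        conn.tableOperations().create("wordCount");
        // Mirrors the setiter command above: priority 10, STRING encoding,
        // applied to the "count" column family.
        IteratorSetting is = new IteratorSetting(10, SummingCombiner.class);
        SummingCombiner.setColumns(is,
            Collections.singletonList(new IteratorSetting.Column("count")));
        SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);
        // attachIterator applies to all scopes (minc, majc, scan) by default.
        conn.tableOperations().attachIterator("wordCount", is);
      }
    }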

After creating the table, run the word count MapReduce job.

    $ bin/tool.sh lib/examples-simple*[^c].jar org.apache.accumulo.examples.simple.mapreduce.WordCount instance zookeepers /user/username/wc wordCount -u username -p password

    11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
    11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
    11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
    11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
    11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
    11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
    11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters
    11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
    11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
    11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
    11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
    11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
    11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
    11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452
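
The driver that tool.sh invokes configures a map-only job that writes
Mutations through AccumuloOutputFormat. The following is a hedged sketch of
such a driver against the 1.4-era API (class names and connection details
are illustrative placeholders); note the zero reduce tasks, since the
combiner on the table takes the place of a reduce phase.

    import org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordCount");
        job.setJarByClass(WordCountDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.setInputPaths(job, new Path("/user/username/wc"));
        job.setMapperClass(WordCountMapper.class); // mapper sketched earlier
        job.setNumReduceTasks(0); // map-only; the table combiner does the sums
        job.setOutputFormatClass(AccumuloOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Mutation.class);
        // Placeholder instance/credentials; wordCount is the default table.
        AccumuloOutputFormat.setZooKeeperInstance(job.getConfiguration(),
            "instance", "zookeepers");
        AccumuloOutputFormat.setOutputInfo(job.getConfiguration(), "username",
            "password".getBytes(), false, "wordCount");
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }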

After the MapReduce job completes, query the Accumulo table to see word
counts.

    $ ./bin/accumulo shell -u username -p password
    username@instance> table wordCount
    username@instance wordCount> scan -b the
    the count:20080906 [] 75
    their count:20080906 [] 2
    them count:20080906 [] 1
    then count:20080906 [] 1
    there count:20080906 [] 1
    these count:20080906 [] 3
    this count:20080906 [] 6
    through count:20080906 [] 1
    time count:20080906 [] 3
    time. count:20080906 [] 1
    to count:20080906 [] 27
    total count:20080906 [] 1
    tserver, count:20080906 [] 1
    tserver.compaction.major.concurrent.max count:20080906 [] 1
    ...
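
The same counts can be read programmatically with a Scanner. A minimal
sketch, again with placeholder connection details, mirroring the shell's
"scan -b the":

    import java.util.Map.Entry;

    import org.apache.accumulo.core.Constants;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;

    public class ReadWordCounts {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        Connector conn = new ZooKeeperInstance("instance", "zookeepers")
            .getConnector("username", "password".getBytes());
        Scanner scanner = conn.createScanner("wordCount", Constants.NO_AUTHS);
        scanner.setRange(new Range("the", null)); // begin at row "the"
        for (Entry<Key,Value> entry : scanner)
          System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
      }
    }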

Another example to look at is
org.apache.accumulo.examples.simple.mapreduce.UniqueColumns. This example
computes the unique set of columns in a table and shows how a MapReduce
job can read a table's files directly from HDFS.