docs/src/main/resources/examples/README.mapred - accumulo - Git at Google

 Title: Apache Accumulo MapReduce Example
 Notice:    Licensed to the Apache Software Foundation (ASF) under one
            or more contributor license agreements.  See the NOTICE file
            distributed with this work for additional information
            regarding copyright ownership.  The ASF licenses this file
            to you under the Apache License, Version 2.0 (the
            "License"); you may not use this file except in compliance
            with the License.  You may obtain a copy of the License at
            .
              http://www.apache.org/licenses/LICENSE-2.0
            .
            Unless required by applicable law or agreed to in writing,
            software distributed under the License is distributed on an
            "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
            KIND, either express or implied.  See the License for the
            specific language governing permissions and limitations
            under the License.

 This example uses mapreduce and accumulo to compute word counts for a set of
 documents. This is accomplished using a map-only mapreduce job and a
 accumulo table with combiners.

 To run this example you will need a directory in HDFS containing text files.
 The accumulo readme will be used to show how to run this example.

     $ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
     $ hadoop fs -ls /user/username/wc
     Found 1 items
     -rw-r--r--   2 username supergroup       9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

 The first part of running this example is to create a table with a combiner
 for the column family count.

     $ ./bin/accumulo shell -u username -p password
     Shell - Apache Accumulo Interactive Shell
     - version: 1.5.0
     - instance name: instance
     - instance id: 00000000-0000-0000-0000-000000000000
     -
     - type 'help' for a list of available commands
     -
     username@instance> createtable wordCount
     username@instance wordCount> setiter -class org.apache.accumulo.core.iterators.user.SummingCombiner -p 10 -t wordCount -majc -minc -scan
     SummingCombiner interprets Values as Longs and adds them together. A variety of encodings (variable length, fixed length, or string) are available
     ----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: false
     ----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: count
     ----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE|FALSE>: false
     ----------> set SummingCombiner parameter type, <VARLEN|FIXEDLEN|STRING|fullClassName>: STRING
     username@instance wordCount> quit

 After creating the table, run the word count map reduce job.

     $ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers  --input /user/username/wc -t wordCount -u username -p password

     11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
     11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
     11/02/07 18:20:13 INFO mapred.JobClient:  map 0% reduce 0%
     11/02/07 18:20:20 INFO mapred.JobClient:  map 100% reduce 0%
     11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
     11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
     11/02/07 18:20:22 INFO mapred.JobClient:   Job Counters
     11/02/07 18:20:22 INFO mapred.JobClient:     Launched map tasks=1
     11/02/07 18:20:22 INFO mapred.JobClient:     Data-local map tasks=1
     11/02/07 18:20:22 INFO mapred.JobClient:   FileSystemCounters
     11/02/07 18:20:22 INFO mapred.JobClient:     HDFS_BYTES_READ=10487
     11/02/07 18:20:22 INFO mapred.JobClient:   Map-Reduce Framework
     11/02/07 18:20:22 INFO mapred.JobClient:     Map input records=255
     11/02/07 18:20:22 INFO mapred.JobClient:     Spilled Records=0
     11/02/07 18:20:22 INFO mapred.JobClient:     Map output records=1452

 After the map reduce job completes, query the accumulo table to see word
 counts.

     $ ./bin/accumulo shell -u username -p password
     username@instance> table wordCount
     username@instance wordCount> scan -b the
     the count:20080906 []    75
     their count:20080906 []    2
     them count:20080906 []    1
     then count:20080906 []    1
     there count:20080906 []    1
     these count:20080906 []    3
     this count:20080906 []    6
     through count:20080906 []    1
     time count:20080906 []    3
     time. count:20080906 []    1
     to count:20080906 []    27
     total count:20080906 []    1
     tserver, count:20080906 []    1
     tserver.compaction.major.concurrent.max count:20080906 []    1
     ...

 Another example to look at is
 org.apache.accumulo.examples.simple.mapreduce.UniqueColumns. This example
 computes the unique set of columns in a table and shows how a map reduce job
 can directly read a tables files from HDFS.

 One more example available is
 org.apache.accumulo.examples.simple.mapreduce.TokenFileWordCount.
 The TokenFileWordCount example works exactly the same as the WordCount example
 explained above except that it uses a token file rather than giving the
 password directly to the map-reduce job (this avoids having the password
 displayed in the job's configuration which is world-readable).

 To create a token file, use the create-token utility

   $ ./bin/accumulo create-token

 It defaults to creating a PasswordToken, but you can specify the token class
 with -tc (requires the fully qualified class name). Based on the token class,
 it will prompt you for each property required to create the token.

 The last value it prompts for is a local filename to save to. If this file
 exists, it will append the new token to the end. Multiple tokens can exist in
 a file, but only the first one for each user will be recognized.

 Rather than waiting for the prompts, you can specify some options when calling
 create-token, for example

   $ ./bin/accumulo create-token -u root -p secret -f root.pw

 would create a token file containing a PasswordToken for
 user 'root' with password 'secret' and saved to 'root.pw'

 This local file needs to be uploaded to hdfs to be used with the
 map-reduce job. For example, if the file were 'root.pw' in the local directory:

   $ hadoop fs -put root.pw root.pw

 This would put 'root.pw' in the user's home directory in hdfs.

 Because the basic WordCount example uses Opts to parse its arguments
 (which extends ClientOnRequiredTable), you can use a token file with
 the basic WordCount example by calling the same command as explained above
 except replacing the password with the token file (rather than -p, use -tf).

   $ ./bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers  --input /user/username/wc -t wordCount -u username -tf tokenfile

 In the above examples, username was 'root' and tokenfile was 'root.pw'

 However, if you don't want to use the Opts class to parse arguments,
 the TokenFileWordCount is an example of using the token file manually.

   $ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.TokenFileWordCount instance zookeepers username tokenfile /user/username/wc wordCount

 The results should be the same as the WordCount example except that the
 authentication token was not stored in the configuration. It was instead
 stored in a file that the map-reduce job pulled into the distributed cache.
 (If you ran either of these on the same table right after the
 WordCount example, then the resulting counts should just double.)
	Title: Apache Accumulo MapReduce Example
	Notice: Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at
	.
	http://www.apache.org/licenses/LICENSE-2.0
	.
	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	This example uses mapreduce and accumulo to compute word counts for a set of
	documents. This is accomplished using a map-only mapreduce job and a
	accumulo table with combiners.

	To run this example you will need a directory in HDFS containing text files.
	The accumulo readme will be used to show how to run this example.

	$ hadoop fs -copyFromLocal $ACCUMULO_HOME/README /user/username/wc/Accumulo.README
	$ hadoop fs -ls /user/username/wc
	Found 1 items
	-rw-r--r-- 2 username supergroup 9359 2009-07-15 17:54 /user/username/wc/Accumulo.README

	The first part of running this example is to create a table with a combiner
	for the column family count.

	$ ./bin/accumulo shell -u username -p password
	Shell - Apache Accumulo Interactive Shell
	- version: 1.5.0
	- instance name: instance
	- instance id: 00000000-0000-0000-0000-000000000000
	-
	- type 'help' for a list of available commands
	-
	username@instance> createtable wordCount
	username@instance wordCount> setiter -class org.apache.accumulo.core.iterators.user.SummingCombiner -p 10 -t wordCount -majc -minc -scan
	SummingCombiner interprets Values as Longs and adds them together. A variety of encodings (variable length, fixed length, or string) are available
	----------> set SummingCombiner parameter all, set to true to apply Combiner to every column, otherwise leave blank. if true, columns option will be ignored.: false
	----------> set SummingCombiner parameter columns, <col fam>[:<col qual>]{,<col fam>[:<col qual>]} escape non-alphanum chars using %<hex>.: count
	----------> set SummingCombiner parameter lossy, if true, failed decodes are ignored. Otherwise combiner will error on failed decodes (default false): <TRUE\|FALSE>: false
	----------> set SummingCombiner parameter type, <VARLEN\|FIXEDLEN\|STRING\|fullClassName>: STRING
	username@instance wordCount> quit

	After creating the table, run the word count map reduce job.

	$ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers --input /user/username/wc -t wordCount -u username -p password

	11/02/07 18:20:11 INFO input.FileInputFormat: Total input paths to process : 1
	11/02/07 18:20:12 INFO mapred.JobClient: Running job: job_201102071740_0003
	11/02/07 18:20:13 INFO mapred.JobClient: map 0% reduce 0%
	11/02/07 18:20:20 INFO mapred.JobClient: map 100% reduce 0%
	11/02/07 18:20:22 INFO mapred.JobClient: Job complete: job_201102071740_0003
	11/02/07 18:20:22 INFO mapred.JobClient: Counters: 6
	11/02/07 18:20:22 INFO mapred.JobClient: Job Counters
	11/02/07 18:20:22 INFO mapred.JobClient: Launched map tasks=1
	11/02/07 18:20:22 INFO mapred.JobClient: Data-local map tasks=1
	11/02/07 18:20:22 INFO mapred.JobClient: FileSystemCounters
	11/02/07 18:20:22 INFO mapred.JobClient: HDFS_BYTES_READ=10487
	11/02/07 18:20:22 INFO mapred.JobClient: Map-Reduce Framework
	11/02/07 18:20:22 INFO mapred.JobClient: Map input records=255
	11/02/07 18:20:22 INFO mapred.JobClient: Spilled Records=0
	11/02/07 18:20:22 INFO mapred.JobClient: Map output records=1452

	After the map reduce job completes, query the accumulo table to see word
	counts.

	$ ./bin/accumulo shell -u username -p password
	username@instance> table wordCount
	username@instance wordCount> scan -b the
	the count:20080906 [] 75
	their count:20080906 [] 2
	them count:20080906 [] 1
	then count:20080906 [] 1
	there count:20080906 [] 1
	these count:20080906 [] 3
	this count:20080906 [] 6
	through count:20080906 [] 1
	time count:20080906 [] 3
	time. count:20080906 [] 1
	to count:20080906 [] 27
	total count:20080906 [] 1
	tserver, count:20080906 [] 1
	tserver.compaction.major.concurrent.max count:20080906 [] 1
	...

	Another example to look at is
	org.apache.accumulo.examples.simple.mapreduce.UniqueColumns. This example
	computes the unique set of columns in a table and shows how a map reduce job
	can directly read a tables files from HDFS.

	One more example available is
	org.apache.accumulo.examples.simple.mapreduce.TokenFileWordCount.
	The TokenFileWordCount example works exactly the same as the WordCount example
	explained above except that it uses a token file rather than giving the
	password directly to the map-reduce job (this avoids having the password
	displayed in the job's configuration which is world-readable).

	To create a token file, use the create-token utility

	$ ./bin/accumulo create-token

	It defaults to creating a PasswordToken, but you can specify the token class
	with -tc (requires the fully qualified class name). Based on the token class,
	it will prompt you for each property required to create the token.

	The last value it prompts for is a local filename to save to. If this file
	exists, it will append the new token to the end. Multiple tokens can exist in
	a file, but only the first one for each user will be recognized.

	Rather than waiting for the prompts, you can specify some options when calling
	create-token, for example

	$ ./bin/accumulo create-token -u root -p secret -f root.pw

	would create a token file containing a PasswordToken for
	user 'root' with password 'secret' and saved to 'root.pw'

	This local file needs to be uploaded to hdfs to be used with the
	map-reduce job. For example, if the file were 'root.pw' in the local directory:

	$ hadoop fs -put root.pw root.pw

	This would put 'root.pw' in the user's home directory in hdfs.

	Because the basic WordCount example uses Opts to parse its arguments
	(which extends ClientOnRequiredTable), you can use a token file with
	the basic WordCount example by calling the same command as explained above
	except replacing the password with the token file (rather than -p, use -tf).

	$ ./bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.WordCount -i instance -z zookeepers --input /user/username/wc -t wordCount -u username -tf tokenfile

	In the above examples, username was 'root' and tokenfile was 'root.pw'

	However, if you don't want to use the Opts class to parse arguments,
	the TokenFileWordCount is an example of using the token file manually.

	$ bin/tool.sh lib/accumulo-examples-simple.jar org.apache.accumulo.examples.simple.mapreduce.TokenFileWordCount instance zookeepers username tokenfile /user/username/wc wordCount

	The results should be the same as the WordCount example except that the
	authentication token was not stored in the configuration. It was instead
	stored in a file that the map-reduce job pulled into the distributed cache.
	(If you ran either of these on the same table right after the
	WordCount example, then the resulting counts should just double.)