<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Accumulo Word Count example
The WordCount example ([WordCount.java]) uses MapReduce and Accumulo to compute
word counts for a set of documents. This is accomplished using a map-only MapReduce
job and an Accumulo table with combiners.
To run this example, create a directory in HDFS containing text files. You can
use the Accumulo README for data:

```
$ hdfs dfs -mkdir /wc
$ hdfs dfs -copyFromLocal /path/to/accumulo/README.md /wc/README.md
```

Verify that the file was created:

```
$ hdfs dfs -ls /wc
```
After creating the input directory, run the WordCount MapReduce job, passing the
HDFS input directory with `-i`:

```
$ ./bin/runmr mapreduce.WordCount -i /wc
```
[WordCount.java] creates an Accumulo table (`examples.wordcount`) with a SummingCombiner
iterator attached to it. It then runs a map-only MapReduce job that reads text files from
the specified HDFS directory and writes word counts to the Accumulo table.
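The essential flow (each mapper emits a value of 1 per word, and the SummingCombiner
sums the values for a given key at scan time) can be simulated in plain Java. This is a
self-contained sketch, not the actual job: the real mapper writes Accumulo `Mutation`
objects through `AccumuloOutputFormat`. The `count:20080906` column below mirrors the
scan output shown in this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: for every word in a line, emit ("word count:20080906", 1),
    // analogous to one Mutation per word with a value of 1.
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> emitted = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(Map.entry(word + " count:20080906", 1L));
            }
        }
        return emitted;
    }

    // Combiner phase: like a SummingCombiner, collapse all entries that
    // share a key into a single summed value.
    static Map<String, Long> combine(List<Map.Entry<String, Long>> emitted) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> e : emitted) {
            counts.merge(e.getKey(), e.getValue(), Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        var counts = combine(map("the quick fox and the lazy dog and the cat"));
        counts.forEach((k, v) -> System.out.println(k + " " + v));
    }
}
```

Running the sketch shows duplicate emissions collapsing into one summed value per
word, which is exactly the work the SummingCombiner performs server-side.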
After the MapReduce job completes, query the Accumulo table to see word counts:

```
$ accumulo shell
username@instance> table examples.wordcount
username@instance examples.wordcount> scan -b the
the count:20080906 []    75
their count:20080906 []    2
them count:20080906 []    1
then count:20080906 []    1
...
```
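Each line of the shell's scan output follows the pattern `row family:qualifier
[visibility] value`: the row is the word, the column family is `count`, the qualifier
is a date, and the value is the summed count. A small parser (a hedged sketch assuming
this whitespace-separated layout) makes the structure explicit:

```java
public class ScanLineParser {

    // Splits one line of Accumulo shell scan output, e.g.
    //   "the count:20080906 [] 75"
    // into {row, family, qualifier, visibility, value}.
    static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+", 4);
        String[] column = parts[1].split(":", 2);
        String visibility = parts[2].substring(1, parts[2].length() - 1); // strip [ ]
        return new String[] {parts[0], column[0], column[1], visibility, parts[3]};
    }

    public static void main(String[] args) {
        for (String field : parse("the count:20080906 [] 75")) {
            System.out.println(field.isEmpty() ? "(empty)" : field);
        }
    }
}
```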
When the WordCount MapReduce job was run above, the client properties were serialized
into the MapReduce configuration. This is insecure if the properties contain sensitive
information such as passwords. A more secure option is to store accumulo-client.properties
in HDFS and run the job with the `-d` option, which configures the MapReduce job
to obtain the client properties from HDFS:

```
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/myuser
$ hdfs dfs -copyFromLocal /path/to/accumulo/conf/accumulo-client.properties /user/myuser/
$ ./bin/runmr mapreduce.WordCount -i /wc -t examples.wordcount2 -d /user/myuser/accumulo-client.properties
```
After the MapReduce job completes, query the `examples.wordcount2` table. The results
should be the same as before:

```
$ accumulo shell
username@instance> table examples.wordcount2
username@instance examples.wordcount2> scan -b the
the count:20080906 []    75
their count:20080906 []    2
...
```
[WordCount.java]: ../src/main/java/org/apache/accumulo/examples/mapreduce/WordCount.java