Apache Accumulo Unique Columns example

The UniqueColumns examples (UniqueColumns.java) computes the unique set of column family and column qualifiers in a table. It also demonstrates how a mapReduce job can directly read a tables files from HDFS.

Create a table and add rows that all have identical column family and column qualifiers.

$ /path/to/accumulo shell -u username -p secret
username@instance> createnamespace examples
username@instance> createtable examples.unique
username@instance examples.unique> insert row1 fam1 qual1 v1
username@instance examples.unique> insert row2 fam1 qual1 v2
username@instance examples.unique> insert row3 fam1 qual1 v3

Exit the Accumulo shell and run the uniqueColumns mapReduce job against this table. Note that if the output file already exists in HDFS, it will need to be deleted.

$ ./bin/runmr mapreduce.UniqueColumns --table examples.unique --reducers 1 --output /tmp/unique

When the mapReduce job completes, examine the output.

$ hdfs dfs -cat /tmp/unique/part-r-00000
cf:fam1
cq:qual1

The output displays the unique column family and column qualifier values. In this case since all rows use the same values, there are only two values output.

Note that since the example used only one reducer all output will be contained within the single part-r-00000 file. If more than one reducer is used the output will be spread among various part-r-xxxxx files.

Go back to the shell and add some additional entries.

$ /path/to/accumulo shell -u username -p secret
username@instance> table unique
username@instance example.unique> insert row1 fam2 qual2 v2
username@instance example.unique> insert row1 fam3 qual2 v2
username@instance example.unique> insert row1 fam2 qual2 v2
username@instance example.unique> insert row2 fam2 qual2 v2
username@instance example.unique> insert row3 fam2 qual2 v2
username@instance example.unique> insert row3 fam3 qual3 v2
username@instance example.unique> insert row3 fam3 qual4 v2

Re-running the command will now find any additional unique column values.

$ hdfs dfs -rm -r -f /tmp/unique
$ ./bin/runmr mapreduce.UniqueColumns --table examples.unique --reducers 1 --output /tmp/unique 
$ hdfs dfs -cat /tmp/unique/part-r-00000
cf:fam1
cf:fam2
cf:fam3
cq:qual1
cq:qual2
cq:qual3
cq:qual4

The output now includes the additional column values that were added during the last batch of inserts.