Apache Accumulo Sampling Example

Basic Sampling Example

Accumulo supports building a set of sample data that scanners can access efficiently. Which data is included in the sample set is configurable. The commands below insert some data representing documents.

root@instance> createnamespace examples
root@instance> createtable examples.sampex
root@instance examples.sampex> insert 9255 doc content 'abcde'
root@instance examples.sampex> insert 9255 doc url file://foo.txt
root@instance examples.sampex> insert 8934 doc content 'accumulo scales'
root@instance examples.sampex> insert 8934 doc url file://accumulo_notes.txt
root@instance examples.sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano'
root@instance examples.sampex> insert 2317 doc url file://groceries/9.txt
root@instance examples.sampex> insert 3900 doc content 'EC2 ate my homework'
root@instance examples.sampex> insert 3900 doc uril file://final_project.txt

Below, the table examples.sampex is configured to build a sample set. The configuration causes Accumulo to include any row where murmur3_32(row) % 3 == 0 in the table's sample data.

root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.hasher=murmur3_32
root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.modulus=3
root@instance examples.sampex> config -t examples.sampex -s table.sampler=org.apache.accumulo.core.client.sample.RowSampler
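
The selection rule configured above can be sketched in a few lines of Java. This is only an illustration, not Accumulo's RowSampler: the class and method names are hypothetical, and CRC32 from the JDK stands in for murmur3_32, so the rows it selects will not match the rows Accumulo selects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative only: hypothetical names, not Accumulo's RowSampler.
// CRC32 is a stand-in for murmur3_32 (assumption made to stay JDK-only).
final class RowSampleSketch {

  // Hash the row ID with the stand-in hash function.
  static long hash(String row) {
    CRC32 crc = new CRC32();
    crc.update(row.getBytes(StandardCharsets.UTF_8));
    return crc.getValue(); // non-negative 32-bit value held in a long
  }

  // A row belongs to the sample when hash(row) % modulus == 0.
  static boolean inSample(String row, int modulus) {
    return hash(row) % modulus == 0;
  }

  // Keep only the rows that fall in the sample.
  static List<String> sample(List<String> rows, int modulus) {
    List<String> kept = new ArrayList<>();
    for (String row : rows) {
      if (inSample(row, modulus)) {
        kept.add(row);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> rows = Arrays.asList("9255", "8934", "2317", "3900");
    System.out.println("modulus=3: " + sample(rows, 3));
    System.out.println("modulus=2: " + sample(rows, 2));
  }
}
```

Because membership depends only on the row ID and the modulus, every entry in a selected row is part of the sample, and a larger modulus yields a smaller sample.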

Below, attempting to scan the sample returns an error. This is because data was inserted before the sample set was configured.

root@instance examples.sampex> scan --sample
2015-09-09 12:21:50,643 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built

To remedy this problem, the following command flushes in-memory data and compacts any files that do not contain the correct sample data.

root@instance examples.sampex> compact -t examples.sampex --sf-no-sample

After the compaction, the sample scan works.

root@instance examples.sampex> scan --sample
2317 doc:content []    milk, eggs, bread, parmigiano-reggiano
2317 doc:url []    file://groceries/9.txt

The commands below show that updates to data in the sample are seen when scanning the sample.

root@instance examples.sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano, butter'
root@instance examples.sampex> scan --sample
2317 doc:content []    milk, eggs, bread, parmigiano-reggiano, butter
2317 doc:url []    file://groceries/9.txt

To make scanning the sample fast, sample data is partitioned as data is written to Accumulo. This means that if the sample configuration is changed, data written previously was partitioned using different criteria. Accumulo detects this situation and fails sample scans. The commands below show this failure and how a compaction fixes the problem.

root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.modulus=2
root@instance examples.sampex> scan --sample
2015-09-09 12:22:51,058 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built
root@instance examples.sampex> compact -t examples.sampex --sf-no-sample
2015-09-09 12:23:07,242 [shell.Shell] INFO : Compaction of table sampex started for given range
root@instance examples.sampex> scan --sample
2317 doc:content []    milk, eggs, bread, parmigiano-reggiano, butter
2317 doc:url []    file://groceries/9.txt
3900 doc:content []    EC2 ate my homework
3900 doc:uril []    file://final_project.txt
9255 doc:content []    abcde
9255 doc:url []    file://foo.txt

The example above is replicated in a Java program that uses the Accumulo API. The command below names the program and runs it.

./bin/runex sample.SampleExample

The commands below look under the hood to give some insight into how this feature works. They determine which files the sampex table is using.

root@instance> tables -l
accumulo.metadata    =>        !0
accumulo.replication =>      +rep
accumulo.root        =>        +r
examples.sampex      =>         2
trace                =>         1
root@instance sampex> scan -t accumulo.metadata -c file -b 2 -e 2<
2< file:hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf []    702,8

The output below comes from running accumulo rfile-info on the file above. It shows that the RFile has a normal default locality group and a sample default locality group. The output also shows the configuration used to create the sample locality group. The sample configuration within an RFile must match the table's sample configuration for a sample scan to work.

$ accumulo rfile-info hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf
RFile Version            : 8

Locality group           : <DEFAULT>
	Start block            : 0
	Num   blocks           : 1
	Index level 0          : 35 bytes  1 blocks
	First key              : 2317 doc:content [] 1437672014986 false
	Last key               : 9255 doc:url [] 1437672014875 false
	Num entries            : 8
	Column families        : [doc]

Sample Configuration     :
	Sampler class          : org.apache.accumulo.core.client.sample.RowSampler
	Sampler options        : {hasher=murmur3_32, modulus=2}

Sample Locality group    : <DEFAULT>
	Start block            : 0
	Num   blocks           : 1
	Index level 0          : 36 bytes  1 blocks
	First key              : 2317 doc:content [] 1437672014986 false
	Last key               : 9255 doc:url [] 1437672014875 false
	Num entries            : 6
	Column families        : [doc]

Meta block     : BCFile.index
      Raw size             : 4 bytes
      Compressed size      : 12 bytes
      Compression type     : gz

Meta block     : RFile.index
      Raw size             : 309 bytes
      Compressed size      : 176 bytes
      Compression type     : gz
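
The matching requirement described above can be sketched as a simple equality check. This is a minimal illustration with hypothetical class and method names; it does not reproduce Accumulo's internal check, only the rule that a file's stored sampler class and options must equal the table's current ones.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical names, not Accumulo's internal classes.
final class SampleConfigSketch {
  final String samplerClass;
  final Map<String, String> options;

  SampleConfigSketch(String samplerClass, Map<String, String> options) {
    this.samplerClass = samplerClass;
    this.options = new HashMap<>(options); // defensive copy
  }

  // A file can serve a sample scan only if it holds sample data built with
  // exactly the sampler class and options the table uses now.
  // A null file config models a file written without any sample data.
  static boolean canServeSampleScan(SampleConfigSketch table, SampleConfigSketch file) {
    if (table == null || file == null) {
      return false;
    }
    return table.samplerClass.equals(file.samplerClass)
        && table.options.equals(file.options);
  }

  // Helper that builds the RowSampler configuration used in this example.
  static SampleConfigSketch rowSampler(String modulus) {
    Map<String, String> opts = new HashMap<>();
    opts.put("hasher", "murmur3_32");
    opts.put("modulus", modulus);
    return new SampleConfigSketch(
        "org.apache.accumulo.core.client.sample.RowSampler", opts);
  }

  public static void main(String[] args) {
    SampleConfigSketch fileConfig = rowSampler("3");  // file written under modulus=3
    SampleConfigSketch tableConfig = rowSampler("2"); // table later changed to modulus=2
    System.out.println(canServeSampleScan(tableConfig, fileConfig));      // false
    System.out.println(canServeSampleScan(tableConfig, rowSampler("2"))); // true
  }
}
```

In this model, compacting with --sf-no-sample corresponds to rewriting every file for which the check returns false.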

Shard Sampling Example

Note: Before continuing, you need to complete the Shard example.

The Shard example shows how to index and search files using Accumulo. That example indexes documents into a table named examples.shard. The indexing scheme used in that example places the document name in the column qualifier. A useful sample of this indexing scheme should contain all data for any document in the sample. To accomplish this, the following commands build a sample for the shard table based on the column qualifier.

root@instance examples.shard> config -t examples.shard -s table.sampler.opt.hasher=murmur3_32
root@instance examples.shard> config -t examples.shard -s table.sampler.opt.modulus=101
root@instance examples.shard> config -t examples.shard -s table.sampler.opt.qualifier=true
root@instance examples.shard> config -t examples.shard -s table.sampler=org.apache.accumulo.core.client.sample.RowColumnSampler
root@instance examples.shard> compact -t examples.shard --sf-no-sample -w
2015-07-23 15:00:09,280 [shell.Shell] INFO : Compacting table ...
2015-07-23 15:00:10,134 [shell.Shell] INFO : Compaction of table shard completed for given range
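
The sketch below illustrates why hashing the column qualifier keeps a document's data together in the sample. It is not Accumulo's RowColumnSampler: the names are hypothetical, CRC32 stands in for murmur3_32, and the row and column family values are made-up placeholders. Only the qualifier (the document name) feeds the hash, so every entry for a given document is either in the sample or out of it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative only: hypothetical names, not Accumulo's RowColumnSampler.
final class QualifierSampleSketch {

  // Minimal stand-in for an index entry; the qualifier holds the document name.
  static final class Entry {
    final String row;
    final String family;
    final String qualifier;

    Entry(String row, String family, String qualifier) {
      this.row = row;
      this.family = family;
      this.qualifier = qualifier;
    }

    @Override
    public String toString() {
      return row + " " + family + ":" + qualifier;
    }
  }

  // CRC32 is a stand-in for murmur3_32 (assumption made to stay JDK-only).
  static long hash(String text) {
    CRC32 crc = new CRC32();
    crc.update(text.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }

  // Only the qualifier is hashed, mirroring qualifier=true with no other fields.
  static boolean inSample(Entry e, int modulus) {
    return hash(e.qualifier) % modulus == 0;
  }

  static List<Entry> sample(List<Entry> entries, int modulus) {
    List<Entry> kept = new ArrayList<>();
    for (Entry e : entries) {
      if (inSample(e, modulus)) {
        kept.add(e);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Placeholder rows and terms; the file names are invented for illustration.
    List<Entry> entries = Arrays.asList(
        new Entry("shard01", "import", "Foo.java"),
        new Entry("shard01", "int", "Foo.java"),
        new Entry("shard02", "import", "Bar.java"));
    System.out.println(sample(entries, 2));
  }
}
```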

After enabling sampling, the command below counts the number of documents in the sample containing the words import and int.

$ ./bin/runex shard.Query --sample -t examples.shard import int | fgrep '.java' | wc
      4       4     395

The command below counts the total number of documents containing the words import and int.

$ ./bin/runex shard.Query -t examples.shard import int | fgrep '.java' | wc
    382     382   40084

A count of 4 in the sample against 382 in total is around what would be expected for a modulus of 101, since 382 / 101 is roughly 3.8. Querying the sample first provides a quick way to estimate how much data the real query will bring back.
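
The arithmetic behind that estimate can be written out as a short sketch. The class and method names are hypothetical, and the calculation assumes that hash values are spread evenly, so roughly one document in every modulus documents lands in the sample.

```java
// Illustrative only: hypothetical helper for back-of-the-envelope estimates.
final class SampleEstimateSketch {

  // Expected number of sampled documents, assuming evenly spread hashes.
  static double expectedSampleCount(long totalDocs, int modulus) {
    return (double) totalDocs / modulus;
  }

  // Rough estimate of the full result size from a sample count.
  static long estimateTotal(long sampleCount, int modulus) {
    return sampleCount * modulus;
  }

  public static void main(String[] args) {
    // Figures from the two queries above: 382 matches in total, 4 in the sample.
    System.out.println(expectedSampleCount(382, 101)); // about 3.78
    System.out.println(estimateTotal(4, 101));         // 404
  }
}
```

With only a handful of sampled documents the estimate is coarse: 404 against an actual 382 here.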

Another way to use sample data with the shard example is with a specialized iterator. The examples source code contains an iterator named CutoffIntersectingIterator. This iterator first checks how many documents are found in the sample data. If too many documents are found in the sample data, it returns nothing. Otherwise, it proceeds to query the full data set. To experiment with this iterator, use the following command. The --sampleCutoff option below causes the query to return nothing if, based on the sample, it appears that the query would return more than 1000 documents.

$ ./bin/runex shard.Query --sampleCutoff 1000 -t examples.shard import int | fgrep '.java' | wc
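
The cutoff idea can be sketched as a guard in front of the full query. This is not the code of CutoffIntersectingIterator: the names are hypothetical, and scaling the sample count by the modulus is an assumption about how a sample count could be turned into an estimate of the full result size.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: hypothetical names, not CutoffIntersectingIterator.
final class SampleCutoffSketch {

  // Run the full query only when the sample suggests the result is small enough.
  static List<String> queryWithCutoff(long sampleCount, int modulus, long cutoff,
      Supplier<List<String>> fullQuery) {
    // Assumption: about one document in every `modulus` is sampled.
    long estimatedTotal = sampleCount * modulus;
    if (estimatedTotal > cutoff) {
      return Collections.emptyList(); // too many expected results: return nothing
    }
    return fullQuery.get(); // otherwise query the full data set
  }

  public static void main(String[] args) {
    // Placeholder full query; the document names are invented.
    Supplier<List<String>> fullQuery = () -> Arrays.asList("A.java", "B.java");

    // 4 sampled documents * 101 = 404, which is under 1000, so the query runs.
    System.out.println(queryWithCutoff(4, 101, 1000, fullQuery));
    // With a cutoff of 100 the same estimate is too large, so nothing is returned.
    System.out.println(queryWithCutoff(4, 101, 100, fullQuery));
  }
}
```

Passing the full query as a Supplier keeps the expensive work lazy, so it is never started when the estimate exceeds the cutoff.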