This example shows how to create a table with bloom filters enabled. The second part shows how bloom filters increase query performance when looking for values that do not exist in a table.
Accumulo data is divided into tablets and each tablet has multiple r-files. Lookup performance of a tablet with 3 r-files can be 3 times slower than a tablet with one r-file. However if the files contain unique sets of data, then bloom filters can help with performance.
Run the example below to create two identical tables. One table has bloom filters enabled, the other does not. The major compaction ratio was increased to prevent the files from being compacted into one file. If Accumulo is not configured with enough memory to hold 1 million rows then more r-files will be created.
$ ./bin/runex bloom.BloomFilters
Run the example below to perform 500 lookups against each table. Even though only one r-file will likely contain entries for the query, all files will be interrogated.
$ ./bin/runex bloom.BloomBatchScanner Scanning bloom_test1 with seed 7 Scan finished! 282.49 lookups/sec, 1.77 secs, 500 results All expected rows were scanned Scanning bloom_test2 with seed 7 Scan finished! 704.23 lookups/sec, 0.71 secs, 500 results All expected rows were scanned
You can verify the table has three or more r-files by looking in HDFS. To look in HDFS you will need the table ID, which can be found with the following shell command.
$ accumulo shell -u username -p password -e 'tables -l' accumulo.metadata => !0 accumulo.root => +r bloom_test1 => 2 bloom_test2 => 3 trace => 1
So the table id for bloom_test2 is 3. The command below shows what files this table has in HDFS. This assumes Accumulo is at the default location in HDFS.
$ hdfs dfs -ls -R /accumulo/tables/3 drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/3/default_tablet -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dj.rf -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dk.rf -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/3/default_tablet/F00000dl.rf
Running the rfile-info command shows that one of the files has a bloom filter and its 1.5MB.
$ accumulo rfile-info /accumulo/tables/3/default_tablet/F00000dj.rf Locality group : <DEFAULT> Start block : 0 Num blocks : 752 Index level 0 : 43,598 bytes 1 blocks First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false Num entries : 999,536 Column families : [foo] Meta block : BCFile.index Raw size : 4 bytes Compressed size : 12 bytes Compression type : gz Meta block : RFile.index Raw size : 43,696 bytes Compressed size : 15,592 bytes Compression type : gz Meta block : acu_bloom Raw size : 1,540,292 bytes Compressed size : 1,433,115 bytes Compression type : gz
Run the example below to create 2 tables, one with bloom filters enabled.
$ ./bin/runex bloom.BloomFiltersNotFound
One million random values initialized with seed 7 are inserted into each table.
Once the flush completes, 500 random queries are done against each table but with a different seed. Even when nothing is found the lookups are faster against the table with the bloom filters.
Writing data to bloom_test3 and bloom_test4 (bloom filters enabled) Scanning bloom_test3 with seed 8 Scan finished! 780.03 lookups/sec, 0.64 secs, 0 results Did not find 500 Scanning bloom_test4 with seed 8 Scan finished! 1736.11 lookups/sec, 0.29 secs, 0 results Did not find 500