<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Accumulo Bloom Filter Example
This example shows how to create a table with bloom filters enabled. The second part
shows how bloom filters increase query performance when looking for values that
do not exist in a table.
## Bloom Filters Enabled
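Accumulo data is divided into tablets and each tablet has multiple r-files.
A lookup against a tablet with 3 r-files can be up to 3 times slower than one against
a tablet with one r-file, because every file may have to be checked. However, if the
files contain unique sets of data, bloom filters can help with performance: a file
whose bloom filter reports that a row is absent can be skipped without being read.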
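
To make the idea concrete, here is a minimal sketch of a bloom filter kept per file. It is a toy written for illustration, not Accumulo's implementation: the class name `BloomSketch`, the hash scheme, the filter size, and the row names are all assumptions.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy bloom filter for illustration only; not Accumulo's implementation. */
final class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = ((h1 * 0x9E3779B9) >>> 16) | 1; // odd, so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false; // definitely absent: this file can be skipped
            }
        }
        return true; // present, or a false positive
    }

    /** Counts the files a lookup would still have to read after consulting each filter. */
    static int filesToOpen(List<BloomSketch> fileFilters, String row) {
        int count = 0;
        for (BloomSketch filter : fileFilters) {
            if (filter.mightContain(row)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three hypothetical r-files, each holding a disjoint set of 1000 rows.
        List<BloomSketch> files = new ArrayList<>();
        for (int f = 0; f < 3; f++) {
            BloomSketch filter = new BloomSketch(1 << 16, 5);
            for (int r = 0; r < 1000; r++) {
                filter.add("row_" + (f * 1000 + r));
            }
            files.add(filter);
        }
        System.out.println("files to open for row_1500: " + filesToOpen(files, "row_1500"));
        System.out.println("files to open for row_missing: " + filesToOpen(files, "row_missing"));
    }
}
```

A bloom filter never reports a stored key as absent, so the file that really holds `row_1500` is always counted. Other files are counted only when the filter gives a false positive, which is why lookups for missing rows also benefit.
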
Run the example below to create two identical tables. One table has bloom
filters enabled, the other does not. The example increases the major compaction ratio to
prevent the files from being compacted into one file. If Accumulo is not configured
with enough memory to hold 1 million rows, more r-files will be created.

    $ ./bin/runex bloom.BloomFilters

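
For readers who want to apply similar settings by hand, the sketch below prints Accumulo shell `config` commands for the two table properties involved, `table.bloom.enabled` and `table.compaction.major.ratio`. The helper class `BloomTableConfig` is hypothetical, and the ratio value of 7 is an assumption chosen for illustration; check the example source for the value it actually uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that renders table properties as Accumulo shell commands. */
final class BloomTableConfig {

    /** Builds one "config -t <table> -s <property>=<value>" line per property. */
    static List<String> configCommands(String table, Map<String, String> props) {
        List<String> commands = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            commands.add("config -t " + table + " -s " + e.getKey() + "=" + e.getValue());
        }
        return commands;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("table.bloom.enabled", "true");         // turn bloom filters on
        props.put("table.compaction.major.ratio", "7");   // assumed value; keeps files separate
        for (String command : configCommands("examples.bloom_test2", props)) {
            System.out.println(command);
        }
    }
}
```

The printed lines are meant to be pasted into the Accumulo shell. The example itself sets up its tables programmatically, so this is only needed when experimenting with other tables.
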
Run the example below to perform 500 lookups against each table. Without bloom filters,
every r-file is interrogated even though only one is likely to contain entries for a given
query. With bloom filters, files that cannot contain the row are skipped.

    $ ./bin/runex bloom.BloomBatchScanner
    Scanning examples.bloom_test1 with seed 7
    Scan finished! 282.49 lookups/sec, 1.77 secs, 500 results
    All expected rows were scanned
    Scanning examples.bloom_test2 with seed 7
    Scan finished! 704.23 lookups/sec, 0.71 secs, 500 results
    All expected rows were scanned

You can verify the table has three or more r-files by looking in HDFS. To look in HDFS
you will need the table ID, which can be found with the following shell command.

    $ accumulo shell -u username -p password -e 'tables -l'
    accumulo.metadata => !0
    accumulo.root => +r
    examples.bloom_test1 => 2
    examples.bloom_test2 => 3
    trace => 1

So the table ID for examples.bloom_test2 is 3. The command below shows what files this
table has in HDFS. This assumes Accumulo is at the default location in HDFS.

    $ hdfs dfs -ls -R /accumulo/tables/3
    drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/3/default_tablet
    -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dj.rf
    -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dk.rf
    -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/3/default_tablet/F00000dl.rf

Running the rfile-info command on one of the files shows that it has a bloom filter
(the acu_bloom meta block) of about 1.5 MB.

    $ accumulo rfile-info /accumulo/tables/3/default_tablet/F00000dj.rf
    Locality group : <DEFAULT>
      Start block : 0
      Num blocks : 752
      Index level 0 : 43,598 bytes 1 blocks
      First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false
      Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false
      Num entries : 999,536
      Column families : [foo]
    Meta block : BCFile.index
      Raw size : 4 bytes
      Compressed size : 12 bytes
      Compression type : gz
    Meta block : RFile.index
      Raw size : 43,696 bytes
      Compressed size : 15,592 bytes
      Compression type : gz
    Meta block : acu_bloom
      Raw size : 1,540,292 bytes
      Compressed size : 1,433,115 bytes
      Compression type : gz

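
The sizes reported above give a rough feel for how much filter space each entry gets. The sketch below divides the raw size of the `acu_bloom` block by the number of entries, then applies the textbook estimate for a bloom filter that uses the optimal number of hash functions, p = 2^(-(m/n) ln 2). This is a back-of-the-envelope calculation under stated assumptions: it treats the whole block as filter bits, assumes one filter key per entry, and says nothing about how Accumulo actually sizes its filters. The class name `BloomSizeEstimate` is hypothetical.

```java
/** Back-of-the-envelope bloom filter sizing; an idealized model, not Accumulo's sizing logic. */
final class BloomSizeEstimate {

    /** Filter bits available per stored entry. */
    static double bitsPerEntry(long filterBytes, long entries) {
        return filterBytes * 8.0 / entries;
    }

    /** Idealized false-positive rate with the optimal hash count: p = 2^(-(m/n) ln 2). */
    static double idealFalsePositiveRate(double bitsPerEntry) {
        return Math.pow(2.0, -bitsPerEntry * Math.log(2.0));
    }

    public static void main(String[] args) {
        // Raw size of the acu_bloom block and Num entries, taken from the rfile-info output.
        double bits = bitsPerEntry(1_540_292L, 999_536L);
        System.out.printf("bits per entry: %.2f%n", bits);
        System.out.printf("idealized false-positive rate: %.4f%n", idealFalsePositiveRate(bits));
    }
}
```

With roughly 12 bits per entry, the idealized model predicts that well under 1% of lookups for absent rows would still have to read the file.
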
## Bloom Filters when data is not found
Run the example below to create 2 tables, one with bloom filters enabled.

    $ ./bin/runex bloom.BloomFiltersNotFound

One million random values initialized with seed 7 are inserted into each table.
Once the flush completes, 500 random queries are run against each table using a different seed,
so the queried values are not expected to be in the table.
Even when nothing is found, the lookups are faster against the table with bloom filters.

    Writing data to examples.bloom_test3 and examples.bloom_test4 (bloom filters enabled)
    Scanning examples.bloom_test3 with seed 8
    Scan finished! 780.03 lookups/sec, 0.64 secs, 0 results
    Did not find 500
    Scanning examples.bloom_test4 with seed 8
    Scan finished! 1736.11 lookups/sec, 0.29 secs, 0 results
    Did not find 500

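
To compare the runs in this document, the reported lookup rates can be turned into speedup factors. The sketch below does that arithmetic for both experiments, using only the numbers printed in the sample output. The class name `LookupRates` is hypothetical, and the figures come from a single sample run, so they should be read as illustrative rather than as a benchmark.

```java
/** Arithmetic on the sample output; figures are from one run and are illustrative only. */
final class LookupRates {

    /** Ratio of the lookup rate with bloom filters to the rate without them. */
    static double speedup(double withBloom, double withoutBloom) {
        return withBloom / withoutBloom;
    }

    public static void main(String[] args) {
        // Rows present: examples.bloom_test2 (704.23/sec) vs examples.bloom_test1 (282.49/sec).
        System.out.printf("rows found:     %.2fx%n", speedup(704.23, 282.49));
        // Rows absent: examples.bloom_test4 (1736.11/sec) vs examples.bloom_test3 (780.03/sec).
        System.out.printf("rows not found: %.2fx%n", speedup(1736.11, 780.03));
    }
}
```

In this sample output the tables with bloom filters are a little over twice as fast in both cases. The exact factor will vary with hardware, memory configuration, and the number of r-files.
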