blob: 555f06d1334279f4da916dbc79014cf04f7d42c4 [file] [log] [blame]
Title: Apache Accumulo Bloom Filter Example
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.
http://www.apache.org/licenses/LICENSE-2.0
.
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
This example shows how to create a table with bloom filters enabled. It also
shows how bloom filters increase query performance when looking for values that
do not exist in a table.
Below table named bloom_test is created and bloom filters are enabled.
$ ./bin/accumulo shell -u username -p password
Shell - Apache Accumulo Interactive Shell
- version: 1.5.0
- instance name: instance
- instance id: 00000000-0000-0000-0000-000000000000
-
- type 'help' for a list of available commands
-
username@instance> setauths -u username -s exampleVis
username@instance> createtable bloom_test
username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
username@instance bloom_test> exit
Below 1 million random values are inserted into accumulo. The randomly
generated rows range between 0 and 1 billion. The random number generator is
initialized with the seed 7.
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis
Below the table is flushed:
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w'
05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.
After the flush completes, 500 random queries are done against the table. The
same seed is used to generate the queries, therefore everything is found in the
table.
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
Generating 500 random queries...finished
96.19 lookups/sec 5.20 secs
num results : 500
Generating 500 random queries...finished
102.35 lookups/sec 4.89 secs
num results : 500
Below another 500 queries are performed, using a different seed which results
in nothing being found. In this case the lookups are much faster because of
the bloom filters.
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 -batchThreads 20 -auths exampleVis
Generating 500 random queries...finished
2212.39 lookups/sec 0.23 secs
num results : 0
Did not find 500 rows
Generating 500 random queries...finished
4464.29 lookups/sec 0.11 secs
num results : 0
Did not find 500 rows
********************************************************************************
Bloom filters can also speed up lookups for entries that exist. In accumulo
data is divided into tablets and each tablet has multiple map files. Every
lookup in accumulo goes to a specific tablet where a lookup is done on each
map file in the tablet. So if a tablet has three map files, lookup performance
can be three times slower than a tablet with one map file. However if the map
files contain unique sets of data, then bloom filters can help eliminate map
files that do not contain the row being looked up. To illustrate this two
identical tables were created using the following process. One table had bloom
filters, the other did not. Also the major compaction ratio was increased to
prevent the files from being compacted into one file.
* Insert 1 million entries using RandomBatchWriter with a seed of 7
* Flush the table using the shell
* Insert 1 million entries using RandomBatchWriter with a seed of 8
* Flush the table using the shell
* Insert 1 million entries using RandomBatchWriter with a seed of 9
* Flush the table using the shell
After following the above steps, each table will have a tablet with three map
files. Flushing the table after each batch of inserts will create a map file.
Each map file will contain 1 million entries generated with a different seed.
This is assuming that Accumulo is configured with enough memory to hold 1
million inserts. If not, then more map files will be created.
The commands for creating the first table without bloom filters are below.
$ ./bin/accumulo shell -u username -p password
Shell - Apache Accumulo Interactive Shell
- version: 1.5.0
- instance name: instance
- instance id: 00000000-0000-0000-0000-000000000000
-
- type 'help' for a list of available commands
-
username@instance> setauths -u username -s exampleVis
username@instance> createtable bloom_test1
username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
username@instance bloom_test1> exit
$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
The commands for creating the second table with bloom filers are below.
$ ./bin/accumulo shell -u username -p password
Shell - Apache Accumulo Interactive Shell
- version: 1.5.0
- instance name: instance
- instance id: 00000000-0000-0000-0000-000000000000
-
- type 'help' for a list of available commands
-
username@instance> setauths -u username -s exampleVis
username@instance> createtable bloom_test2
username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
username@instance bloom_test2> exit
$ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
$ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
Below 500 lookups are done against the table without bloom filters using random
NG seed 7. Even though only one map file will likely contain entries for this
seed, all map files will be interrogated.
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
Generating 500 random queries...finished
35.09 lookups/sec 14.25 secs
num results : 500
Generating 500 random queries...finished
35.33 lookups/sec 14.15 secs
num results : 500
Below the same lookups are done against the table with bloom filters. The
lookups were 2.86 times faster because only one map file was used, even though three
map files existed.
$ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 -scanThreads 20 --auths exampleVis
Generating 500 random queries...finished
99.03 lookups/sec 5.05 secs
num results : 500
Generating 500 random queries...finished
101.15 lookups/sec 4.94 secs
num results : 500
You can verify the table has three files by looking in HDFS. To look in HDFS
you will need the table ID, because this is used in HDFS instead of the table
name. The following command will show table ids.
$ ./bin/accumulo shell -u username -p password -e 'tables -l'
accumulo.metadata => !0
accumulo.root => +r
bloom_test1 => o7
bloom_test2 => o8
trace => 1
So the table id for bloom_test2 is o8. The command below shows what files this
table has in HDFS. This assumes Accumulo is at the default location in HDFS.
$ hadoop fs -lsr /accumulo/tables/o8
drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
-rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
-rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
-rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
Running the rfile-info command shows that one of the files has a bloom filter
and its 1.5MB.
$ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
Locality group : <DEFAULT>
Start block : 0
Num blocks : 752
Index level 0 : 43,598 bytes 1 blocks
First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false
Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false
Num entries : 999,536
Column families : [foo]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 43,696 bytes
Compressed size : 15,592 bytes
Compression type : gz
Meta block : acu_bloom
Raw size : 1,540,292 bytes
Compressed size : 1,433,115 bytes
Compression type : gz