| Title: Apache Accumulo Bloom Filter Example |
| Notice: Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| . |
| http://www.apache.org/licenses/LICENSE-2.0 |
| . |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| This example shows how to create a table with bloom filters enabled. It also |
| shows how bloom filters increase query performance when looking for values that |
| do not exist in a table. |
| |
| Below, a table named bloom_test is created and bloom filters are enabled. |
| |
| $ ./bin/accumulo shell -u username -p password |
| Shell - Apache Accumulo Interactive Shell |
| - version: 1.5.0 |
| - instance name: instance |
| - instance id: 00000000-0000-0000-0000-000000000000 |
| - |
| - type 'help' for a list of available commands |
| - |
| username@instance> setauths -u username -s exampleVis |
| username@instance> createtable bloom_test |
| username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true |
| username@instance bloom_test> exit |
| |
| Below, 1 million random values are inserted into Accumulo. The randomly |
| generated rows range between 0 and 1 billion. The random number generator is |
| initialized with the seed 7. |
| |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis |
| |
| Below, the table is flushed: |
| |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w' |
| 05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed. |
| |
| After the flush completes, 500 random queries are run against the table. The |
| same seed is used to generate the queries, so every row queried is found in the |
| table. |
| |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis |
| Generating 500 random queries...finished |
| 96.19 lookups/sec 5.20 secs |
| num results : 500 |
| Generating 500 random queries...finished |
| 102.35 lookups/sec 4.89 secs |
| num results : 500 |
| |
| Below, another 500 queries are performed using a different seed, which results |
| in nothing being found. In this case the lookups are much faster because the |
| bloom filters let Accumulo skip files that cannot contain the row. |
| |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis |
| Generating 500 random queries...finished |
| 2212.39 lookups/sec 0.23 secs |
| num results : 0 |
| Did not find 500 rows |
| Generating 500 random queries...finished |
| 4464.29 lookups/sec 0.11 secs |
| num results : 0 |
| Did not find 500 rows |
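| |
| The reason a miss is cheap can be sketched with a toy bloom filter. This is |
| not Accumulo's implementation: the filter size, the use of cksum as the hash, |
| and the function names are assumptions made for illustration only. The point |
| it shows is that a filter can answer "definitely absent" without reading any |
| data, while "maybe present" still requires a real lookup. |
| |
```shell
#!/bin/sh
# Toy bloom filter: set bit positions are kept in a space-delimited string.
BLOOM_SIZE=1024   # number of bit positions (hypothetical, far too small for real use)
BLOOM_BITS=" "    # positions that are set, e.g. " 17 530 901 "

# Print three bit positions for a key, one per line, using cksum as the hash.
bloom_positions() {
  for salt in 1 2 3; do
    sum=$(printf '%s:%s' "$salt" "$1" | cksum | cut -d ' ' -f 1)
    echo $((sum % BLOOM_SIZE))
  done
}

# Record a key by setting its three positions.
bloom_add() {
  for pos in $(bloom_positions "$1"); do
    case "$BLOOM_BITS" in
      *" $pos "*) ;;                        # already set
      *) BLOOM_BITS="$BLOOM_BITS$pos " ;;
    esac
  done
}

# Succeed if the key may be present; fail if it is definitely absent.
bloom_might_contain() {
  for pos in $(bloom_positions "$1"); do
    case "$BLOOM_BITS" in
      *" $pos "*) ;;
      *) return 1 ;;                        # one unset bit proves absence
    esac
  done
  return 0
}

bloom_add row_0000001169
bloom_might_contain row_0000001169 && echo "maybe present: the file must be read"
```
| |
| A key that was added always passes the check. A key that was never added |
| usually fails it, but can occasionally pass (a false positive), in which case |
| the file is read and nothing is found. |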
| |
| ******************************************************************************** |
| |
| Bloom filters can also speed up lookups for entries that exist. In Accumulo, |
| data is divided into tablets, and each tablet can have multiple map files. Every |
| lookup goes to a specific tablet, where a lookup is done on each map file in |
| the tablet. So if a tablet has three map files, lookups can be up to three |
| times slower than on a tablet with one map file. However, if the map files |
| contain disjoint sets of data, bloom filters can eliminate the map files |
| that do not contain the row being looked up. To illustrate this, two identical |
| tables were created using the following process. One table had bloom filters, |
| the other did not. The major compaction ratio was also increased to prevent |
| the files from being compacted into one file. |
| |
| * Insert 1 million entries using RandomBatchWriter with a seed of 7 |
| * Flush the table using the shell |
| * Insert 1 million entries using RandomBatchWriter with a seed of 8 |
| * Flush the table using the shell |
| * Insert 1 million entries using RandomBatchWriter with a seed of 9 |
| * Flush the table using the shell |
| |
| After following the above steps, each table will have a tablet with three map |
| files. Flushing the table after each batch of inserts creates a map file, and |
| each map file will contain 1 million entries generated with a different seed. |
| This assumes that Accumulo is configured with enough memory to hold 1 million |
| inserts. If not, more map files will be created. |
| |
| The commands for creating the first table without bloom filters are below. |
| |
| $ ./bin/accumulo shell -u username -p password |
| Shell - Apache Accumulo Interactive Shell |
| - version: 1.5.0 |
| - instance name: instance |
| - instance id: 00000000-0000-0000-0000-000000000000 |
| - |
| - type 'help' for a list of available commands |
| - |
| username@instance> setauths -u username -s exampleVis |
| username@instance> createtable bloom_test1 |
| username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7 |
| username@instance bloom_test1> exit |
| |
| $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis" |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w' |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w' |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w' |
| |
| The commands for creating the second table with bloom filters are below. |
| |
| $ ./bin/accumulo shell -u username -p password |
| Shell - Apache Accumulo Interactive Shell |
| - version: 1.5.0 |
| - instance name: instance |
| - instance id: 00000000-0000-0000-0000-000000000000 |
| - |
| - type 'help' for a list of available commands |
| - |
| username@instance> setauths -u username -s exampleVis |
| username@instance> createtable bloom_test2 |
| username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7 |
| username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true |
| username@instance bloom_test2> exit |
| |
| $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis" |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w' |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w' |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS |
| $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w' |
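| |
| The write-then-flush sequence is the same for both tables apart from the table |
| name, so it can be condensed into a loop. The sketch below is one way to do |
| that; the load_table and run helpers and the RUN variable are hypothetical |
| names, and the connection values are the placeholders used above. By default |
| it only prints the commands, so it can be checked without a running instance. |
| |
```shell
#!/bin/sh
# Hypothetical helper: echo a command, or execute it when RUN=1.
# Note that echo drops the quotes around the shell's -e argument.
run() {
  if [ "${RUN:-0}" = 1 ]; then "$@"; else echo "$@"; fi
}

# Hypothetical helper: insert with seeds 7, 8 and 9, flushing after each batch
# so that every batch ends up in its own map file.
load_table() {
  table=$1
  args="-i instance -z zookeepers -u username -p password -t $table --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
  for seed in 7 8 9; do
    # $args is left unquoted on purpose so it splits into separate options.
    run ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed "$seed" $args
    run ./bin/accumulo shell -u username -p password -e "flush -t $table -w"
  done
}
```
| |
| Calling load_table bloom_test1 prints the six commands for the first table; |
| with RUN=1 set in the environment they are executed instead. |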
| |
| Below, 500 lookups are done against the table without bloom filters using random |
| number generator seed 7. Even though only one map file will likely contain |
| entries for this seed, all map files will be interrogated. |
| |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis |
| Generating 500 random queries...finished |
| 35.09 lookups/sec 14.25 secs |
| num results : 500 |
| Generating 500 random queries...finished |
| 35.33 lookups/sec 14.15 secs |
| num results : 500 |
| |
| Below, the same lookups are done against the table with bloom filters. The |
| lookups were up to 2.86 times faster because only one map file was used, even |
| though three map files existed. |
| |
| $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis |
| Generating 500 random queries...finished |
| 99.03 lookups/sec 5.05 secs |
| num results : 500 |
| Generating 500 random queries...finished |
| 101.15 lookups/sec 4.94 secs |
| num results : 500 |
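| |
| The speedup can be rechecked from the elapsed times printed by the two runs. |
| The speedup helper below is a hypothetical name; it only divides the slower |
| time by the faster one. |
| |
```shell
#!/bin/sh
# Ratio of two elapsed times (slow / fast), rounded to two decimals.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

speedup 14.25 5.05   # first pass, without vs. with bloom filters: prints 2.82
speedup 14.15 4.94   # second pass: prints 2.86
```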
| |
| You can verify the table has three files by looking in HDFS. To look in HDFS |
| you will need the table ID, because HDFS paths use the table ID instead of the |
| table name. The following command shows the table IDs. |
| |
| $ ./bin/accumulo shell -u username -p password -e 'tables -l' |
| accumulo.metadata => !0 |
| accumulo.root => +r |
| bloom_test1 => o7 |
| bloom_test2 => o8 |
| trace => 1 |
| |
| So the table ID for bloom_test2 is o8. The command below shows what files this |
| table has in HDFS. This assumes Accumulo is at the default location in HDFS. |
| |
| $ hadoop fs -lsr /accumulo/tables/o8 |
| drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet |
| -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf |
| -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf |
| -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf |
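| |
| The two manual steps above (find the table ID, then count the files) can be |
| scripted. The sketch below assumes the output shapes shown above, namely |
| "name => id" lines and listing lines that end in the file path; table_id and |
| count_rfiles are hypothetical helper names. |
| |
```shell
#!/bin/sh
# Print the ID of table "$1" from "tables -l" output read on stdin.
table_id() {
  awk -v name="$1" '$1 == name && $2 == "=>" { print $3 }'
}

# Count the RFiles (paths ending in .rf) in a file listing read on stdin.
count_rfiles() {
  grep -c '\.rf$'
}

# Usage against a live instance (commented out because it needs a cluster):
# id=$(./bin/accumulo shell -u username -p password -e 'tables -l' | table_id bloom_test2)
# hadoop fs -lsr "/accumulo/tables/$id" | count_rfiles
```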
| |
| Running the rfile-info command shows that one of the files has a bloom filter |
| and that it is about 1.5MB (the acu_bloom meta block). |
| |
| $ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf |
| Locality group : <DEFAULT> |
| Start block : 0 |
| Num blocks : 752 |
| Index level 0 : 43,598 bytes 1 blocks |
| First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false |
| Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false |
| Num entries : 999,536 |
| Column families : [foo] |
| |
| Meta block : BCFile.index |
| Raw size : 4 bytes |
| Compressed size : 12 bytes |
| Compression type : gz |
| |
| Meta block : RFile.index |
| Raw size : 43,696 bytes |
| Compressed size : 15,592 bytes |
| Compression type : gz |
| |
| Meta block : acu_bloom |
| Raw size : 1,540,292 bytes |
| Compressed size : 1,433,115 bytes |
| Compression type : gz |
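| |
| The numbers in this output give a rough feel for the cost of the filter. The |
| sketch below divides the raw acu_bloom size by the entry count and compares |
| the compressed size to the raw size. It assumes, for illustration, that each |
| entry corresponds to one filter key; the helper names are hypothetical. |
| |
```shell
#!/bin/sh
# Bloom filter bits per entry: raw bytes * 8 / number of entries.
bits_per_entry() {
  awk -v bytes="$1" -v entries="$2" 'BEGIN { printf "%.2f\n", bytes * 8 / entries }'
}

# Compressed size as a percentage of the raw size.
compressed_pct() {
  awk -v raw="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", 100 * comp / raw }'
}

bits_per_entry 1540292 999536    # prints 12.33
compressed_pct 1540292 1433115   # prints 93.0
```
| |
| The filter barely shrinks under gz, which is expected for data that looks |
| close to random bits. |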
| |