| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| # Apache Accumulo Bloom Filter Example |
| |
| This example shows how to create a table with bloom filters enabled. The second part |
| shows how bloom filters increase query performance when looking for values that |
| do not exist in a table. |
| |
| ## Bloom Filters Enabled |
| |
| Accumulo data is divided into tablets, and each tablet can have multiple r-files. |
| A lookup against a tablet with three r-files can be up to three times slower than |
| a lookup against a tablet with one r-file, because each file may need to be checked. |
| However, if the files contain unique sets of data, bloom filters can improve |
| performance by letting a lookup skip files that cannot contain the requested key. |
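
To make the idea concrete, here is a minimal Python sketch of per-file bloom filters. It is only a toy model, not Accumulo's implementation: the `BloomFilter` class, the `build_file` and `lookup` helpers, the filter size, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative bloom filter; size and hashing are arbitrary choices."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def build_file(rows):
    """Model an r-file as a set of rows plus a bloom filter over those rows."""
    bloom = BloomFilter()
    for row in rows:
        bloom.add(row)
    return {"rows": set(rows), "bloom": bloom}


def lookup(files, row, use_bloom):
    """Return (found, number_of_files_actually_read)."""
    files_read = 0
    found = False
    for f in files:
        if use_bloom and not f["bloom"].might_contain(row):
            continue  # the filter rules this file out, so skip reading it
        files_read += 1
        if row in f["rows"]:
            found = True
    return found, files_read


# Three hypothetical files holding disjoint sets of rows.
files = [
    build_file([f"row_{i:010d}" for i in range(start, start + 50)])
    for start in (0, 1000, 2000)
]

print(lookup(files, "row_0000001005", use_bloom=False))
print(lookup(files, "row_0000001005", use_bloom=True))
```

Without the filter, every file is read on every lookup. With it, a file is read only when its filter reports a possible match, so a lookup usually touches one file, or none when the row does not exist. A bloom filter can return false positives but never false negatives, so an existing row is never missed.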
| |
| Run the example below to create two identical tables. One table has bloom |
| filters enabled; the other does not. The major compaction ratio is increased to |
| prevent the files from being compacted into one file. If Accumulo is not configured |
| with enough memory to hold 1 million rows, more r-files will be created. |
| |
| $ ./bin/runex bloom.BloomFilters |
| |
| Run the example below to perform 500 lookups against each table. Even though only one r-file is |
| likely to contain entries for a given query, every file is interrogated when bloom filters are not enabled. |
| |
| $ ./bin/runex bloom.BloomBatchScanner |
| |
| Scanning examples.bloom_test1 with seed 7 |
| Scan finished! 282.49 lookups/sec, 1.77 secs, 500 results |
| All expected rows were scanned |
| Scanning examples.bloom_test2 with seed 7 |
| Scan finished! 704.23 lookups/sec, 0.71 secs, 500 results |
| All expected rows were scanned |
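
The figures in this output can be checked with a few lines of arithmetic. The sketch below recomputes the reported throughput from the lookup count and elapsed time, then derives the relative speedup of the second table over the first. The helper names are hypothetical, and the numbers come from this sample run only; results on another cluster will differ.

```python
def lookups_per_sec(lookups, seconds):
    """Throughput as reported in the scan output: lookups divided by elapsed time."""
    return lookups / seconds


def speedup(baseline_rate, faster_rate):
    """How many times faster the second table answered than the first."""
    return faster_rate / baseline_rate


# Figures copied from the sample output above.
test1_rate = lookups_per_sec(500, 1.77)
test2_rate = lookups_per_sec(500, 0.71)

print(f"bloom_test1: {test1_rate:.2f} lookups/sec")
print(f"bloom_test2: {test2_rate:.2f} lookups/sec")
print(f"speedup:     {speedup(282.49, 704.23):.2f}x")
```

In this run the second table answered the same 500 lookups roughly 2.5 times faster than the first.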
| |
| You can verify the table has three or more r-files by looking in HDFS. To look in HDFS |
| you will need the table ID, which can be found with the following shell command. |
| |
| $ accumulo shell -u username -p password -e 'tables -l' |
| accumulo.metadata => !0 |
| accumulo.root => +r |
| examples.bloom_test1 => 2 |
| examples.bloom_test2 => 3 |
| trace => 1 |
| |
| So the table ID for examples.bloom_test2 is 3. The command below shows what files this |
| table has in HDFS. This assumes Accumulo is at the default location in HDFS. |
| |
| $ hdfs dfs -ls -R /accumulo/tables/3 |
| drwxr-xr-x - username supergroup 0 2012-01-10 14:02 /accumulo/tables/3/default_tablet |
| -rw-r--r-- 3 username supergroup 52672650 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dj.rf |
| -rw-r--r-- 3 username supergroup 52436176 2012-01-10 14:01 /accumulo/tables/3/default_tablet/F00000dk.rf |
| -rw-r--r-- 3 username supergroup 52850173 2012-01-10 14:02 /accumulo/tables/3/default_tablet/F00000dl.rf |
| |
| Running the rfile-info command shows that one of the files has a bloom filter |
| and that the filter is about 1.5 MB in size. |
| |
| $ accumulo rfile-info /accumulo/tables/3/default_tablet/F00000dj.rf |
| Locality group : <DEFAULT> |
| Start block : 0 |
| Num blocks : 752 |
| Index level 0 : 43,598 bytes 1 blocks |
| First key : row_0000001169 foo:1 [exampleVis] 1326222052539 false |
| Last key : row_0999999421 foo:1 [exampleVis] 1326222052058 false |
| Num entries : 999,536 |
| Column families : [foo] |
| |
| Meta block : BCFile.index |
| Raw size : 4 bytes |
| Compressed size : 12 bytes |
| Compression type : gz |
| |
| Meta block : RFile.index |
| Raw size : 43,696 bytes |
| Compressed size : 15,592 bytes |
| Compression type : gz |
| |
| Meta block : acu_bloom |
| Raw size : 1,540,292 bytes |
| Compressed size : 1,433,115 bytes |
| Compression type : gz |
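
The `acu_bloom` sizes above can be put in perspective with some quick arithmetic. The sketch below converts the raw size to megabytes, estimates the filter bits spent per entry, and computes how much gz compression saved. It assumes one filter key per entry, which appears to hold for this sample data but is not guaranteed in general. The helper name is hypothetical.

```python
def bits_per_entry(filter_bytes, num_entries):
    """Average number of filter bits per entry (assumes one filter key per entry)."""
    return filter_bytes * 8 / num_entries


# Values copied from the rfile-info output above.
BLOOM_RAW_BYTES = 1_540_292
BLOOM_COMPRESSED_BYTES = 1_433_115
NUM_ENTRIES = 999_536

size_mb = BLOOM_RAW_BYTES / 1_000_000
bpe = bits_per_entry(BLOOM_RAW_BYTES, NUM_ENTRIES)
compression_ratio = BLOOM_COMPRESSED_BYTES / BLOOM_RAW_BYTES

print(f"raw size:          {size_mb:.2f} MB")
print(f"bits per entry:    {bpe:.2f}")
print(f"compressed / raw:  {compression_ratio:.2f}")
```

The filter costs about 12 bits per entry and compresses by only about 7%. That is expected, since a well-populated bloom filter looks close to random data.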
| |
| ## Bloom Filters When Data Is Not Found |
| |
| Run the example below to create two tables, one of which has bloom filters enabled. |
| |
| $ ./bin/runex bloom.BloomFiltersNotFound |
| |
| One million random values initialized with seed 7 are inserted into each table. |
| Once the flush completes, 500 random queries are run against each table using a different seed, |
| so the queried values are not expected to exist. Even when nothing is found, the lookups are |
| faster against the table with bloom filters. |
| |
| Writing data to examples.bloom_test3 and examples.bloom_test4 (bloom filters enabled) |
| Scanning examples.bloom_test3 with seed 8 |
| Scan finished! 780.03 lookups/sec, 0.64 secs, 0 results |
| Did not find 500 |
| Scanning examples.bloom_test4 with seed 8 |
| Scan finished! 1736.11 lookups/sec, 0.29 secs, 0 results |
| Did not find 500 |