 Title: Apache Accumulo Bloom Filter Example
 Notice:    Licensed to the Apache Software Foundation (ASF) under one
            or more contributor license agreements.  See the NOTICE file
            distributed with this work for additional information
            regarding copyright ownership.  The ASF licenses this file
            to you under the Apache License, Version 2.0 (the
            "License"); you may not use this file except in compliance
            with the License.  You may obtain a copy of the License at
            .
              http://www.apache.org/licenses/LICENSE-2.0
            .
            Unless required by applicable law or agreed to in writing,
            software distributed under the License is distributed on an
            "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
            KIND, either express or implied.  See the License for the
            specific language governing permissions and limitations
            under the License.

 This example shows how to create a table with bloom filters enabled.  It also
 shows how bloom filters increase query performance when looking for values that
 do not exist in a table.
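
 As a rough illustration of why lookups for missing rows can be cheap, the toy
 sketch below builds a tiny bloom filter with awk. It is not Accumulo's
 implementation: the function name bloom_check, the 64-bit filter size, and the
 two hash formulas are assumptions made only for this example. A filter can
 answer "definitely absent" without reading any data, while "maybe present"
 still requires a real lookup.

```shell
# Toy bloom filter (illustrative assumptions only, not Accumulo's code).
# $1: space-separated integer keys to insert, $2: integer key to look up.
bloom_check() {
  awk -v inserted="$1" -v query="$2" 'BEGIN {
    m = 64                                  # number of bits in the filter
    n = split(inserted, keys, " ")
    for (i = 1; i <= n; i++) {
      bits[(keys[i] * 7 + 3) % m] = 1       # first hash sets one bit
      bits[(keys[i] * 13 + 5) % m] = 1      # second hash sets another bit
    }
    h1 = (query * 7 + 3) % m
    h2 = (query * 13 + 5) % m
    if ((h1 in bits) && (h2 in bits)) {
      print "maybe present"                 # all bits set: must check the data
    } else {
      print "definitely absent"             # a bit is clear: skip the data
    }
  }'
}

bloom_check "1169 2500 9421" 1169   # an inserted key is never reported absent
bloom_check "1169 2500 9421" 4242   # this key was not inserted
```

 An inserted key always maps to bits that are set, so the filter never produces
 a false negative. False positives are possible and become more likely as the
 filter fills up.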

 Below, a table named bloom_test is created and bloom filters are enabled on it.

     $ ./bin/accumulo shell -u username -p password
     Shell - Apache Accumulo Interactive Shell
     - version: 1.5.0
     - instance name: instance
     - instance id: 00000000-0000-0000-0000-000000000000
     -
     - type 'help' for a list of available commands
     -
     username@instance> setauths -u username -s exampleVis
     username@instance> createtable bloom_test
     username@instance bloom_test> config -t bloom_test -s table.bloom.enabled=true
     username@instance bloom_test> exit

 Below, 1 million random values are inserted into Accumulo. The randomly
 generated rows range between 0 and 1 billion. The random number generator is
 initialized with the seed 7.

     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis

 Below the table is flushed:

     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test -w'
     05 10:40:06,069 [shell.Shell] INFO : Flush of table bloom_test completed.

 After the flush completes, 500 random queries are done against the table. The
 same seed is used to generate the queries, therefore everything is found in the
 table.

     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
     Generating 500 random queries...finished
     96.19 lookups/sec   5.20 secs
     num results : 500
     Generating 500 random queries...finished
     102.35 lookups/sec   4.89 secs
     num results : 500

 Below, another 500 queries are performed using a different seed, which results
 in nothing being found. In this case the lookups are much faster because of
 the bloom filters.

     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 8 -i instance -z zookeepers -u username -p password -t bloom_test --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
     Generating 500 random queries...finished
     2212.39 lookups/sec   0.23 secs
     num results : 0
     Did not find 500 rows
     Generating 500 random queries...finished
     4464.29 lookups/sec   0.11 secs
     num results : 0
     Did not find 500 rows

 ********************************************************************************

 Bloom filters can also speed up lookups for entries that exist. In Accumulo,
 data is divided into tablets and each tablet has multiple map files. Every
 lookup in Accumulo goes to a specific tablet, where a lookup is done on each
 map file in the tablet. So if a tablet has three map files, lookups can be
 three times slower than for a tablet with one map file. However, if the map
 files contain unique sets of data, then bloom filters can help eliminate map
 files that do not contain the row being looked up. To illustrate this, two
 identical tables were created using the following process. One table had bloom
 filters, the other did not. Also, the major compaction ratio was increased to
 prevent the files from being compacted into one file.

  * Insert 1 million entries using RandomBatchWriter with a seed of 7
  * Flush the table using the shell
  * Insert 1 million entries using RandomBatchWriter with a seed of 8
  * Flush the table using the shell
  * Insert 1 million entries using RandomBatchWriter with a seed of 9
  * Flush the table using the shell

 After following the above steps, each table will have a tablet with three map
 files. Flushing the table after each batch of inserts will create a map file.
 Each map file will contain 1 million entries generated with a different seed.
 This is assuming that Accumulo is configured with enough memory to hold 1
 million inserts. If not, then more map files will be created.

 The commands for creating the first table without bloom filters are below.

     $ ./bin/accumulo shell -u username -p password
     Shell - Apache Accumulo Interactive Shell
     - version: 1.5.0
     - instance name: instance
     - instance id: 00000000-0000-0000-0000-000000000000
     -
     - type 'help' for a list of available commands
     -
     username@instance> setauths -u username -s exampleVis
     username@instance> createtable bloom_test1
     username@instance bloom_test1> config -t bloom_test1 -s table.compaction.major.ratio=7
     username@instance bloom_test1> exit

     $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test1 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test1 -w'

 The commands for creating the second table with bloom filters are below.

     $ ./bin/accumulo shell -u username -p password
     Shell - Apache Accumulo Interactive Shell
     - version: 1.5.0
     - instance name: instance
     - instance id: 00000000-0000-0000-0000-000000000000
     -
     - type 'help' for a list of available commands
     -
     username@instance> setauths -u username -s exampleVis
     username@instance> createtable bloom_test2
     username@instance bloom_test2> config -t bloom_test2 -s table.compaction.major.ratio=7
     username@instance bloom_test2> config -t bloom_test2 -s table.bloom.enabled=true
     username@instance bloom_test2> exit

     $ ARGS="-i instance -z zookeepers -u username -p password -t bloom_test2 --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 7 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 8 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed 9 $ARGS
     $ ./bin/accumulo shell -u username -p password -e 'flush -t bloom_test2 -w'
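
 The two command sequences above differ only in the table name, so the
 insert-and-flush cycle can be sketched as a loop. This is a minimal dry-run
 sketch: the helper names run and load_table and the BLOOM_RUN variable are
 hypothetical and not part of Accumulo. By default the commands are only
 printed; the table must already have been created in the shell as shown above.

```shell
# Hypothetical helper: print a command, or execute it when BLOOM_RUN=1.
# Note that printing drops the quotes around the shell's -e argument.
run() {
  if [ "${BLOOM_RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Hypothetical helper: insert three batches with seeds 7, 8, and 9 into the
# table named by $1, flushing after each batch so each gets its own file.
load_table() {
  table="$1"
  args="-i instance -z zookeepers -u username -p password -t $table --num 1000000 --min 0 --max 1000000000 --size 50 --batchMemory 2M --batchLatency 60s --batchThreads 3 --vis exampleVis"
  for seed in 7 8 9; do
    # $args is left unquoted on purpose so it splits into separate options
    run ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchWriter --seed "$seed" $args
    run ./bin/accumulo shell -u username -p password -e "flush -t $table -w"
  done
}

plan=$(load_table bloom_test1)
printf '%s\n' "$plan"
```

 Running the sketch with BLOOM_RUN=1 in the environment would execute the same
 commands instead of printing them, assuming ./bin/accumulo and the credentials
 used in this README.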

 Below, 500 lookups are done against the table without bloom filters using
 random number generator seed 7. Even though only one map file will likely
 contain entries for this seed, all map files will be interrogated.

     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test1 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
     Generating 500 random queries...finished
     35.09 lookups/sec  14.25 secs
     num results : 500
     Generating 500 random queries...finished
     35.33 lookups/sec  14.15 secs
     num results : 500

 Below, the same lookups are done against the table with bloom filters. The
 lookups were roughly 2.8 times faster (2.86 times in the second run) because
 only one map file was used, even though three map files existed.

     $ ./bin/accumulo org.apache.accumulo.examples.simple.client.RandomBatchScanner --seed 7 -i instance -z zookeepers -u username -p password -t bloom_test2 --num 500 --min 0 --max 1000000000 --size 50 --scanThreads 20 --auths exampleVis
     Generating 500 random queries...finished
     99.03 lookups/sec   5.05 secs
     num results : 500
     Generating 500 random queries...finished
     101.15 lookups/sec   4.94 secs
     num results : 500
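
 The speedup can be checked from the lookup rates reported by the two scans of
 each table. The sketch below only redoes that arithmetic with awk. The
 variable names are chosen for this example and the numbers are copied from
 the output above.

```shell
# Lookup rates in lookups/sec, first and second run for each table.
without_bloom="35.09 35.33"   # bloom_test1
with_bloom="99.03 101.15"     # bloom_test2

# Divide the rate with bloom filters by the rate without, run by run.
speedups=$(echo "$without_bloom $with_bloom" |
  awk '{ printf "%.2f %.2f", $3 / $1, $4 / $2 }')
echo "speedup per run: $speedups"
```

 This prints a speedup of 2.82 for the first run and 2.86 for the second.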

 You can verify the table has three files by looking in HDFS. To look in HDFS
 you will need the table ID, because this is used in HDFS instead of the table
 name. The following command shows the table IDs.

     $ ./bin/accumulo shell -u username -p password -e 'tables -l'
     accumulo.metadata    =>        !0
     accumulo.root        =>        +r
     bloom_test1          =>        o7
     bloom_test2          =>        o8
     trace                =>         1

 So the table ID for bloom_test2 is o8. The command below shows what files this
 table has in HDFS. This assumes Accumulo is at the default location in HDFS.

     $ hadoop fs -lsr /accumulo/tables/o8
     drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
     -rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
     -rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
     -rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf
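
 The table ID lookup and the file count can also be scripted. The sketch below
 works on the sample output shown above, stored in shell variables, so that it
 runs without a cluster. The variable names are assumptions of this example;
 on a live system the real command output would be piped into awk and grep.

```shell
# Sample 'tables -l' output copied from this README.
tables_output='accumulo.metadata    =>        !0
accumulo.root        =>        +r
bloom_test1          =>        o7
bloom_test2          =>        o8
trace                =>         1'

# The table ID is the third whitespace-separated field on the matching line.
table_id=$(printf '%s\n' "$tables_output" |
  awk -v t=bloom_test2 '$1 == t { print $3 }')
echo "table id: $table_id"

# Sample HDFS listing copied from this README.
listing='drwxr-xr-x   - username supergroup          0 2012-01-10 14:02 /accumulo/tables/o8/default_tablet
-rw-r--r--   3 username supergroup   52672650 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dj.rf
-rw-r--r--   3 username supergroup   52436176 2012-01-10 14:01 /accumulo/tables/o8/default_tablet/F00000dk.rf
-rw-r--r--   3 username supergroup   52850173 2012-01-10 14:02 /accumulo/tables/o8/default_tablet/F00000dl.rf'

# Count the lines that end in .rf, one per file in the tablet directory.
rfile_count=$(printf '%s\n' "$listing" | grep -c '\.rf$')
echo "rfiles: $rfile_count"
```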

 Running the rfile-info command shows that one of the files has a bloom filter
 and that it is about 1.5MB in size.

     $ ./bin/accumulo rfile-info /accumulo/tables/o8/default_tablet/F00000dj.rf
     Locality group         : <DEFAULT>
       Start block          : 0
       Num   blocks         : 752
       Index level 0        : 43,598 bytes  1 blocks
       First key            : row_0000001169 foo:1 [exampleVis] 1326222052539 false
       Last key             : row_0999999421 foo:1 [exampleVis] 1326222052058 false
       Num entries          : 999,536
       Column families      : [foo]

     Meta block     : BCFile.index
       Raw size             : 4 bytes
       Compressed size      : 12 bytes
       Compression type     : gz

     Meta block     : RFile.index
       Raw size             : 43,696 bytes
       Compressed size      : 15,592 bytes
       Compression type     : gz

     Meta block     : acu_bloom
       Raw size             : 1,540,292 bytes
       Compressed size      : 1,433,115 bytes
       Compression type     : gz
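
 The size of the bloom filter can be put in relation to the number of entries
 in the file. The sketch below only does the arithmetic on two numbers from
 the rfile-info output above. The variable names are assumptions of this
 example.

```shell
bloom_bytes=1540292   # "Raw size" of the acu_bloom meta block
entries=999536        # "Num entries" of the default locality group

# Convert bytes to megabytes and compute filter bits per stored entry.
stats=$(awk -v b="$bloom_bytes" -v n="$entries" \
  'BEGIN { printf "%.2f MB, %.1f bits per entry", b / 1000000, b * 8 / n }')
echo "$stats"
```

 For this file the filter works out to 1.54 MB, or about 12.3 bits per entry.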