| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| # Apache Accumulo Sampling Example |
| |
| Basic Sampling Example |
| ---------------------- |
| |
| Accumulo supports building a set of sample data that can be efficiently |
| accessed by scanners. What data is included in the sample set is configurable. |
| Below, some data representing documents are inserted. |
| |
| root@instance> createnamespace examples |
| root@instance> createtable examples.sampex |
| root@instance examples.sampex> insert 9255 doc content 'abcde' |
| root@instance examples.sampex> insert 9255 doc url file://foo.txt |
| root@instance examples.sampex> insert 8934 doc content 'accumulo scales' |
| root@instance examples.sampex> insert 8934 doc url file://accumulo_notes.txt |
| root@instance examples.sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano' |
| root@instance examples.sampex> insert 2317 doc url file://groceries/9.txt |
| root@instance examples.sampex> insert 3900 doc content 'EC2 ate my homework' |
| root@instance examples.sampex> insert 3900 doc uril file://final_project.txt |
| |
| Below, the table `examples.sampex` is configured to build a sample set. The configuration |
| causes Accumulo to include any row where `murmur3_32(row) % 3 == 0` in the |
| table's sample data. |
| |
| root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.hasher=murmur3_32 |
| root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.modulus=3 |
| root@instance examples.sampex> config -t examples.sampex -s table.sampler=org.apache.accumulo.core.client.sample.RowSampler |
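
As a rough illustration of the rule `hash(row) % modulus == 0`, the Python sketch below splits the example rows into sampled and unsampled rows. It is not Accumulo's implementation: `zlib.crc32` stands in for murmur3_32 because the Python standard library has no murmur3, so the rows it selects will generally differ from the ones Accumulo selects. The function name `in_sample` is hypothetical.

```python
import zlib

def in_sample(row: bytes, modulus: int) -> bool:
    """Return True if the row's hash falls in the sample bucket.

    Stand-in hash: CRC32, NOT murmur3_32, so results differ from Accumulo.
    """
    return zlib.crc32(row) % modulus == 0

rows = [b"9255", b"8934", b"2317", b"3900"]

# The decision depends only on the row, so every key in a row gets the same
# answer: a row is either entirely in the sample or entirely out of it.
sample = [r for r in rows if in_sample(r, 3)]
print(sample)
```

Because the decision is a pure function of the row, the same row is always classified the same way, which is what lets the sample be built incrementally as data is written.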
| |
| Below, attempting to scan the sample returns an error. This is because data |
| was inserted before the sample set was configured. |
| |
| root@instance examples.sampex> scan --sample |
| 2015-09-09 12:21:50,643 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built |
| |
| To remedy this problem, the following command flushes in-memory data and |
| compacts any files that do not contain the correct sample data. |
| |
| root@instance examples.sampex> compact -t examples.sampex --sf-no-sample |
| |
| After the compaction, the sample scan works. |
| |
| root@instance examples.sampex> scan --sample |
| 2317 doc:content [] milk, eggs, bread, parmigiano-reggiano |
| 2317 doc:url [] file://groceries/9.txt |
| |
| The commands below show that updates to data in the sample are seen when |
| scanning the sample. |
| |
| root@instance examples.sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano, butter' |
| root@instance examples.sampex> scan --sample |
| 2317 doc:content [] milk, eggs, bread, parmigiano-reggiano, butter |
| 2317 doc:url [] file://groceries/9.txt |
| |
| To make scanning the sample fast, sample data is partitioned as data is |
| written to Accumulo. This means that if the sample configuration is changed, |
| data written previously remains partitioned using the old criteria. Accumulo |
| detects this situation and fails sample scans. The commands below show this |
| failure and how a compaction fixes the problem. |
| |
| root@instance examples.sampex> config -t examples.sampex -s table.sampler.opt.modulus=2 |
| root@instance examples.sampex> scan --sample |
| 2015-09-09 12:22:51,058 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built |
| root@instance examples.sampex> compact -t examples.sampex --sf-no-sample |
| 2015-09-09 12:23:07,242 [shell.Shell] INFO : Compaction of table sampex started for given range |
| root@instance examples.sampex> scan --sample |
| 2317 doc:content [] milk, eggs, bread, parmigiano-reggiano, butter |
| 2317 doc:url [] file://groceries/9.txt |
| 3900 doc:content [] EC2 ate my homework |
| 3900 doc:uril [] file://final_project.txt |
| 9255 doc:content [] abcde |
| 9255 doc:url [] file://foo.txt |
| |
| The example above is replicated in a Java program using the Accumulo API. |
| The program name and the command to run it are shown below. |
| |
| ./bin/runex sample.SampleExample |
| |
| The commands below look under the hood to give some insight into how this |
| feature works. The commands determine what files the sampex table is using. |
| |
| root@instance> tables -l |
| accumulo.metadata => !0 |
| accumulo.replication => +rep |
| accumulo.root => +r |
| examples.sampex => 2 |
| trace => 1 |
| root@instance sampex> scan -t accumulo.metadata -c file -b 2 -e 2< |
| 2< file:hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf [] 702,8 |
| |
| The output below comes from running `accumulo rfile-info` on the file above. It shows that the |
| RFile has a normal default locality group and a sample default locality group. |
| The output also shows the configuration used to create the sample locality |
| group. The sample configuration within an RFile must match the table's sample |
| configuration for sample scans to work. |
| |
| $ accumulo rfile-info hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf |
| Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf |
| RFile Version : 8 |
| |
| Locality group : <DEFAULT> |
| Start block : 0 |
| Num blocks : 1 |
| Index level 0 : 35 bytes 1 blocks |
| First key : 2317 doc:content [] 1437672014986 false |
| Last key : 9255 doc:url [] 1437672014875 false |
| Num entries : 8 |
| Column families : [doc] |
| |
| Sample Configuration : |
| Sampler class : org.apache.accumulo.core.client.sample.RowSampler |
| Sampler options : {hasher=murmur3_32, modulus=2} |
| |
| Sample Locality group : <DEFAULT> |
| Start block : 0 |
| Num blocks : 1 |
| Index level 0 : 36 bytes 1 blocks |
| First key : 2317 doc:content [] 1437672014986 false |
| Last key : 9255 doc:url [] 1437672014875 false |
| Num entries : 6 |
| Column families : [doc] |
| |
| Meta block : BCFile.index |
| Raw size : 4 bytes |
| Compressed size : 12 bytes |
| Compression type : gz |
| |
| Meta block : RFile.index |
| Raw size : 309 bytes |
| Compressed size : 176 bytes |
| Compression type : gz |
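
The matching requirement described above can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not Accumulo's code: the names `SamplerConfig`, `make_config`, `check_sample_scan`, and `SampleNotPresent` are hypothetical, and the model only compares the configuration recorded per file with the table's current configuration.

```python
from dataclasses import dataclass

class SampleNotPresent(Exception):
    """Hypothetical stand-in for Accumulo's SampleNotPresentException."""

@dataclass(frozen=True)
class SamplerConfig:
    sampler_class: str
    options: tuple  # sorted (key, value) pairs so configs compare by value

def make_config(sampler_class, **options):
    return SamplerConfig(sampler_class, tuple(sorted(options.items())))

def check_sample_scan(table_config, file_configs):
    """Raise unless every file's sample was built with the table's config.

    file_configs holds one entry per file; None means the file has no sample.
    """
    for cfg in file_configs:
        if cfg != table_config:
            raise SampleNotPresent("sample missing or built with a different configuration")

ROW_SAMPLER = "org.apache.accumulo.core.client.sample.RowSampler"
old = make_config(ROW_SAMPLER, hasher="murmur3_32", modulus="3")
new = make_config(ROW_SAMPLER, hasher="murmur3_32", modulus="2")

check_sample_scan(new, [new])        # passes: file was rewritten under the new config
try:
    check_sample_scan(new, [old])    # fails: file still holds the modulus=3 sample
except SampleNotPresent as e:
    print("scan fails:", e)
```

In this model, a compaction corresponds to replacing the stale entries in `file_configs` with the table's current configuration, after which the check passes again.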
| |
| |
| Shard Sampling Example |
| ---------------------- |
| |
| Note: Before continuing, you need to complete the Shard example, located [here][shard]. |
| |
| The Shard example shows how to index and search files using Accumulo. That |
| example indexes documents into a table named `examples.shard`. The indexing scheme used |
| in that example places the document name in the column qualifier. A useful |
| sample of this indexing scheme should contain all data for any document in the |
| sample. To accomplish this, the following commands build a sample for the |
| shard table based on the column qualifier. |
| |
| root@instance examples.shard> config -t examples.shard -s table.sampler.opt.hasher=murmur3_32 |
| root@instance examples.shard> config -t examples.shard -s table.sampler.opt.modulus=101 |
| root@instance examples.shard> config -t examples.shard -s table.sampler.opt.qualifier=true |
| root@instance examples.shard> config -t examples.shard -s table.sampler=org.apache.accumulo.core.client.sample.RowColumnSampler |
| root@instance examples.shard> compact -t examples.shard --sf-no-sample -w |
| 2015-07-23 15:00:09,280 [shell.Shell] INFO : Compacting table ... |
| 2015-07-23 15:00:10,134 [shell.Shell] INFO : Compaction of table shard completed for given range |
| |
| After enabling sampling, the command below counts the number of documents in |
| the sample containing the words `import` and `int`. |
| |
| $ ./bin/runex shard.Query --sample -t examples.shard import int | fgrep '.java' | wc |
| 4 4 395 |
| |
| The command below counts the total number of documents containing the words |
| `import` and `int`. |
| |
| $ ./bin/runex shard.Query -t examples.shard import int | fgrep '.java' | wc |
| 382 382 40084 |
| |
| The count of 4 sampled documents out of 382 total is around what would be expected for a modulus |
| of 101. Querying the sample first provides a quick way to estimate how much data |
| the real query will bring back. |
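
The estimate can be worked through with the numbers from the two `wc` outputs above, where the first column is the line count, one line per matching document. This is plain arithmetic. The variable names are only for illustration.

```python
modulus = 101      # table.sampler.opt.modulus from the configuration above
sample_hits = 4    # line count reported by wc for the --sample query
full_hits = 382    # line count reported by wc for the full query

# Roughly 1 in `modulus` documents lands in the sample, so scale up.
estimate = sample_hits * modulus
print(estimate)                        # 404

# Going the other way: how many sample hits the full result would predict.
print(round(full_hits / modulus, 2))   # 3.78
```

The estimate of 404 is within about 6% of the actual 382, and was obtained by reading only the much smaller sample.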
| |
| Another way sample data could be used with the shard example is with a |
| specialized iterator. In the examples source code there is an iterator named |
| CutoffIntersectingIterator. This iterator first checks how many documents are |
| found in the sample data. If too many documents are found in the sample data, |
| then it returns nothing. Otherwise, it proceeds to query the full data set. |
| To experiment with this iterator, use the following command. The |
| `--sampleCutoff` option below causes the query to return nothing if, based |
| on the sample, it appears the query would return more than 1000 documents. |
| |
| $ ./bin/runex shard.Query --sampleCutoff 1000 -t examples.shard import int | fgrep '.java' | wc |
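
The check-the-sample-first idea behind the iterator can be sketched in a few lines of Python. This is only a model of the behavior described above, not the iterator's actual code: `cutoff_query` and its parameters are hypothetical, and how the real iterator relates the `--sampleCutoff` value to a count of sample hits is not shown here.

```python
def cutoff_query(count_sample_hits, run_full_query, max_sample_hits):
    """Run the full query only if the sample suggests a small result.

    All three parameters are hypothetical: two callables standing in for
    the sample scan and the full scan, and a threshold on sample hits.
    """
    if count_sample_hits() > max_sample_hits:
        return []              # too many hits in the sample: return nothing
    return run_full_query()    # otherwise query the full data set

docs = ["A.java", "B.java", "C.java"]
print(cutoff_query(lambda: 1, lambda: docs, max_sample_hits=2))  # ['A.java', 'B.java', 'C.java']
print(cutoff_query(lambda: 5, lambda: docs, max_sample_hits=2))  # []
```

The design trades a small, cheap scan of the sample for the ability to skip an expensive full scan whose result would be too large to be useful.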
| |
| [shard]: shard.md |