Title: Apache Accumulo Sampling Example
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.
http://www.apache.org/licenses/LICENSE-2.0
.
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Basic Sampling Example
----------------------
Accumulo supports building a set of sample data that can be efficiently
accessed by scanners. What data is included in the sample set is configurable.
Below, some data representing documents are inserted.
root@instance sampex> createtable sampex
root@instance sampex> insert 9255 doc content 'abcde'
root@instance sampex> insert 9255 doc url file://foo.txt
root@instance sampex> insert 8934 doc content 'accumulo scales'
root@instance sampex> insert 8934 doc url file://accumulo_notes.txt
root@instance sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano'
root@instance sampex> insert 2317 doc url file://groceries/9.txt
root@instance sampex> insert 3900 doc content 'EC2 ate my homework'
root@instance sampex> insert 3900 doc uril file://final_project.txt
Below, the table sampex is configured to build a sample set. The configuration
causes Accumulo to include any row where `murmur3_32(row) % 3 == 0` in the
table's sample data.
root@instance sampex> config -t sampex -s table.sampler.opt.hasher=murmur3_32
root@instance sampex> config -t sampex -s table.sampler.opt.modulus=3
root@instance sampex> config -t sampex -s table.sampler=org.apache.accumulo.core.client.sample.RowSampler
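The selection rule configured above can be sketched in plain Java. This is a minimal illustration only, not Accumulo's implementation: the JDK has no murmur3_32, so the sketch substitutes `CRC32` as a stand-in hash, and the class and method names are hypothetical. Because the hash differs, the rows it selects will not necessarily match the rows Accumulo selects.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch of row-based sampling; not part of the Accumulo API.
class RowSampleSketch {
    // Stand-in hash: CRC32 from the JDK, NOT the murmur3_32 hasher configured above.
    static long hash(String row) {
        CRC32 crc = new CRC32();
        crc.update(row.getBytes(StandardCharsets.UTF_8));
        return crc.getValue(); // non-negative 32-bit value held in a long
    }

    // A row belongs to the sample when hash(row) % modulus == 0.
    static boolean inSample(String row, int modulus) {
        return hash(row) % modulus == 0;
    }

    // Keep only the rows that fall in the sample, preserving input order.
    static List<String> sample(List<String> rows, int modulus) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            if (inSample(row, modulus)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("9255", "8934", "2317", "3900");
        // Prints the subset of rows whose stand-in hash is divisible by 3.
        System.out.println(sample(rows, 3));
    }
}
```

Because the decision depends only on the row, every entry in a selected row is included, and a larger modulus yields a smaller sample.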
Below, attempting to scan the sample returns an error. This is because data
was inserted before the sample set was configured.
root@instance sampex> scan --sample
2015-09-09 12:21:50,643 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built
To remedy this problem, the following command flushes in-memory data and
compacts any files that do not contain the correct sample data.
root@instance sampex> compact -t sampex --sf-no-sample
After the compaction, the sample scan works.
root@instance sampex> scan --sample
2317 doc:content [] milk, eggs, bread, parmigiano-reggiano
2317 doc:url [] file://groceries/9.txt
The commands below show that updates to data in the sample are seen when
scanning the sample.
root@instance sampex> insert 2317 doc content 'milk, eggs, bread, parmigiano-reggiano, butter'
root@instance sampex> scan --sample
2317 doc:content [] milk, eggs, bread, parmigiano-reggiano, butter
2317 doc:url [] file://groceries/9.txt
In order to make scanning the sample fast, sample data is partitioned as data is
written to Accumulo. This means that if the sample configuration is changed,
data written previously is partitioned using different criteria. Accumulo
will detect this situation and fail sample scans. The commands below show this
failure and how a compaction fixes the problem.
root@instance sampex> config -t sampex -s table.sampler.opt.modulus=2
root@instance sampex> scan --sample
2015-09-09 12:22:51,058 [shell.Shell] ERROR: org.apache.accumulo.core.client.SampleNotPresentException: Table sampex(ID:2) does not have sampling configured or built
root@instance sampex> compact -t sampex --sf-no-sample
2015-09-09 12:23:07,242 [shell.Shell] INFO : Compaction of table sampex started for given range
root@instance sampex> scan --sample
2317 doc:content [] milk, eggs, bread, parmigiano-reggiano
2317 doc:url [] file://groceries/9.txt
3900 doc:content [] EC2 ate my homework
3900 doc:uril [] file://final_project.txt
9255 doc:content [] abcde
9255 doc:url [] file://foo.txt
The example above is replicated in a Java program using the Accumulo API.
The command below runs that program by its class name.
./bin/accumulo org.apache.accumulo.examples.simple.sample.SampleExample -i instance -z localhost -u root -p secret
The commands below look under the hood to give some insight into how this
feature works. The commands determine what files the sampex table is using.
root@instance sampex> tables -l
accumulo.metadata => !0
accumulo.replication => +rep
accumulo.root => +r
sampex => 2
trace => 1
root@instance sampex> scan -t accumulo.metadata -c file -b 2 -e 2<
2< file:hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf [] 702,8
The output below comes from running `accumulo rfile-info` on the file above. It
shows that the rfile has a normal default locality group and a sample default
locality group. The output also shows the configuration used to create the
sample locality group. The sample configuration within an rfile must match the
table's sample configuration for a sample scan to work.
$ ./bin/accumulo rfile-info hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf
Reading file: hdfs://localhost:10000/accumulo/tables/2/default_tablet/A000000s.rf
RFile Version : 8
Locality group : <DEFAULT>
Start block : 0
Num blocks : 1
Index level 0 : 35 bytes 1 blocks
First key : 2317 doc:content [] 1437672014986 false
Last key : 9255 doc:url [] 1437672014875 false
Num entries : 8
Column families : [doc]
Sample Configuration :
Sampler class : org.apache.accumulo.core.client.sample.RowSampler
Sampler options : {hasher=murmur3_32, modulus=2}
Sample Locality group : <DEFAULT>
Start block : 0
Num blocks : 1
Index level 0 : 36 bytes 1 blocks
First key : 2317 doc:content [] 1437672014986 false
Last key : 9255 doc:url [] 1437672014875 false
Num entries : 6
Column families : [doc]
Meta block : BCFile.index
Raw size : 4 bytes
Compressed size : 12 bytes
Compression type : gz
Meta block : RFile.index
Raw size : 309 bytes
Compressed size : 176 bytes
Compression type : gz
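The matching rule described above (each file's sample configuration must equal the table's) can be sketched as follows. This is an illustrative model under stated assumptions, not Accumulo code: the `Config` class and `sampleScanAllowed` method are hypothetical names, and a configuration is reduced to a hasher name plus a modulus.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch of the "file config must match table config" check.
class SampleConfigCheckSketch {
    // Simplified stand-in for a sampler configuration.
    static final class Config {
        final String hasher;
        final int modulus;

        Config(String hasher, int modulus) {
            this.hasher = hasher;
            this.modulus = modulus;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Config)) {
                return false;
            }
            Config c = (Config) o;
            return hasher.equals(c.hasher) && modulus == c.modulus;
        }

        @Override
        public int hashCode() {
            return Objects.hash(hasher, modulus);
        }
    }

    // A sample scan is allowed only if every file was written with the
    // table's current configuration; a null entry models a file with no sample.
    static boolean sampleScanAllowed(Config tableConfig, List<Config> fileConfigs) {
        for (Config fileConfig : fileConfigs) {
            if (!tableConfig.equals(fileConfig)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Config old = new Config("murmur3_32", 3);
        Config current = new Config("murmur3_32", 2);
        // File still carries the old modulus: the sample scan must fail.
        System.out.println(sampleScanAllowed(current, Arrays.asList(old)));     // false
        // After a compaction rewrites the file with the current config.
        System.out.println(sampleScanAllowed(current, Arrays.asList(current))); // true
    }
}
```

This mirrors the earlier shell session: changing the modulus made the existing file's sample stale, and the compaction rewrote it with the new configuration.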
Shard Sampling Example
-------------------------
`README.shard` shows how to index and search files using Accumulo. That
example indexes documents into a table named `shard`. The indexing scheme used
in that example places the document name in the column qualifier. A useful
sample of this indexing scheme should contain all data for any document in the
sample. To accomplish this, the following commands build a sample for the
shard table based on the column qualifier.
root@instance shard> config -t shard -s table.sampler.opt.hasher=murmur3_32
root@instance shard> config -t shard -s table.sampler.opt.modulus=101
root@instance shard> config -t shard -s table.sampler.opt.qualifier=true
root@instance shard> config -t shard -s table.sampler=org.apache.accumulo.core.client.sample.RowColumnSampler
root@instance shard> compact -t shard --sf-no-sample -w
2015-07-23 15:00:09,280 [shell.Shell] INFO : Compacting table ...
2015-07-23 15:00:10,134 [shell.Shell] INFO : Compaction of table shard completed for given range
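To illustrate why sampling on the column qualifier keeps all data for a document together, here is a minimal Java sketch. It is not the `RowColumnSampler` implementation: `CRC32` stands in for murmur3_32, the `Entry` class is hypothetical, and the row and column family values are made up for the example. Only the qualifier (the document name) feeds the hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch of qualifier-based sampling; not part of the Accumulo API.
class QualifierSampleSketch {
    // Simplified stand-in for an index entry in the shard table.
    static final class Entry {
        final String row;
        final String family;
        final String qualifier; // document name in the shard indexing scheme

        Entry(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Only the qualifier is hashed, so row and family never affect the decision.
    // CRC32 is a stand-in for the murmur3_32 hasher configured above.
    static boolean inSample(Entry e, int modulus) {
        CRC32 crc = new CRC32();
        crc.update(e.qualifier.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % modulus == 0;
    }

    public static void main(String[] args) {
        // Two entries for the same (made-up) document under different terms.
        Entry a = new Entry("shard07", "import", "Foo.java");
        Entry b = new Entry("shard07", "int", "Foo.java");
        // Both entries always receive the same in-or-out decision.
        System.out.println(inSample(a, 101) == inSample(b, 101)); // true
    }
}
```

Since every entry for a document shares one qualifier, a document is either entirely in the sample or entirely out of it, which is what an intersection query over the sample needs.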
After enabling sampling, the command below counts the number of documents in
the sample containing the words `import` and `int`.
$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Query --sample -i instance16 -z localhost -t shard -u root -p secret import int | fgrep '.java' | wc
11 11 1246
The command below counts the total number of documents containing the words
`import` and `int`.
$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Query -i instance16 -z localhost -t shard -u root -p secret import int | fgrep '.java' | wc
1085 1085 118175
The counts, 11 out of 1085 total, are around what would be expected for a
modulus of 101, since 1085 / 101 is about 10.7. Querying the sample first
provides a quick way to estimate how much data the real query will bring back.
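The arithmetic behind that estimate can be written out as a small Java sketch. The class and method names are hypothetical; the only assumption is that a sample built with `hash % modulus == 0` holds roughly one in every `modulus` documents.

```java
import java.util.Locale;

// Hypothetical helper for scaling sample counts; not part of the Accumulo examples.
class SampleEstimateSketch {
    // Scale a count observed in the sample up to an estimate for the full table.
    static long estimateTotal(long sampleCount, int modulus) {
        return sampleCount * modulus;
    }

    // Expected number of sample hits for a known full-table count.
    static double expectedSampleCount(long totalCount, int modulus) {
        return (double) totalCount / modulus;
    }

    public static void main(String[] args) {
        // 11 documents matched in the sample, modulus 101.
        System.out.println(estimateTotal(11, 101)); // 1111
        // 1085 documents matched in the full table.
        System.out.println(String.format(Locale.ROOT, "%.1f", expectedSampleCount(1085, 101))); // 10.7
    }
}
```

The estimate of 1111 is close to the actual count of 1085, which is about as accurate as a sample of this size can be expected to be.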
Another way sample data could be used with the shard example is with a
specialized iterator. In the examples source code there is an iterator named
CutoffIntersectingIterator. This iterator first checks how many documents are
found in the sample data. If too many documents are found in the sample data,
then it returns nothing. Otherwise it proceeds to query the full data set.
To experiment with this iterator, use the following command. The
`--sampleCutoff` option below causes the query to return nothing if, based
on the sample, it appears the query would return more than 1000 documents.
$ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Query --sampleCutoff 1000 -i instance16 -z localhost -t shard -u root -p secret import int | fgrep '.java' | wc
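The cutoff behavior described above can be sketched in plain Java. This is a hedged model, not the `CutoffIntersectingIterator` source: the class and method names are hypothetical, and the way the cutoff is scaled down to a sample-level limit (dividing by the modulus and rounding) is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a sample-gated query; not the iterator's actual code.
class CutoffQuerySketch {
    // Assumption: a full-table cutoff is translated into a limit on sample hits
    // by dividing by the sampling modulus and rounding to the nearest integer.
    static int sampleLimit(int cutoff, int modulus) {
        return Math.round(cutoff / (float) modulus);
    }

    // Return nothing when the sample suggests too many results;
    // otherwise run the (possibly expensive) full query.
    static List<String> query(int sampleMatches, int cutoff, int modulus,
                              Supplier<List<String>> fullQuery) {
        if (sampleMatches > sampleLimit(cutoff, modulus)) {
            return Collections.emptyList();
        }
        return fullQuery.get();
    }

    public static void main(String[] args) {
        // Made-up result list standing in for the full query.
        Supplier<List<String>> full = () -> Arrays.asList("A.java", "B.java");
        // 11 sample hits exceed the limit of 10 derived from cutoff 1000 and modulus 101.
        System.out.println(query(11, 1000, 101, full)); // []
        // 5 sample hits are under the limit, so the full query runs.
        System.out.println(query(5, 1000, 101, full));  // [A.java, B.java]
    }
}
```

The design point is that the cheap sample lookup runs first and the full intersection is attempted only when the sample indicates the result set is small enough.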