Title: Apache Accumulo Shard Example
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Accumulo has an iterator called the intersecting iterator which supports querying a term index that is partitioned by
document, or "sharded". This example shows how to use the intersecting iterator through these four programs:
* Index - Indexes a set of text files into an Accumulo table.
* Query - Finds documents containing a given set of terms.
* Reverse - Reads the index table and writes a map of documents to terms into another table.
* ContinuousQuery - Uses the table populated by Reverse to select N random terms per document, then continuously queries randomly chosen sets of those terms.
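
To make the idea of a document-partitioned index concrete, here is a minimal in-memory sketch. It is not Accumulo code and does not use the intersecting iterator; the class name `ShardedIndexSketch`, its methods, and the hash-based shard assignment are assumptions made for illustration only. It shows why sharding by document helps: all terms of a document live in one shard, so a multi-term query can be answered by intersecting term lists inside each shard independently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory model of a document-partitioned ("sharded") term index. Hypothetical, not Accumulo code. */
final class ShardedIndexSketch {
  // One term -> document-ID map per shard; a shard only knows about its own documents.
  private final List<Map<String, Set<String>>> shards = new ArrayList<>();

  ShardedIndexSketch(int numShards) {
    for (int i = 0; i < numShards; i++) {
      shards.add(new HashMap<>());
    }
  }

  /** Stores every term of a document in the single shard chosen for that document. */
  void index(String docId, String... terms) {
    // Assumption: hash partitioning. The real example may assign partitions differently.
    int shard = Math.floorMod(docId.hashCode(), shards.size());
    for (String term : terms) {
      shards.get(shard).computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }
  }

  /** Returns the IDs of documents that contain all given terms. */
  Set<String> query(String... terms) {
    Set<String> matches = new TreeSet<>();
    if (terms.length == 0) {
      return matches;
    }
    for (Map<String, Set<String>> shard : shards) {
      // Intersect the per-term document sets within this shard only.
      Set<String> candidates = null;
      for (String term : terms) {
        Set<String> docs = shard.getOrDefault(term, Collections.emptySet());
        if (candidates == null) {
          candidates = new TreeSet<>(docs);
        } else {
          candidates.retainAll(docs);
        }
      }
      matches.addAll(candidates);
    }
    return matches;
  }

  public static void main(String[] args) {
    ShardedIndexSketch index = new ShardedIndexSketch(4);
    index.index("A.java", "foo", "bar", "public");
    index.index("B.java", "foo", "public");
    index.index("C.java", "bar", "foo");
    System.out.println(index.query("foo", "bar")); // prints [A.java, C.java]
  }
}
```

Because each shard is queried on its own, the per-shard intersections could run in parallel, which is the property the sharded layout is meant to provide.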
To run these example programs, create two tables as shown below.

    username@instance> createtable shard
    username@instance shard> createtable doc2term

After creating the tables, index some files. The following command indexes all of the Java files in the Accumulo source code.

    $ cd /local/username/workspace/accumulo/
    $ find core/src server/src -name "*.java" | xargs ./bin/accumulo org.apache.accumulo.examples.simple.shard.Index -i instance -z zookeepers -t shard -u username -p password --partitions 30

The following command queries the index to find all files containing 'foo' and 'bar'.

    $ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Query -i instance -z zookeepers -t shard -u username -p password foo bar

Before running ContinuousQuery, we need to run Reverse to populate doc2term.

    $ ./bin/accumulo org.apache.accumulo.examples.simple.shard.Reverse -i instance -z zookeepers --shardTable shard --doc2Term doc2term -u username -p password

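
The reversal step can be pictured as inverting a map. The sketch below works on plain in-memory maps rather than Accumulo tables; the class `ReverseSketch` and its `reverse` method are hypothetical names used for illustration, and the sketch makes no claim about how the actual program reads or writes its tables.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy illustration of turning a term -> documents index into a document -> terms map. */
final class ReverseSketch {
  static Map<String, Set<String>> reverse(Map<String, Set<String>> termToDocs) {
    Map<String, Set<String>> docToTerms = new TreeMap<>();
    for (Map.Entry<String, Set<String>> entry : termToDocs.entrySet()) {
      String term = entry.getKey();
      // Record the term under every document that contains it.
      for (String doc : entry.getValue()) {
        docToTerms.computeIfAbsent(doc, d -> new TreeSet<>()).add(term);
      }
    }
    return docToTerms;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> termToDocs = new HashMap<>();
    termToDocs.put("foo", new TreeSet<>(Arrays.asList("A.java", "B.java")));
    termToDocs.put("bar", new TreeSet<>(Arrays.asList("A.java")));
    System.out.println(reverse(termToDocs)); // prints {A.java=[bar, foo], B.java=[foo]}
  }
}
```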
Below, ContinuousQuery is run with 5 terms. It selects 5 random terms from each document, then repeatedly picks one of those 5-term sets at random and queries for it. For each query it prints the terms, the number of matching documents, and the query time in seconds.

    $ ./bin/accumulo org.apache.accumulo.examples.simple.shard.ContinuousQuery -i instance -z zookeepers --shardTable shard --doc2Term doc2term -u username -p password --terms 5
    [public, core, class, binarycomparable, b] 2 0.081
    [wordtodelete, unindexdocument, doctablename, putdelete, insert] 1 0.041
    [import, columnvisibilityinterpreterfactory, illegalstateexception, cv, columnvisibility] 1 0.049
    [getpackage, testversion, util, version, 55] 1 0.048
    [for, static, println, public, the] 55 0.211
    [sleeptime, wrappingiterator, options, long, utilwaitthread] 1 0.057
    [string, public, long, 0, wait] 12 0.132
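
The select-then-query loop described above can be sketched roughly as follows. This is an in-memory approximation, not the actual ContinuousQuery source: the class and method names are hypothetical, the loop runs a fixed 5 iterations instead of continuously, documents with fewer than N terms are skipped as an assumption, and matching is done by scanning a map rather than querying the shard table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy version of "pick N random terms per document, then query random term sets". */
final class ContinuousQuerySketch {
  /** Picks n random terms from each document that has at least n terms. */
  static List<List<String>> selectTerms(Map<String, Set<String>> docToTerms, int n, Random rand) {
    List<List<String>> termSets = new ArrayList<>();
    for (Set<String> terms : docToTerms.values()) {
      if (terms.size() < n) {
        continue; // assumption: documents with too few terms are skipped
      }
      List<String> shuffled = new ArrayList<>(terms);
      Collections.shuffle(shuffled, rand);
      termSets.add(new ArrayList<>(shuffled.subList(0, n)));
    }
    return termSets;
  }

  /** Counts documents containing every term; stands in for a real index query. */
  static int countMatches(Map<String, Set<String>> docToTerms, List<String> terms) {
    int count = 0;
    for (Set<String> docTerms : docToTerms.values()) {
      if (docTerms.containsAll(terms)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Map<String, Set<String>> docToTerms = new TreeMap<>();
    docToTerms.put("A.java", new TreeSet<>(Arrays.asList("public", "class", "foo", "bar")));
    docToTerms.put("B.java", new TreeSet<>(Arrays.asList("public", "class", "long")));
    docToTerms.put("C.java", new TreeSet<>(Arrays.asList("static", "foo", "wait")));

    Random rand = new Random();
    List<List<String>> termSets = selectTerms(docToTerms, 2, rand);
    for (int i = 0; i < 5; i++) { // the real program keeps going; 5 rounds suffice here
      List<String> terms = termSets.get(rand.nextInt(termSets.size()));
      long start = System.nanoTime();
      int count = countMatches(docToTerms, terms);
      double seconds = (System.nanoTime() - start) / 1e9;
      // Same shape as the output above: terms, match count, elapsed seconds.
      System.out.printf("%s %d %.3f%n", terms, count, seconds);
    }
  }
}
```

Since every term set is drawn from a single document, each query matches at least that document, which is consistent with every line of the sample output reporting a count of 1 or more.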