Apache Accumulo Shard Example

Accumulo has an iterator called the intersecting iterator which supports querying a term index that is partitioned by document, or “sharded”. This example shows how to use the intersecting iterator through these four programs:

  • Index.java - Indexes a set of text files into an Accumulo table
  • Query.java - Finds documents containing a given set of terms.
  • Reverse.java - Reads the index table and writes a map of documents to terms into another table.
  • ContinuousQuery.java - Uses the table populated by Reverse.java to select N random terms per document. Then it continuously and randomly queries those terms.

To run these example programs, create two tables like below.

username@instance> createnamespace examples
username@instance> createtable examples.shard
username@instance examples.shard> createtable examples.doc2term

After creating the tables, index some files. The following command indexes all the java files in the Accumulo source code.

$ find /path/to/accumulo/core -name "*.java" | xargs ./bin/runex shard.Index -t examples.shard --partitions 30

The following command queries the index to find all files containing ‘foo’ and ‘bar’.

$ ./bin/runex shard.Query -t examples.shard foo bar
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/BaseHostRegexTableLoadBalancerTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeRowIteratorTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iteratorsImpl/IteratorConfigUtilTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyBuilderTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerReconfigurationTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/ColumnVisibilityTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/summary/SummaryCollectionTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/IteratorSettingTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyExtentTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/VisibilityEvaluatorTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/admin/NewTableConfigurationTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/conf/HadoopCredentialProviderTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/clientImpl/TableOperationsHelperTest.java
  /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeColumnFamilyIteratorTest.java

In order to run ContinuousQuery, we need to run Reverse.java to populate the examples.doc2term table.

$ ./bin/runex shard.Reverse --shardTable examples.shard --doc2Term examples.doc2term

Below ContinuousQuery is run using 5 terms. So it selects 5 random terms from each document, then it continually randomly selects one set of 5 terms and queries. It prints the number of matching documents and the time in seconds.

$ ./bin/runex shard.ContinuousQuery --shardTable examples.shard --doc2Term examples.doc2term --terms 5
  [string, protected, sizeopt, cache, build] 1  0.084
  [public, these, exception, to, as] 25  0.267   
  [by, encodeprevendrow, 0, work, as] 4  0.056
  [except, to, a, limitations, one] 969  0.197           
  [copy, as, asf, version, is] 969  0.341                                                 
  [core, class, may, regarding, without] 862  0.437
  [max_data_to_print, default_visibility_cache_size, use, accumulo_export_info, fate] 1  0.066