Accumulo has an iterator called the intersecting iterator which supports querying a term index that is partitioned by document, or “sharded”. This example shows how to use the intersecting iterator through these four programs:
To run these example programs, create two tables like below.
username@instance> createnamespace examples username@instance> createtable examples.shard username@instance examples.shard> createtable examples.doc2term
After creating the tables, index some files. The following command indexes all the java files in the Accumulo source code.
$ find /path/to/accumulo/core -name "*.java" | xargs ./bin/runex shard.Index -t examples.shard --partitions 30
The following command queries the index to find all files containing ‘foo’ and ‘bar’.
$ ./bin/runex shard.Query -t examples.shard foo bar /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/BaseHostRegexTableLoadBalancerTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeRowIteratorTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iteratorsImpl/IteratorConfigUtilTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyBuilderTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerReconfigurationTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/ColumnVisibilityTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/summary/SummaryCollectionTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/IteratorSettingTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyExtentTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/VisibilityEvaluatorTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/admin/NewTableConfigurationTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/conf/HadoopCredentialProviderTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/clientImpl/TableOperationsHelperTest.java /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeColumnFamilyIteratorTest.java
In order to run ContinuousQuery, we need to run Reverse.java to populate the examples.doc2term
table.
$ ./bin/runex shard.Reverse --shardTable examples.shard --doc2Term examples.doc2term
Below ContinuousQuery is run using 5 terms. So it selects 5 random terms from each document, then it continually randomly selects one set of 5 terms and queries. It prints the number of matching documents and the time in seconds.
$ ./bin/runex shard.ContinuousQuery --shardTable examples.shard --doc2Term examples.doc2term --terms 5 [string, protected, sizeopt, cache, build] 1 0.084 [public, these, exception, to, as] 25 0.267 [by, encodeprevendrow, 0, work, as] 4 0.056 [except, to, a, limitations, one] 969 0.197 [copy, as, asf, version, is] 969 0.341 [core, class, may, regarding, without] 862 0.437 [max_data_to_print, default_visibility_cache_size, use, accumulo_export_info, fate] 1 0.066