<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Accumulo Shard Example
Accumulo provides the intersecting iterator, which supports querying a term index that is partitioned by
document, or "sharded". This example shows how to use the intersecting iterator through these four programs:
* [Index.java] - Indexes a set of text files into an Accumulo table.
* [Query.java] - Finds documents containing a given set of terms.
* [Reverse.java] - Reads the index table and writes a map of documents to terms into another table.
* [ContinuousQuery.java] - Uses the table populated by Reverse.java to select N random terms per document. Then it continuously and randomly queries those terms.
To run these example programs, create two tables in the Accumulo shell as shown below.

    username@instance> createnamespace examples
    username@instance> createtable examples.shard
    username@instance examples.shard> createtable examples.doc2term

After creating the tables, index some files. The following command indexes all the Java files in the Accumulo source code.

    $ find /path/to/accumulo/core -name "*.java" | xargs ./bin/runex shard.Index -t examples.shard --partitions 30

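To make the `--partitions` option more concrete, here is a minimal in-memory sketch of a document-partitioned term index. It is not the code in Index.java and it uses no Accumulo API. The class name `ShardIndexSketch`, the hash-modulo partition choice, the 8-digit hex shard name, and the tokenizer are all assumptions made for illustration. The idea it shows is that each document is assigned to exactly one shard, and every term of that document is stored in that shard.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a sharded term index (not Accumulo API).
class ShardIndexSketch {
  // shard id -> term -> ids of the documents in that shard containing the term
  final Map<String, Map<String, Set<String>>> shards = new TreeMap<>();
  final int numPartitions;

  ShardIndexSketch(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // Assumed scheme: hash the document id into a fixed number of partitions.
  String partitionFor(String docId) {
    return String.format("%08x", Math.floorMod(docId.hashCode(), numPartitions));
  }

  // Store every term of the document in the single shard the document maps to.
  void index(String docId, String text) {
    Map<String, Set<String>> shard =
        shards.computeIfAbsent(partitionFor(docId), k -> new TreeMap<>());
    for (String token : text.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        shard.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
      }
    }
  }

  public static void main(String[] args) {
    ShardIndexSketch idx = new ShardIndexSketch(30);
    idx.index("/path/to/A.java", "public class A { int foo; int bar; }");
    idx.index("/path/to/B.java", "public class B { int foo; }");
    // Print each shard with the terms it holds.
    idx.shards.forEach((shard, terms) -> System.out.println(shard + " " + terms));
  }
}
```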
The following command queries the index to find all files containing 'foo' and 'bar'.

    $ ./bin/runex shard.Query -t examples.shard foo bar
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/BaseHostRegexTableLoadBalancerTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeRowIteratorTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iteratorsImpl/IteratorConfigUtilTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyBuilderTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerReconfigurationTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/ColumnVisibilityTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/summary/SummaryCollectionTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/spi/balancer/HostRegexTableLoadBalancerTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/IteratorSettingTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/data/KeyExtentTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/security/VisibilityEvaluatorTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/TransformingIteratorTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/client/admin/NewTableConfigurationTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/conf/HadoopCredentialProviderTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/clientImpl/TableOperationsHelperTest.java
    /path/to/accumulo/core/src/test/java/org/apache/accumulo/core/iterators/user/WholeColumnFamilyIteratorTest.java

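Conceptually, a multi-term query over a document-partitioned index can be answered one shard at a time, because all terms of a given document live in the same shard. The sketch below mimics only that result in plain Java. The real intersecting iterator runs on the server side against sorted key/value data, and nothing here is Accumulo API. The class `ShardQuerySketch` and the nested-map layout (shard id, then term, then document ids) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical per-shard intersection over an in-memory index (not Accumulo API).
class ShardQuerySketch {
  // shards: shard id -> term -> document ids. Returns documents containing every term.
  static List<String> query(Map<String, Map<String, Set<String>>> shards, List<String> terms) {
    List<String> matches = new ArrayList<>();
    if (terms.isEmpty()) {
      return matches;
    }
    for (Map<String, Set<String>> shard : shards.values()) {
      Set<String> candidates = null;
      for (String term : terms) {
        Set<String> docs = shard.getOrDefault(term, Collections.emptySet());
        if (candidates == null) {
          candidates = new TreeSet<>(docs); // start from the first term's documents
        } else {
          candidates.retainAll(docs); // keep only documents that also have this term
        }
        if (candidates.isEmpty()) {
          break; // no document in this shard can match
        }
      }
      matches.addAll(candidates);
    }
    return matches;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Set<String>>> shards = new TreeMap<>();
    shards.put("shard0", Map.of("foo", Set.of("A.java", "B.java"), "bar", Set.of("B.java")));
    shards.put("shard1", Map.of("foo", Set.of("C.java"), "bar", Set.of("C.java", "D.java")));
    System.out.println(query(shards, List.of("foo", "bar")));
  }
}
```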
Before running ContinuousQuery, run Reverse to populate the `examples.doc2term` table.

    $ ./bin/runex shard.Reverse --shardTable examples.shard --doc2Term examples.doc2term

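The reversal step turns a term-to-documents index into a document-to-terms map. As a rough illustration, the sketch below does this over in-memory maps. It is not the code in Reverse.java and does not touch Accumulo tables. The class `ReverseSketch` and the nested-map layout (shard id, then term, then document ids) are assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inversion of a sharded term index (not Accumulo API).
class ReverseSketch {
  // shards: shard id -> term -> document ids. Returns document id -> terms.
  static Map<String, Set<String>> reverse(Map<String, Map<String, Set<String>>> shards) {
    Map<String, Set<String>> doc2term = new TreeMap<>();
    for (Map<String, Set<String>> shard : shards.values()) {
      for (Map.Entry<String, Set<String>> entry : shard.entrySet()) {
        String term = entry.getKey();
        for (String docId : entry.getValue()) {
          doc2term.computeIfAbsent(docId, k -> new TreeSet<>()).add(term);
        }
      }
    }
    return doc2term;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Set<String>>> shards =
        Map.of("shard0", Map.of("foo", Set.of("A.java", "B.java"), "bar", Set.of("B.java")));
    System.out.println(reverse(shards));
  }
}
```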
Below, ContinuousQuery is run with 5 terms. It selects 5 random terms from each document, then repeatedly picks one of
those 5-term sets at random and queries for it. For each query it prints the terms, the number of matching documents, and the time in seconds.

    $ ./bin/runex shard.ContinuousQuery --shardTable examples.shard --doc2Term examples.doc2term --terms 5
    [string, protected, sizeopt, cache, build] 1 0.084
    [public, these, exception, to, as] 25 0.267
    [by, encodeprevendrow, 0, work, as] 4 0.056
    [except, to, a, limitations, one] 969 0.197
    [copy, as, asf, version, is] 969 0.341
    [core, class, may, regarding, without] 862 0.437
    [max_data_to_print, default_visibility_cache_size, use, accumulo_export_info, fate] 1 0.066

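The select-then-query loop described above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the code in ContinuousQuery.java: the class `ContinuousQuerySketch`, the choice to skip documents with fewer than N terms, the bounded iteration count, and the `countMatches` stand-in for a real index query are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical sketch of picking random term sets and timing queries (not Accumulo API).
class ContinuousQuerySketch {
  // Pick numTerms random terms from each document's term set.
  static List<List<String>> selectTermSets(
      Map<String, Set<String>> doc2term, int numTerms, Random rand) {
    List<List<String>> termSets = new ArrayList<>();
    for (Set<String> terms : doc2term.values()) {
      if (terms.size() < numTerms) {
        continue; // assumption: documents with too few terms are skipped
      }
      List<String> shuffled = new ArrayList<>(terms);
      Collections.shuffle(shuffled, rand);
      termSets.add(new ArrayList<>(shuffled.subList(0, numTerms)));
    }
    return termSets;
  }

  // Repeatedly pick a random term set, run the query, and print terms, count, and seconds.
  static void run(List<List<String>> termSets, ToIntFunction<List<String>> countMatches,
      int iterations, Random rand) {
    if (termSets.isEmpty()) {
      return;
    }
    for (int i = 0; i < iterations; i++) {
      List<String> terms = termSets.get(rand.nextInt(termSets.size()));
      long start = System.nanoTime();
      int count = countMatches.applyAsInt(terms);
      double seconds = (System.nanoTime() - start) / 1e9;
      System.out.printf(Locale.ROOT, "%s %d %.3f%n", terms, count, seconds);
    }
  }

  public static void main(String[] args) {
    Map<String, Set<String>> doc2term = Map.of(
        "A.java", Set.of("public", "class", "foo", "bar"),
        "B.java", Set.of("public", "class", "foo"));
    Random rand = new Random();
    List<List<String>> termSets = selectTermSets(doc2term, 2, rand);
    // Stand-in query: count documents whose term set contains every queried term.
    ToIntFunction<List<String>> countMatches = terms ->
        (int) doc2term.values().stream().filter(t -> t.containsAll(terms)).count();
    run(termSets, countMatches, 5, rand);
  }
}
```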
[Index.java]: ../src/main/java/org/apache/accumulo/examples/shard/Index.java
[Query.java]: ../src/main/java/org/apache/accumulo/examples/shard/Query.java
[Reverse.java]: ../src/main/java/org/apache/accumulo/examples/shard/Reverse.java
[ContinuousQuery.java]: ../src/main/java/org/apache/accumulo/examples/shard/ContinuousQuery.java