<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Apache Accumulo Wikisearch
Wikisearch is an example Accumulo application that provides a flexible, scalable
search over Wikipedia articles.
## Installation
Follow the [install instructions][install] to run the example.
## Design
The example uses an indexing technique helpful for doing multiple logical tests
against content. In this case, we can perform a word search on Wikipedia
articles. The sample application takes advantage of 3 unique capabilities of
Accumulo:
1. Extensible iterators that operate within the distributed tablet servers of
the key-value store
1. Custom aggregators which can efficiently condense information during the
various life-cycles of the log-structured merge tree
1. Custom load balancing, which ensures that a table is evenly distributed on
all tablet servers
In the example, Accumulo tracks the cardinality of all terms as elements are
ingested. If the cardinality is small enough, it will track the set of
documents by term directly. For example:
| Row (word) | Value (count) | Value (document list) |
|------------|--------------:|:----------------------------|
| Octopus | 2 | [Document 57, Document 220] |
| Other | 172,849 | [] |
| Ostrich | 1 | [Document 901] |
Searches can be optimized to focus on low-cardinality terms. To create these
counts, the example installs "aggregators," which are used to combine inserted
values. The ingester just writes simple "(Octopus, 1, Document 57)" tuples.
The tablet servers then use the installed aggregators to merge the cells as
the data is rewritten or queried. This reduces the in-memory locking
required to update high-cardinality terms, and defers aggregation to a later
time, when it can be done more efficiently.
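In recent Accumulo versions, aggregators of this kind are written as Combiner
iterators. The following is a minimal sketch, assuming string-encoded counts;
the example's actual aggregator also merges the per-term document lists shown
above and uses its own value encoding.

```java
// A minimal sketch (not the example's actual aggregator) of an Accumulo
// Combiner that sums the partial counts written by the ingester.
import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

public class TermCountCombiner extends Combiner {
  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    long sum = 0;
    while (iter.hasNext()) {
      // Each inserted cell carries a small partial count, often just "1".
      sum += Long.parseLong(new String(iter.next().get()));
    }
    return new Value(Long.toString(sum).getBytes());
  }
}
```

Attached at all scopes (scan, minor, and major compaction), such a combiner
merges cells whenever the data is rewritten or read, as described above.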
The example also creates a reverse word index mapping each word to the
document in which it appears, but it does this by choosing an arbitrary
partition for each document. The article and the word index for the article
are grouped together in the same partition. For example:
| Row (partition) | Column Family | Column Qualifier | Value |
|-----------------|---------------|------------------|-----------------|
| 1 | D | Document 57 | "smart Octopus" |
| 1 | Word, Octopus | Document 57 | |
| 1 | Word, smart | Document 57 | |
| ... | | | |
| 2 | D | Document 220 | "big Octopus" |
| 2 | Word, big | Document 220 | |
| 2 | Word, Octopus | Document 220 | |
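To make the layout concrete, here is a schematic sketch of writing a document
and its word index entries into the same partition (row) with the Accumulo 2.x
client API. The table name ("wiki"), the family encoding ("d" for documents,
"word|term" for index entries), and the whitespace tokenization are all
illustrative assumptions, not the Wikisearch ingester's exact schema.

```java
import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class PartitionedIngestSketch {
  static void writeDocument(AccumuloClient client, String partition,
      String docId, String text) throws Exception {
    try (BatchWriter writer = client.createBatchWriter("wiki")) {
      // One mutation per row: the partition holds both the document
      // and its word index, so they land on the same tablet.
      Mutation m = new Mutation(partition);
      m.put("d", docId, new Value(text.getBytes()));
      for (String word : text.toLowerCase().split("\\s+")) {
        m.put("word|" + word, docId, new Value(new byte[0]));
      }
      writer.addMutation(m);
    }
  }
}
```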
Of course, there would be large numbers of documents in each partition, and the
elements of those documents would be interlaced according to their sort order.
By dividing the index space into partitions, the multi-word searches can be
performed in parallel across all the nodes. Also, by grouping the document
together with its index, a document can be retrieved without a second request
from the client. The query "octopus" and "big" will be performed on all the
servers, but only those partitions for which the low-cardinality term "octopus"
can be found by using the aggregated reverse index information. The query for a
document is performed by extensions provided in the example. These extensions
become part of the tablet server's iterator stack. By cloning the underlying
iterators, the query extensions can seek to specific words within the index,
and when it finds a matching document, it can then seek to the document
location and retrieve the contents.
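Accumulo also ships a built-in IntersectingIterator that performs this style
of server-side term intersection over a sharded index. The Wikisearch example
provides its own, richer query iterators, but a minimal sketch of the built-in
one (table name and term encoding illustrative) looks like this:

```java
import java.util.Collections;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.AccumuloClient;
import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.user.IntersectingIterator;
import org.apache.hadoop.io.Text;

public class IntersectionQuerySketch {
  static void search(AccumuloClient client) throws Exception {
    try (BatchScanner scanner = client.createBatchScanner("wikiIndex")) {
      // Terms are encoded as column families; the row is the partition.
      IteratorSetting setting = new IteratorSetting(20, "ii", IntersectingIterator.class);
      IntersectingIterator.setColumnFamilies(setting,
          new Text[] {new Text("octopus"), new Text("big")});
      scanner.addScanIterator(setting);
      scanner.setRanges(Collections.singleton(new Range())); // all partitions
      // Matching document IDs come back in the column qualifier.
      for (Entry<Key,Value> entry : scanner) {
        System.out.println("match: " + entry.getKey().getColumnQualifier());
      }
    }
  }
}
```

Because the intersection happens inside the tablet servers' iterator stacks,
only matching partitions and documents travel back to the client.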
## Performance
The Wikisearch example was run on a cluster of 10 servers, each with 12 cores,
32 GB of RAM, and six 500 GB drives. Accumulo tablet servers were allowed a
maximum of 3 GB of working memory, of which 2 GB was dedicated to caching file
data.
Following the instructions in the example, the Wikipedia XML article data for
the English, Spanish, and German languages was loaded into 10 partitions. The
data is not partitioned by language; multiple languages were used simply to get
a larger set of test data. The data load took around 8 hours and was not
optimized for scale. Once the data was loaded, the content was compacted, which
took about 35 minutes.
The example uses the language-specific tokenizers for Wikipedia data that are
available from the Apache Lucene project.
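As a rough illustration, here is a minimal sketch of language-specific
tokenization with Lucene (lucene-analyzers-common, Lucene 5+ no-arg analyzer
constructors). The analyzer choice and field name are assumptions; the example
wires up its own tokenizer chain for Wikipedia markup.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeSketch {
  static void printTokens(String text) throws Exception {
    try (Analyzer analyzer = new GermanAnalyzer()) {
      TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();
      while (stream.incrementToken()) {
        System.out.println(term.toString());
      }
      stream.end();
      stream.close();
    }
  }
}
```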
Original files:
| Articles | Compressed size | Filename |
|----------|-----------------|----------------------------------------|
| 1.3M | 2.5G | dewiki-20111120-pages-articles.xml.bz2 |
| 3.8M | 7.9G | enwiki-20111115-pages-articles.xml.bz2 |
| 0.8M | 1.4G | eswiki-20111112-pages-articles.xml.bz2 |
The resulting tables:
```
$ du -p wiki.*
    47,325,680,634 [wiki]
     5,125,169,305 [wikiIndex]
               413 [wikiMetadata]
     5,521,690,682 [wikiReverseIndex]
```
This is roughly a 6:1 increase in size over the compressed input files.
We performed the following queries, and repeated the set 5 times. The query
language is much more expressive than what is shown below; the actual queries
specified that these words were to be found in the body of the article. Regular
expressions, searches within titles, negative tests, etc. are also available.
| Query | Sample 1 (seconds) | Sample 2 (seconds) | Sample 3 (seconds) | Sample 4 (seconds) | Sample 5 (seconds) | Matches | Result Size |
|-----------------------------------------|------|------|------|------|------|--------|-----------|
| "old" and "man" and "sea" | 4.07 | 3.79 | 3.65 | 3.85 | 3.67 | 22,956 | 3,830,102 |
| "paris" and "in" and "the" and "spring" | 3.06 | 3.06 | 2.78 | 3.02 | 2.92 | 10,755 | 1,757,293 |
| "rubber" and "ducky" and "ernie" | 0.08 | 0.08 | 0.1 | 0.11 | 0.1 | 6 | 808 |
| "fast" and ( "furious" or "furriest") | 1.34 | 1.33 | 1.3 | 1.31 | 1.31 | 2,973 | 493,800 |
| "slashdot" and "grok" | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 14 | 2,371 |
| "three" and "little" and "pigs" | 0.92 | 0.91 | 0.9 | 1.08 | 0.88 | 2,742 | 481,531 |
Because the terms are tested together within the tablet server, even fairly
high-cardinality terms such as "old," "man," and "sea" can be tested
efficiently, without needing to return to the client, or make distributed calls
between servers to perform the intersection between terms.
For reference, here are the cardinalities for all the terms in the query
(remember, this is across all languages loaded):
| Term | Cardinality |
|----------|-------------|
| ducky | 795 |
| ernie | 13,433 |
| fast | 166,813 |
| furious | 10,535 |
| furriest | 45 |
| grok | 1,168 |
| in | 1,884,638 |
| little | 320,748 |
| man | 548,238 |
| old | 720,795 |
| paris | 232,464 |
| pigs | 8,356 |
| rubber | 17,235 |
| sea | 247,231 |
| slashdot | 2,343 |
| spring | 125,605 |
| the | 3,509,498 |
| three | 718,810 |
Accumulo supports caching index blocks, which is turned on by default, and
caching the non-index (data) blocks of a file, which is not. After turning on
data block caching for the wiki table:
| Query | Sample 1 (seconds) | Sample 2 (seconds) | Sample 3 (seconds) | Sample 4 (seconds) | Sample 5 (seconds) |
|-----------------------------------------|------|------|------|------|------|
| "old" and "man" and "sea" | 2.47 | 2.48 | 2.51 | 2.48 | 2.49 |
| "paris" and "in" and "the" and "spring" | 1.33 | 1.42 | 1.6 | 1.61 | 1.47 |
| "rubber" and "ducky" and "ernie" | 0.07 | 0.08 | 0.07 | 0.07 | 0.07 |
| "fast" and ( "furious" or "furriest") | 1.28 | 0.78 | 0.77 | 0.79 | 0.78 |
| "slashdot" and "grok" | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 |
| "three" and "little" and "pigs" | 0.55 | 0.32 | 0.32 | 0.31 | 0.27 |
For comparison, these are the cold-start lookup times (after restarting
Accumulo and dropping the operating system disk cache):
| Query | Sample (seconds) |
|-----------------------------------------|--------|
| "old" and "man" and "sea" | 13.92 |
| "paris" and "in" and "the" and "spring" | 8.46 |
| "rubber" and "ducky" and "ernie" | 2.96 |
| "fast" and ( "furious" or "furriest") | 6.77 |
| "slashdot" and "grok" | 4.06 |
| "three" and "little" and "pigs" | 8.13 |
### Random Query Load
Random queries were generated using common English words. A uniform random
sample of 3 to 5 words, taken from the 10,000 most common words in Project
Gutenberg's online text collection, was joined with "and". Words containing
anything other than letters (such as contractions) were not used. A client was
started simultaneously on each of the 10 servers, and each ran 100 random
queries (1,000 queries total).
| Time (seconds) | Count (results) |
|-------|---------|
| 41.97 | 440,743 |
| 41.61 | 320,522 |
| 42.11 | 347,969 |
| 38.32 | 275,655 |
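A minimal sketch of the query generation described above; loading and filtering
the common-word list to letters-only words is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomQuerySketch {
  // Sample 3 to 5 words uniformly from the common-word list and
  // join them with "and", mirroring the test setup described above.
  static String randomQuery(List<String> commonWords, Random rand) {
    int terms = 3 + rand.nextInt(3); // uniform over 3, 4, 5
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < terms; i++) {
      parts.add("\"" + commonWords.get(rand.nextInt(commonWords.size())) + "\"");
    }
    return String.join(" and ", parts);
  }
}
```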
### Query Load During Ingest
The English Wikipedia data was re-ingested on top of the existing, compacted
data. The following query samples were taken at 5-minute intervals while
ingesting 132 articles/second:
| Query | Sample 1 (seconds) | Sample 2 (seconds) | Sample 3 (seconds) | Sample 4 (seconds) | Sample 5 (seconds) |
|-----------------------------------------|------|------|-------|------|-------|
| "old" and "man" and "sea" | 4.91 | 3.92 | 11.58 | 9.86 | 10.21 |
| "paris" and "in" and "the" and "spring" | 5.03 | 3.37 | 12.22 | 3.29 | 9.46 |
| "rubber" and "ducky" and "ernie" | 4.21 | 2.04 | 8.57 | 1.54 | 1.68 |
| "fast" and ( "furious" or "furriest") | 5.84 | 2.83 | 2.56 | 3.12 | 3.09 |
| "slashdot" and "grok" | 5.68 | 2.62 | 2.2 | 2.78 | 2.8 |
| "three" and "little" and "pigs" | 7.82 | 3.42 | 2.79 | 3.29 | 3.3 |
[install]: INSTALL.md