docs/src/main/asciidoc/chapters/iterator_design.txt - accumulo - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one or more
 // contributor license agreements. See the NOTICE file distributed with
 // this work for additional information regarding copyright ownership.
 // The ASF licenses this file to You under the Apache License, Version 2.0
 // (the "License"); you may not use this file except in compliance with
 // the License. You may obtain a copy of the License at
 //
 //     http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing, software
 // distributed under the License is distributed on an "AS IS" BASIS,
 // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 // See the License for the specific language governing permissions and
 // limitations under the License.

 == Iterator Design

 Accumulo SortedKeyValueIterators, commonly referred to as Iterators for short, are server-side programming constructs
 that allow users to implement custom retrieval or computational purpose within Accumulo TabletServers.  The name rightly
 brings forward similarities to the Java Iterator interface; however, Accumulo Iterators are more complex than Java
 Iterators. Notably, in addition to the expected methods to retrieve the current element and advance to the next element
 in the iteration, Accumulo Iterators must also support the ability to "move" (`seek`) to an specified point in the
 iteration (the Accumulo table). Accumulo Iterators are designed to be concatenated together, similar to applying a
 series of transformations to a list of elements. Accumulo Iterators can duplicate their underlying source to create
 multiple "pointers" over the same underlying data (which is extremely powerful since each stream is sorted) or they can
 merge multiple Iterators into a single view. In this sense, a collection of Iterators operating in tandem is close to
 a tree-structure than a list, but there is always a sense of a flow of Key-Value pairs through some Iterators. Iterators
 are not designed to act as triggers nor are they designed to operate outside of the purview of a single table.

 Understanding how TabletServers invoke the methods on a SortedKeyValueIterator can be obtuse as the actual code is
 buried within the implementation of the TabletServer; however, it is generally unnecessary to have a strong
 understanding of this as the interface provides clear definitions about what each action each method should take. This
 chapter aims to provide a more detailed description of how Iterators are invoked, some best practices and some common
 pitfalls.

 === Instantiation

 To invoke an Accumulo Iterator inside of the TabletServer, the Iterator class must be on the classpath of every
 TabletServer. For production environments, it is common to place a JAR file which contains the Iterator in
 `$ACCUMULO_HOME/lib`.  In development environments, it is convenient to instead place the JAR file in
 `$ACCUMULO_HOME/lib/ext` as JAR files in this directory are dynamically reloaded by the TabletServers alleviating the
 need to restart Accumulo while testing an Iterator. Advanced classloader features which enable other types of
 filesystems and per-table classpath configurations (as opposed to process-wide classpaths). These features
 are not covered here, but elsewhere in the user manual.

 Accumulo references the Iterator class by name and uses Java reflection to instantiate the Iterator. This means that
 Iterators must have a public no-args constructor.

 === Interface

 A normal implementation of the SortedKeyValueIterator defines functionality for the following methods:

 [source,java]
 ----
 void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException;

 boolean hasTop();

 void next() throws IOException;

 void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException;

 Key getTopKey();

 Value getTopValue();

 SortedKeyValueIterator<Key,Value> deepCopy(IteratorEnvironment env);
 ----

 ==== `init`

 The `init` method is called by the TabletServer after it constructs an instance of the Iterator.  This method should
 clear/reset any internal state in the Iterator and prepare it to process data.  The first argument, the `source`, is the
 Iterator "below" this Iterator (where the client is at "top" and the Iterator for files in HDFS are at the "bottom").
 The "source" Iterator provides the Key-Value pairs which this Iterator will operate upon.

 The second argument, a Map of options, is made up of options provided by the user, options set in the table's
 configuration, and/or options set in the containing namespace's configuration.
 These options allow for Iterators to dynamically configure themselves on the fly. If no options are used in the current context
 (a Scan or Compaction), the Map will be empty. An example of a configuration item for an Iterator could be a pattern used to filter
 Key-Value pairs in a regular expression Iterator.

 The third argument, the `IteratorEnvironment`, is a special object which provides information to this Iterator about the
 context in which it was invoked. Commonly, this information is not necessary to inspect. For example, if an Iterator
 knows that it is running in the context of a full-major compaction (reading all of the data) as opposed to a user scan
 (which may strongly limit the number of columns), the Iterator might make different algorithmic decisions in an attempt to
 optimize itself.

 ==== `seek`

 The `seek` method is likely the most confusing method on the Iterator interface. The purpose of this method is to
 advance the stream of Key-Value pairs to a certain point in the iteration (the Accumulo table). It is common that before
 the implementation of this method returns some additional processing is performed which may further advance the current
 position past the `startKey` of the `Range`. This, however, is dependent on the functionality the iterator provides. For
 example, a filtering iterator would consume a number Key-Value pairs which do not meets its criteria before `seek`
 returns. The important condition for `seek` to meet is that this Iterator should be ready to return the first Key-Value
 pair, or none if no such pair is available, when the method returns. The Key-Value pair would be returned by `getTopKey`
 and `getTopValue`, respectively, and `hasTop` should return a boolean denoting whether or not there is
 a Key-Value pair to return.

 The arguments passed to seek are as follows:

 The TabletServer first provides a `Range`, an object which defines some collection of Accumulo `Key`s, which defines the
 Key-Value pairs that this Iterator should return. Each `Range` has a `startKey` and `endKey` with an inclusive flag for
 both. While this Range is often similar to the Range(s) set by the client on a Scanner or BatchScanner, it is not
 guaranteed to be a Range that the client set. Accumulo will split up larger ranges and group them together based on
 Tablet boundaries per TabletServer. Iterators should not attempt to implement any custom logic based on the Range(s)
 provided to `seek` and Iterators should not return any Keys that fall outside of the provided Range.

 The second argument, a `Collection<ByteSequence>`, is the set of column families which should be retained or
 excluded by this Iterator. The third argument, a boolean, defines whether the collection of column families
 should be treated as an inclusion collection (true) or an exclusion collection (false).

 It is likely that all implementations of `seek` will first make a call to the `seek` method on the
 "source" Iterator that was provided in the `init` method. The collection of column families and
 the boolean `include` argument should be passed down as well as the `Range`. Somewhat commonly, the Iterator will
 also implement some sort of additional logic to find or compute the first Key-Value pair in the provided
 Range. For example, a regular expression Iterator would consume all records which do not match the given
 pattern before returning from `seek`.

 It is important to retain the original Range passed to this method to know when this Iterator should stop
 reading more Key-Value pairs. Ignoring this typically does not affect scans from a Scanner, but it
 will result in duplicate keys emitting from a BatchScan if the scanned table has more than one tablet.
 Best practice is to never emit entries outside the seek range.

 ==== `next`

 The `next` method is analogous to the `next` method on a Java Iterator: this method should advance
 the Iterator to the next Key-Value pair. For implementations that perform some filtering or complex
 logic, this may result in more than one Key-Value pair being inspected. This method alters
 some internal state that is exposed via the `hasTop`, `getTopKey`, and `getTopValue` methods.

 The result of this method is commonly caching a Key-Value pair which `getTopKey` and `getTopValue`
 can later return. While there is another Key-Value pair to return, `hasTop` should return true.
 If there are no more Key-Value pairs to return from this Iterator since the last call to
 `seek`, `hasTop` should return false.

 ==== `hasTop`

 The `hasTop` method is similar to the `hasNext` method on a Java Iterator in that it informs
 the caller if there is a Key-Value pair to be returned. If there is no pair to return, this method
 should return false. Like a Java Iterator, multiple calls to `hasTop` (without calling `next`) should not
 alter the internal state of the Iterator.

 ==== `getTopKey` and `getTopValue`

 These methods simply return the current Key-Value pair for this iterator. If `hasTop` returns true,
 both of these methods should return non-null objects. If `hasTop` returns false, it is undefined
 what these methods should return. Like `hasTop`, multiple calls to these methods should not alter
 the state of the Iterator.

 Users should take caution when either

 1. caching the Key/Value from `getTopKey`/`getTopValue`, for use after calling `next` on the source iterator.
 In this case, the cached Key/Value object is aliased to the reference returned by the source iterator.
 Iterators may reuse the same Key/Value object in a `next` call for performance reasons, changing the data
 that the cached Key/Value object references and resulting in a logic bug.
 2. modifying the Key/Value from `getTopKey`/`getTopValue`. If the source iterator reuses data stored in the Key/Value,
 then the source iterator may use the modified data that the Key/Value references. This may/may not result in a logic bug.

 In both cases, copying the Key/Value's data into a new object ensures iterator correctness. If neither case applies,
 it is safe to not copy the Key/Value.  The general guideline is to be aware of who else may use Key/Value objects
 returned from `getTopKey`/`getTopValue`.

 ==== `deepCopy`

 The `deepCopy` method is similar to the `clone` method from the Java `Cloneable` interface.
 Implementations of this method should return a new object of the same type as the Accumulo Iterator
 instance it was called on. Any internal state from the instance `deepCopy` was called
 on should be carried over to the returned copy. The returned copy should be ready to have
 `seek` called on it. The SortedKeyValueIterator interface guarantees that `init` will be called on
 an iterator before `deepCopy` and that `init` will not be called on the iterator returned by
 `deepCopy`.

 Typically, implementations of `deepCopy` call a copy-constructor which will initialize
 internal data structures. As with `seek`, it is common for the `IteratorEnvironment`
 argument to be ignored as most Iterator implementations can be written without the explicit
 information the environment provides.

 In the analogy of a series of Iterators representing a tree, `deepCopy` can be thought of as
 early programming assignments which implement their own tree data structures. `deepCopy` calls
 copy on its sources (the children), copies itself, attaches the copies of the children, and
 then returns itself.

 === TabletServer invocation of Iterators

 The following code is a general outline for how TabletServers invoke Iterators.

 [source,java]
 ----
  List<KeyValue> batch;
  Range range = getRangeFromClient();
  while(!overSizeLimit(batch)){
    SortedKeyValueIterator source = getSystemIterator();

    for(String clzName : getUserIterators()){
     Class<?> clz = Class.forName(clzName);
     SortedKeyValueIterator iter = (SortedKeyValueIterator) clz.newInstance();
     iter.init(source, opts, env);
     source = iter;
    }

    // read a batch of data to return to client
    // the last iterator, the "top"
    SortedKeyValueIterator topIter = source;
    topIter.seek(getRangeFromUser(), ...)

    while(topIter.hasTop() && !overSizeLimit(batch)){
      key = topIter.getTopKey()
      val = topIter.getTopValue()
      batch.add(new KeyValue(key, val)
      if(systemDataSourcesChanged()){
        // code does not show isolation case, which will
        // keep using same data sources until a row boundry is hit
        range = new Range(key, false, range.endKey(), range.endKeyInclusive());
        break;
      }
    }
  }
  //return batch of key values to client
 ----

 Additionally, the obtuse "re-seek" case can be outlined as the following:

 [source,java]
 ----
   // Given the above
   List<KeyValue> batch = getNextBatch();

   // Store off lastKeyReturned for this client
   lastKeyReturned = batch.get(batch.size() - 1).getKey();

   // thread goes away (client stops asking for the next batch).

   // Eventually client comes back
   // Setup as before...

   Range userRange = getRangeFromUser();
   Range actualRange = new Range(lastKeyReturned, false
       userRange.getEndKey(), userRange.isEndKeyInclusive());

   // Use the actualRange, not the user provided one
   topIter.seek(actualRange);
 ----


 === Isolation

 Accumulo provides a feature which clients can enable to prevent the viewing of partially
 applied mutations within the context of rows. If a client is submitting multiple column
 updates to rows at a time, isolation would ensure that a client would either see all of
 updates made to that row or none of the updates (until they are all applied).

 When using Isolation, there are additional concerns in iterator design. A scan time iterator in accumulo
 reads from a set of data sources. While an iterator is reading data it has an isolated view. However, after it returns a
 key/value it is possible that accumulo may switch data sources and re-seek the iterator. This is done so that resources
 may be reclaimed. When the user does not request isolation this can occur after any key is returned. When a user enables
 Isolation, this will only occur after a new row is returned, in which case it will re-seek to the very beginning of the
 next possible row.

 === Abstract Iterators

 A number of Abstract implementations of Iterators are provided to allow for faster creation
 of common patterns. The most commonly used abstract implementations are the `Filter` and
 `Combiner` classes. When possible these classes should be used instead as they have been
 thoroughly tested inside Accumulo itself.

 ==== Filter

 The `Filter` abstract Iterator provides a very simple implementation which allows implementations
 to define whether or not a Key-Value pair should be returned via an `accept(Key, Value)` method.

 Filters are extremely simple to implement; however, when the implementation is filtering a
 large percentage of Key-Value pairs with respect to the total number of pairs examined,
 it can be very inefficient. For example, if a Filter implementation can determine after examining
 part of the row that no other pairs in this row will be accepted, there is no mechanism to
 efficiently skip the remaining Key-Value pairs. Concretely, take a row which is comprised of
 1000 Key-Value pairs. After examining the first 10 Key-Value pairs, it is determined
 that no other Key-Value pairs in this row will be accepted. The Filter must still examine each
 remaining 990 Key-Value pairs in this row. Another way to express this deficiency is that
 Filters have no means to leverage the `seek` method to efficiently skip large portions
 of Key-Value pairs.

 As such, the `Filter` class functions well for filtering small amounts of data, but is
 inefficient for filtering large amounts of data. The decision to use a `Filter` strongly
 depends on the use case and distribution of data being filtered.

 ==== Combiner

 The `Combiner` class is another common abstract Iterator. Similar to the `Combiner` interface
 define in Hadoop's MapReduce framework, implementations of this abstract class reduce
 multiple Values for different versions of a Key (Keys which only differ by timestamps) into one Key-Value pair.
 Combiners provide a simple way to implement common operations like summation and
 aggregation without the need to implement the entire Accumulo Iterator interface.

 One important consideration when choosing to design a Combiner is that the "reduction" operation
 is often best represented when it is associative and commutative. Operations which do not meet
 these criteria can be implemented; however, the implementation can be difficult.

 A second consideration is that a Combiner is not guaranteed to see every Key-Value pair
 which differ only by timestamp every time it is invoked. For example, if there are 5 Key-Value
 pairs in a table which only differ by the timestamps 1, 2, 3, 4, and 5, it is not guaranteed that
 every invocation of the Combiner will see 5 timestamps. One invocation might see the Values for
 Keys with timestamp 1 and 4, while another invocation might see the Values for Keys with the
 timestamps 1, 2, 4 and 5.

 Finally, when configuring an Accumulo table to use a Combiner, be sure to disable the Versioning Iterator or set the
 Combiner at a priority less than the Combiner (the Versioning Iterator is added at a priority of 20 by default). The
 Versioning Iterator will filter out multiple Key-Value pairs that differ only by timestamp and return only the Key-Value
 pair that has the largest timestamp.

 === Best practices

 Because of the flexibility that the `SortedKeyValueInterface` provides, it doesn't directly disallow
 many implementations which are poor design decisions. The following are some common recommendations to
 follow and pitfalls to avoid in Iterator implementations.

 ==== Avoid special logic encoded in Ranges

 Commonly, granular Ranges that a client passes to an Iterator from a `Scanner` or `BatchScanner` are unmodified.
 If a `Range` falls within the boundaries of a Tablet, an Iterator will often see that same Range in the
 `seek` method. However, there is no guarantee that the `Range` will remain unaltered from client to server. As such, Iterators
 should *never* make assumptions about the current state/context based on the `Range`.

 The common failure condition is referred to as a "re-seek". In the context of a Scan, TabletServers construct the
 "stack" of Iterators and batch up Key-Value pairs to send back to the client. When a sufficient number of Key-Value
 pairs are collected, it is common for the Iterators to be "torn down" until the client asks for the next batch of
 Key-Value pairs. This is done by the TabletServer to add fairness in ensuring one Scan does not monopolize the available
 resources. When the client asks for the next batch, the implementation modifies the original Range so that servers know
 the point to resume the iteration (to avoid returning duplicate Key-Value pairs). Specifically, the new Range is created
 from the original but is shortened by setting the startKey of the original Range to the Key last returned by the Scan,
 non-inclusive.

 ==== `seek`'ing backwards

 The ability for an Iterator to "skip over" large blocks of Key-Value pairs is a major tenet behind Iterators.
 By `seek`'ing when it is known that there is a collection of Key-Value pairs which can be ignored can
 greatly increase the speed of a scan as many Key-Value pairs do not have to be deserialized and processed.

 While the `seek` method provides the `Range` that should be used to `seek` the underlying source Iterator,
 there is no guarantee that the implementing Iterator uses that `Range` to perform the `seek` on its
 "source" Iterator. As such, it is possible to seek to any `Range` and the interface has no assertions
 to prevent this from happening.

 Since Iterators are allowed to `seek` to arbitrary Keys, it also allows Iterators to create infinite loops
 inside Scans that will repeatedly read the same data without end. If an arbitrary Range is constructed, it should
 construct a completely new Range as it allows for bugs to be introduced which will break Accumulo.

 Thus, `seek`'s should always be thought of as making "forward progress" in the view of the total iteration. The
 `startKey` of a `Range` should always be greater than the current Key seen by the Iterator while the `endKey` of the
 `Range` should always retain the original `endKey` (and `endKey` inclusivity) of the last `Range` seen by your
 Iterator's implementation of seek.

 ==== Take caution in constructing new data in an Iterator

 Implementations of Iterator might be tempted to open BatchWriters inside of an Iterator as a means
 to implement triggers for writing additional data outside of their client application. The lifecycle of an Iterator
 is *not* managed in such a way that guarantees that this is safe nor efficient. Specifically, there
 is no way to guarantee that the internal ThreadPool inside of the BatchWriter is closed (and the thread(s)
 are reaped) without calling the close() method. `close`'ing and recreating a `BatchWriter` after every
 Key-Value pair is also prohibitively performance limiting to be considered an option.

 The only safe way to generate additional data in an Iterator is to alter the current Key-Value pair.
 For example, the `WholeRowIterator` serializes the all of the Key-Values pairs that fall within each
 row. A safe way to generate more data in an Iterator would be to construct an Iterator that is
 "higher" (at a larger priority) than the `WholeRowIterator`, that is, the Iterator receives the Key-Value pairs which are
 a serialization of many Key-Value pairs. The custom Iterator could deserialize the pairs, compute
 some function, and add a new Key-Value pair to the original collection, re-serializing the collection
 of Key-Value pairs back into a single Key-Value pair.

 Any other situation is likely not guaranteed to ensure that the caller (a Scan or a Compaction) will
 always see all intended data that is generated.

 === Final things to remember

 Some simple recommendations/points to keep in mind:

 ==== Method call order

 On an instance of an Iterator: `init` is always called before `seek`, `seek` is always called before `hasTop`,
 `getTopKey` and `getTopValue` will not be called if `hasTop` returns false.

 ==== Teardown

 As mentioned, instance of Iterators may be torn down inside of the server transparently. When a complex
 collection of iterators is performing some advanced functionality, they will not be torn down until a Key-Value
 pair is returned out of the "stack" of Iterators (and added into the batch of Key-Values to be returned
 to the caller). Being torn-down is equivalent to a new instance of the Iterator being creating and `deepCopy`
 being called on the new instance with the old instance provided as the argument to `deepCopy`. References
 to the old instance are removed and the object is lazily garbage collected by the JVM.

 === Compaction-time Iterators

 When Iterators are configured to run during compactions, at the `minc` or `majc` scope, these Iterators sometimes need
 to make different assertions than those who only operate at scan time. Iterators won't see the delete entries; however,
 Iterators will not necessarily see all of the Key-Value pairs in ever invocation. Because compactions often do not rewrite
 all files (only a subset of them), it is possible that the logic take this into consideration.

 For example, a Combiner that runs over data at during compactions, might not see all of the values for a given Key. The
 Combiner must recognize this and not perform any function that would be incorrect due
 to the missing values.