blob: bb580e544a45fb09808d68a08609b696a55e103d [file] [log] [blame]
Frequent Items
--------------
This sketch is useful for tracking approximate frequencies of items of type `<T>` with optional associated counts `(<T> item, int count)`
that are members of a multiset of such items.
The true frequency of an item is defined to be the sum of associated counts.
This implementation provides the following capabilities:
* Estimate the *frequency* of an item.
* Return *upper* and *lower bounds* of any item, such that the true frequency is always between the upper and lower bounds.
* Return a global *maximum error* that holds for all items in the stream.
* Return an array of frequent items that qualify either a *NO_FALSE_POSITIVES* or a *NO_FALSE_NEGATIVES* error type.
* *Merge* itself with another sketch object created from this class.
* *Serialize/Deserialize* to/from a byte array.
**Space Usage**
The sketch is initialized with a maximum map size, `maxMapSize`, that specifies the maximum physical length of the internal hash map of the form `(<T> item, int count)`.
The maximum map size is always a power of 2, defined through the variables `lg_max_map_size`.
The hash map starts at a very small size (8 entries) and grows as needed up to the specified maximum map size.
Excluding external space required for the item objects, the internal memory space usage of this sketch is `18 * mapSize bytes` (assuming 8 bytes for each reference),
plus a small constant number of additional bytes.
The internal memory space usage of this sketch will never exceed `18 * maxMapSize` bytes, plus a small constant number of additional bytes.
**Maximum Capacity of the Sketch**
The `LOAD_FACTOR` for the hash map is internally set at :math:`75\%`, which means at any time the map capacity of `(item, count)` pairs is `mapCap = 0.75 * mapSize`.
The maximum capacity of `(item, count)`` pairs of the sketch is `maxMapCap = 0.75 * maxMapSize`.
**Updating the sketch with `(item, count)` pairs**
If the item is found in the hash map, the mapped count field (the "counter") is incremented by the incoming count; otherwise, a new counter `"(item, count) pair"` is created.
If the number of tracked counters reaches the maximum capacity of the hash map, the sketch decrements all of the counters (by an approximately computed median)
and removes any non-positive counters.
**Accuracy**
If fewer than `0.75 * maxMapSize` different items are inserted into the sketch, the estimated frequencies returned by the sketch will be exact.
The logic of the frequent items sketch is such that the stored counts and true counts are never too different.
More specifically, for any item, the sketch can return an estimate of the true frequency of item, along with upper and lower bounds on the frequency (that hold deterministically).
For this implementation and for a specific active item, it is guaranteed that the true frequency will be between the Upper Bound (UB) and the Lower Bound (LB) computed for that item.
Specifically, `(UB- LB) W * epsilon`, where :math:`W` denotes the sum of all item counts, and :math:`epsilon = 3.5/M`, where :math:`epsilon = M` is the maxMapSize.
This is a worst-case guarantee that applies to arbitrary inputs.
For inputs typically seen in practice, `(UB-LB)` is usually much smaller.
**Background**
This code implements a variant of what is commonly known as the "Misra-Gries algorithm".
Variants of it were discovered and rediscovered and redesigned several times over the years:
* *Finding repeated elements*, Misra, Gries, 1982
* *Frequency estimation of Internet packet streams with limited space* Demaine, Lopez-Ortiz, Munro, 2002
* *A simple algorithm for finding frequent elements in streams and bags* Karp, Shenker, Papadimitriou, 2003
* *Efficient Computation of Frequent and Top-k Elements in Data Streams* Metwally, Agrawal, Abbadi, 2006
For speed, we do employ some randomization that introduces a small probability that our proof of the worst-case bound might not apply to a given run.
However, we have ensured that this probability is extremely small.
For example, if the stream causes one table purge (rebuild), our proof of the worst-case bound applies with a probability of at least `1 - 1E-14`.
If the stream causes `1E9` purges, our proof applies with a probability of at least `1 - 1E-5`.
Parameter: <T> The type of item to be tracked by this sketch
.. autoclass:: _datasketches.frequent_items_sketch
:members:
:undoc-members:
:exclude-members: deserialize, get_epsilon_for_lg_size, get_apriori_error
:member-order: groupwise
.. rubric:: Static Methods:
.. automethod:: deserialize
.. automethod:: get_epsilon_for_lg_size
.. automethod:: get_apriori_error
.. rubric:: Non-static Methods: