blob: ab7f0c465bfdd2ce385418227ae78810cf7c0c93 [file] [log] [blame]
KLL Sketch
----------
.. currentmodule:: datasketches
Implementation of a very compact quantiles sketch with lazy compaction scheme
and nearly optimal accuracy per retained item.
See `Optimal Quantile Approximation in Streams`.
This is a stochastic streaming sketch that enables near real-time analysis of the
approximate distribution of items from a very large stream in a single pass, requiring only
that the items are comparable.
The analysis is obtained using `get_quantile()` function or the
inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
(Cumulative Distribution Function).
Given an input stream of `N` items, the `natural rank` of any specific
item is defined as its index `(1 to N)` in inclusive mode
or `(0 to N-1)` in exclusive mode
in the hypothetical sorted stream of all `N` input items.
The `normalized rank` (`rank`) of any specific item is defined as its
`natural rank` divided by `N`.
Thus, the `normalized rank` is between zero and one.
In the documentation for this sketch `natural rank` is never used so any
reference to just `rank` should be interpreted to mean `normalized rank`.
This sketch is configured with a parameter `k`, which affects the size of the sketch
and its estimation error.
The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
between zero and one. Larger values of `k` result in smaller values of `epsilon`.
Epsilon is always with respect to the rank and cannot be applied to the
corresponding items.
The relationship between the `normalized rank` and the corresponding items can be viewed
as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
corresponding items on the other axis. If the y-axis is specified as the item-axis and
the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
increasing function.
The function `get_quantile(rank)` translates ranks into
corresponding quantiles. The functions `get_rank(item)`,
`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
(Probability Mass Function) perform the opposite operation and translate items into ranks.
The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
on `k`) than the other queries because the mass of each "bin" of the PMF has
"double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
as the errors from the two edges can sometimes add.
The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
"double-sided" (PMF) `epsilon` of about 1.65%.
A `get_quantile(rank)` query has the following guarantees:
- Let `q = get_quantile(r)` where `r` is the rank between zero and one.
- The quantile `q` will be an item from the input stream.
- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
stream of all `N` items.
- Let `eps = get_normalized_rank_error(false)`.
- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
error is on the rank, not the quantile.
A `get_rank(item)` query has the following guarantees:
- Let `r = get_rank(i)` where `i` is an item between the min and max items of
the input stream.
- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
stream of all `N` items.
- Let `eps = get_normalized_rank_error(false)`.
- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.
A `get_pmf()` query has the following guarantees:
- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2` are
split points (items from the input domain) between the min and max items of
the input stream.
- Let `mass_i = estimated mass between s_i and s_i+1`.
- Let `true_mass` be the true mass between the items of `s_i`,
`s_i+1` derived from the hypothetical sorted stream of all `N` items.
- Let `eps = get_normalized_rank_error(true)`.
- then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
- `r(m+1)` includes the mass of all points larger than `s_m`.
A `get_cdf(...)` query has the following guarantees;
- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are
split points (items from the input domain) between the min and max items of
the input stream.
- Let `mass_i = r_(i+1) - r_i`.
- Let `true_mass` be the true mass between the true ranks of `s_i`,
`s_i+1` derived from the hypothetical sorted stream of all `N` items.
- Let `eps = get_normalized_rank_error(true)`.
- then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
- `1 - r(m+1)` includes the mass of all points larger than `s_m`.
From the above, it might seem like we could make some estimates to bound the
`item` returned from a call to `get_quantile()`. The sketch, however, does not
let us derive error bounds or confidences around items. Because errors are independent, we
can approximately bracket a value as shown below, but there are no error estimates available.
Additionally, the interval may be quite large for certain distributions.
- Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
- Let `eps = get_normalized_rank_error(false)`.
- Let `q_lo = estimated quantile of rank (r - eps)`.
- Let `q_hi = estimated quantile of rank (r + eps)`.
- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.
.. note::
For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``.
.. note::
Serializing and deserializing a :class:`kll_items_sketch` requires the use of a :class:`PyObjectSerDe`.
.. autoclass:: kll_ints_sketch
:members:
:undoc-members:
:exclude-members: deserialize, get_normalized_rank_error
.. rubric:: Static Methods:
.. automethod:: deserialize
.. automethod:: get_normalized_rank_error
.. rubric:: Non-static Methods:
.. automethod:: __init__
.. autoclass:: kll_floats_sketch
:members:
:undoc-members:
:exclude-members: deserialize, get_normalized_rank_error
.. rubric:: Static Methods:
.. automethod:: deserialize
.. automethod:: get_normalized_rank_error
.. rubric:: Non-static Methods:
.. automethod:: __init__
.. autoclass:: kll_doubles_sketch
:members:
:undoc-members:
:exclude-members: deserialize, get_normalized_rank_error
.. rubric:: Static Methods:
.. automethod:: deserialize
.. automethod:: get_normalized_rank_error
.. rubric:: Non-static Methods:
.. automethod:: __init__
.. autoclass:: kll_items_sketch
:members:
:undoc-members:
:exclude-members: deserialize, get_normalized_rank_error
.. rubric:: Static Methods:
.. automethod:: deserialize
.. automethod:: get_normalized_rank_error
.. rubric:: Non-static Methods:
.. automethod:: __init__