| KLL Sketch |
| ---------- |
| |
| .. currentmodule:: datasketches |
| |
| Implementation of a very compact quantiles sketch with lazy compaction scheme |
| and nearly optimal accuracy per retained item. |
| See `Optimal Quantile Approximation in Streams`. |
| |
| This is a stochastic streaming sketch that enables near real-time analysis of the |
| approximate distribution of items from a very large stream in a single pass, requiring only |
| that the items are comparable. |
| The analysis is obtained using `get_quantile()` function or the |
| inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()` |
| (Cumulative Distribution Function). |
| |
| Given an input stream of `N` items, the `natural rank` of any specific |
| item is defined as its index `(1 to N)` in inclusive mode |
| or `(0 to N-1)` in exclusive mode |
| in the hypothetical sorted stream of all `N` input items. |
| |
| The `normalized rank` (`rank`) of any specific item is defined as its |
| `natural rank` divided by `N`. |
| Thus, the `normalized rank` is between zero and one. |
| In the documentation for this sketch `natural rank` is never used so any |
| reference to just `rank` should be interpreted to mean `normalized rank`. |
| |
| This sketch is configured with a parameter `k`, which affects the size of the sketch |
| and its estimation error. |
| |
| The estimation error is commonly called `epsilon` (or `eps`) and is a fraction |
| between zero and one. Larger values of `k` result in smaller values of `epsilon`. |
| Epsilon is always with respect to the rank and cannot be applied to the |
| corresponding items. |
| |
| The relationship between the `normalized rank` and the corresponding items can be viewed |
| as a two-dimensional monotonic plot with the `normalized rank` on one axis and the |
| corresponding items on the other axis. If the y-axis is specified as the item-axis and |
| the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically |
| increasing function. |
| |
| The function `get_quantile(rank)` translates ranks into |
| corresponding quantiles. The functions `get_rank(item)`, |
| `get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)` |
| (Probability Mass Function) perform the opposite operation and translate items into ranks. |
| |
| The `get_pmf(...)` function has about 13 to 47% worse rank error (depending |
| on `k`) than the other queries because the mass of each "bin" of the PMF has |
| "double-sided" error from the upper and lower edges of the bin as a result of a subtraction, |
| as the errors from the two edges can sometimes add. |
| |
| The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a |
| "double-sided" (PMF) `epsilon` of about 1.65%. |
| |
| A `get_quantile(rank)` query has the following guarantees: |
| - Let `q = get_quantile(r)` where `r` is the rank between zero and one. |
| - The quantile `q` will be an item from the input stream. |
| - Let `true_rank` be the true rank of `q` derived from the hypothetical sorted |
| stream of all `N` items. |
| - Let `eps = get_normalized_rank_error(false)`. |
| - Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the |
| error is on the rank, not the quantile. |
| |
| A `get_rank(item)` query has the following guarantees: |
| - Let `r = get_rank(i)` where `i` is an item between the min and max items of |
| the input stream. |
| - Let `true_rank` be the true rank of `i` derived from the hypothetical sorted |
| stream of all `N` items. |
| - Let `eps = get_normalized_rank_error(false)`. |
| - Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. |
| |
| A `get_pmf()` query has the following guarantees: |
| - Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2` are |
| split points (items from the input domain) between the min and max items of |
| the input stream. |
| - Let `mass_i = estimated mass between s_i and s_i+1`. |
| - Let `true_mass` be the true mass between the items of `s_i`, |
| `s_i+1` derived from the hypothetical sorted stream of all `N` items. |
| - Let `eps = get_normalized_rank_error(true)`. |
| - then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%. |
| - `r(m+1)` includes the mass of all points larger than `s_m`. |
| |
| A `get_cdf(...)` query has the following guarantees; |
| - Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are |
| split points (items from the input domain) between the min and max items of |
| the input stream. |
| - Let `mass_i = r_(i+1) - r_i`. |
| - Let `true_mass` be the true mass between the true ranks of `s_i`, |
| `s_i+1` derived from the hypothetical sorted stream of all `N` items. |
| - Let `eps = get_normalized_rank_error(true)`. |
| - then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%. |
| - `1 - r(m+1)` includes the mass of all points larger than `s_m`. |
| |
| From the above, it might seem like we could make some estimates to bound the |
| `item` returned from a call to `get_quantile()`. The sketch, however, does not |
| let us derive error bounds or confidences around items. Because errors are independent, we |
| can approximately bracket a value as shown below, but there are no error estimates available. |
| Additionally, the interval may be quite large for certain distributions. |
| - Let `q = get_quantile(r)`, the estimated quantile of rank `r`. |
| - Let `eps = get_normalized_rank_error(false)`. |
| - Let `q_lo = estimated quantile of rank (r - eps)`. |
| - Let `q_hi = estimated quantile of rank (r + eps)`. |
| - Then `q_lo ≤ q ≤ q_hi`, with 99% confidence. |
| |
| .. note:: |
| For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``. |
| |
| .. note:: |
| Serializing and deserializing a :class:`kll_items_sketch` requires the use of a :class:`PyObjectSerDe`. |
| |
| |
| .. autoclass:: kll_ints_sketch |
| :members: |
| :undoc-members: |
| :exclude-members: deserialize, get_normalized_rank_error |
| |
| .. rubric:: Static Methods: |
| |
| .. automethod:: deserialize |
| .. automethod:: get_normalized_rank_error |
| |
| .. rubric:: Non-static Methods: |
| |
| .. automethod:: __init__ |
| |
| .. autoclass:: kll_floats_sketch |
| :members: |
| :undoc-members: |
| :exclude-members: deserialize, get_normalized_rank_error |
| |
| .. rubric:: Static Methods: |
| |
| .. automethod:: deserialize |
| .. automethod:: get_normalized_rank_error |
| |
| .. rubric:: Non-static Methods: |
| |
| .. automethod:: __init__ |
| |
| .. autoclass:: kll_doubles_sketch |
| :members: |
| :undoc-members: |
| :exclude-members: deserialize, get_normalized_rank_error |
| |
| .. rubric:: Static Methods: |
| |
| .. automethod:: deserialize |
| .. automethod:: get_normalized_rank_error |
| |
| .. rubric:: Non-static Methods: |
| |
| .. automethod:: __init__ |
| |
| .. autoclass:: kll_items_sketch |
| :members: |
| :undoc-members: |
| :exclude-members: deserialize, get_normalized_rank_error |
| |
| .. rubric:: Static Methods: |
| |
| .. automethod:: deserialize |
| .. automethod:: get_normalized_rank_error |
| |
| .. rubric:: Non-static Methods: |
| |
| .. automethod:: __init__ |