docs/source/quantiles/kll.rst - datasketches-python - Git at Google

 KLL Sketch
 ----------

 .. currentmodule:: datasketches

 Implementation of a very compact quantiles sketch with lazy compaction scheme
 and nearly optimal accuracy per retained item.
 See `Optimal Quantile Approximation in Streams`.

 This is a stochastic streaming sketch that enables near real-time analysis of the
 approximate distribution of items from a very large stream in a single pass, requiring only
 that the items are comparable.
 The analysis is obtained using `get_quantile()` function or the
 inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
 (Cumulative Distribution Function).

 Given an input stream of `N` items, the `natural rank` of any specific
 item is defined as its index `(1 to N)` in inclusive mode
 or `(0 to N-1)` in exclusive mode
 in the hypothetical sorted stream of all `N` input items.

 The `normalized rank` (`rank`) of any specific item is defined as its
 `natural rank` divided by `N`.
 Thus, the `normalized rank` is between zero and one.
 In the documentation for this sketch `natural rank` is never used so any
 reference to just `rank` should be interpreted to mean `normalized rank`.

 This sketch is configured with a parameter `k`, which affects the size of the sketch
 and its estimation error.

 The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
 between zero and one. Larger values of `k` result in smaller values of `epsilon`.
 Epsilon is always with respect to the rank and cannot be applied to the
 corresponding items.

 The relationship between the `normalized rank` and the corresponding items can be viewed
 as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
 corresponding items on the other axis. If the y-axis is specified as the item-axis and
 the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
 increasing function.

 The function `get_quantile(rank)` translates ranks into
 corresponding quantiles. The functions `get_rank(item)`,
 `get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
 (Probability Mass Function) perform the opposite operation and translate items into ranks.

 The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
 on `k`) than the other queries because the mass of each "bin" of the PMF has
 "double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
 as the errors from the two edges can sometimes add.

 The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
 "double-sided" (PMF) `epsilon` of about 1.65%.

 A `get_quantile(rank)` query has the following guarantees:
 - Let `q = get_quantile(r)` where `r` is the rank between zero and one.
 - The quantile `q` will be an item from the input stream.
 - Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
 stream of all `N` items.
 - Let `eps = get_normalized_rank_error(false)`.
 - Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
 error is on the rank, not the quantile.

 A `get_rank(item)` query has the following guarantees:
 - Let `r = get_rank(i)` where `i` is an item between the min and max items of
 the input stream.
 - Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
 stream of all `N` items.
 - Let `eps = get_normalized_rank_error(false)`.
 - Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.

 A `get_pmf()` query has the following guarantees:
 - Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2` are
 split points (items from the input domain) between the min and max items of
 the input stream.
 - Let `mass_i = estimated mass between s_i and s_i+1`.
 - Let `true_mass` be the true mass between the items of `s_i`,
 `s_i+1` derived from the hypothetical sorted stream of all `N` items.
 - Let `eps = get_normalized_rank_error(true)`.
 - then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
 - `r(m+1)` includes the mass of all points larger than `s_m`.

 A `get_cdf(...)` query has the following guarantees;
 - Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are
 split points (items from the input domain) between the min and max items of
 the input stream.
 - Let `mass_i = r_(i+1) - r_i`.
 - Let `true_mass` be the true mass between the true ranks of `s_i`,
 `s_i+1` derived from the hypothetical sorted stream of all `N` items.
 - Let `eps = get_normalized_rank_error(true)`.
 - then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
 - `1 - r(m+1)` includes the mass of all points larger than `s_m`.

 From the above, it might seem like we could make some estimates to bound the
 `item` returned from a call to `get_quantile()`. The sketch, however, does not
 let us derive error bounds or confidences around items. Because errors are independent, we
 can approximately bracket a value as shown below, but there are no error estimates available.
 Additionally, the interval may be quite large for certain distributions.
 - Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
 - Let `eps = get_normalized_rank_error(false)`.
 - Let `q_lo = estimated quantile of rank (r - eps)`.
 - Let `q_hi = estimated quantile of rank (r + eps)`.
 - Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.

 .. note::
     For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``.

 .. note::
     Serializing and deserializing a :class:`kll_items_sketch` requires the use of a :class:`PyObjectSerDe`.


 .. autoclass:: kll_ints_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_normalized_rank_error

     .. rubric:: Static Methods:

     .. automethod:: deserialize
     .. automethod:: get_normalized_rank_error

     .. rubric:: Non-static Methods:

     .. automethod:: __init__

 .. autoclass:: kll_floats_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_normalized_rank_error

     .. rubric:: Static Methods:

     .. automethod:: deserialize
     .. automethod:: get_normalized_rank_error

     .. rubric:: Non-static Methods:

     .. automethod:: __init__

 .. autoclass:: kll_doubles_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_normalized_rank_error

     .. rubric:: Static Methods:

     .. automethod:: deserialize
     .. automethod:: get_normalized_rank_error

     .. rubric:: Non-static Methods:

     .. automethod:: __init__

 .. autoclass:: kll_items_sketch
     :members:
     :undoc-members:
     :exclude-members: deserialize, get_normalized_rank_error

     .. rubric:: Static Methods:

     .. automethod:: deserialize
     .. automethod:: get_normalized_rank_error

     .. rubric:: Non-static Methods:

     .. automethod:: __init__
	KLL Sketch
	----------

	.. currentmodule:: datasketches

	Implementation of a very compact quantiles sketch with lazy compaction scheme
	and nearly optimal accuracy per retained item.
	See `Optimal Quantile Approximation in Streams`.

	This is a stochastic streaming sketch that enables near real-time analysis of the
	approximate distribution of items from a very large stream in a single pass, requiring only
	that the items are comparable.
	The analysis is obtained using `get_quantile()` function or the
	inverse functions `get_rank()`, `get_pmf()` (Probability Mass Function), and `get_cdf()`
	(Cumulative Distribution Function).

	Given an input stream of `N` items, the `natural rank` of any specific
	item is defined as its index `(1 to N)` in inclusive mode
	or `(0 to N-1)` in exclusive mode
	in the hypothetical sorted stream of all `N` input items.

	The `normalized rank` (`rank`) of any specific item is defined as its
	`natural rank` divided by `N`.
	Thus, the `normalized rank` is between zero and one.
	In the documentation for this sketch `natural rank` is never used so any
	reference to just `rank` should be interpreted to mean `normalized rank`.

	This sketch is configured with a parameter `k`, which affects the size of the sketch
	and its estimation error.

	The estimation error is commonly called `epsilon` (or `eps`) and is a fraction
	between zero and one. Larger values of `k` result in smaller values of `epsilon`.
	Epsilon is always with respect to the rank and cannot be applied to the
	corresponding items.

	The relationship between the `normalized rank` and the corresponding items can be viewed
	as a two-dimensional monotonic plot with the `normalized rank` on one axis and the
	corresponding items on the other axis. If the y-axis is specified as the item-axis and
	the x-axis as the `normalized rank`, then `y = get_quantile(x)` is a monotonically
	increasing function.

	The function `get_quantile(rank)` translates ranks into
	corresponding quantiles. The functions `get_rank(item)`,
	`get_cdf(...)` (Cumulative Distribution Function), and `get_pmf(...)`
	(Probability Mass Function) perform the opposite operation and translate items into ranks.

	The `get_pmf(...)` function has about 13 to 47% worse rank error (depending
	on `k`) than the other queries because the mass of each "bin" of the PMF has
	"double-sided" error from the upper and lower edges of the bin as a result of a subtraction,
	as the errors from the two edges can sometimes add.

	The default `k` of 200 yields a "single-sided" `epsilon` of about 1.33% and a
	"double-sided" (PMF) `epsilon` of about 1.65%.

	A `get_quantile(rank)` query has the following guarantees:
	- Let `q = get_quantile(r)` where `r` is the rank between zero and one.
	- The quantile `q` will be an item from the input stream.
	- Let `true_rank` be the true rank of `q` derived from the hypothetical sorted
	stream of all `N` items.
	- Let `eps = get_normalized_rank_error(false)`.
	- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%. Note that the
	error is on the rank, not the quantile.

	A `get_rank(item)` query has the following guarantees:
	- Let `r = get_rank(i)` where `i` is an item between the min and max items of
	the input stream.
	- Let `true_rank` be the true rank of `i` derived from the hypothetical sorted
	stream of all `N` items.
	- Let `eps = get_normalized_rank_error(false)`.
	- Then `r - eps ≤ true_rank ≤ r + eps` with a confidence of 99%.

	A `get_pmf()` query has the following guarantees:
	- Let `{r1, r2, ..., r(m+1)} = get_pmf(s1, s2, ..., sm)` where `s1, s2` are
	split points (items from the input domain) between the min and max items of
	the input stream.
	- Let `mass_i = estimated mass between s_i and s_i+1`.
	- Let `true_mass` be the true mass between the items of `s_i`,
	`s_i+1` derived from the hypothetical sorted stream of all `N` items.
	- Let `eps = get_normalized_rank_error(true)`.
	- then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
	- `r(m+1)` includes the mass of all points larger than `s_m`.

	A `get_cdf(...)` query has the following guarantees;
	- Let `{r1, r2, ..., r(m+1)} = get_cdf(s1, s2, ..., sm)` where `s1, s2, ...` are
	split points (items from the input domain) between the min and max items of
	the input stream.
	- Let `mass_i = r_(i+1) - r_i`.
	- Let `true_mass` be the true mass between the true ranks of `s_i`,
	`s_i+1` derived from the hypothetical sorted stream of all `N` items.
	- Let `eps = get_normalized_rank_error(true)`.
	- then `mass - eps ≤ true_mass ≤ mass + eps` with a confidence of 99%.
	- `1 - r(m+1)` includes the mass of all points larger than `s_m`.

	From the above, it might seem like we could make some estimates to bound the
	`item` returned from a call to `get_quantile()`. The sketch, however, does not
	let us derive error bounds or confidences around items. Because errors are independent, we
	can approximately bracket a value as shown below, but there are no error estimates available.
	Additionally, the interval may be quite large for certain distributions.
	- Let `q = get_quantile(r)`, the estimated quantile of rank `r`.
	- Let `eps = get_normalized_rank_error(false)`.
	- Let `q_lo = estimated quantile of rank (r - eps)`.
	- Let `q_hi = estimated quantile of rank (r + eps)`.
	- Then `q_lo ≤ q ≤ q_hi`, with 99% confidence.

	.. note::
	For the :class:`kll_items_sketch`, objects must be comparable with ``__lt__``.

	.. note::
	Serializing and deserializing a :class:`kll_items_sketch` requires the use of a :class:`PyObjectSerDe`.


	.. autoclass:: kll_ints_sketch
	:members:
	:undoc-members:
	:exclude-members: deserialize, get_normalized_rank_error

	.. rubric:: Static Methods:

	.. automethod:: deserialize
	.. automethod:: get_normalized_rank_error

	.. rubric:: Non-static Methods:

	.. automethod:: __init__

	.. autoclass:: kll_floats_sketch
	:members:
	:undoc-members:
	:exclude-members: deserialize, get_normalized_rank_error

	.. rubric:: Static Methods:

	.. automethod:: deserialize
	.. automethod:: get_normalized_rank_error

	.. rubric:: Non-static Methods:

	.. automethod:: __init__

	.. autoclass:: kll_doubles_sketch
	:members:
	:undoc-members:
	:exclude-members: deserialize, get_normalized_rank_error

	.. rubric:: Static Methods:

	.. automethod:: deserialize
	.. automethod:: get_normalized_rank_error

	.. rubric:: Non-static Methods:

	.. automethod:: __init__

	.. autoclass:: kll_items_sketch
	:members:
	:undoc-members:
	:exclude-members: deserialize, get_normalized_rank_error

	.. rubric:: Static Methods:

	.. automethod:: deserialize
	.. automethod:: get_normalized_rank_error

	.. rubric:: Non-static Methods:

	.. automethod:: __init__