docs/Quantiles/QuantilesAccuracy.md

layout: doc_page

Quantiles Accuracy and Size

The accuracy of this sketch is a function of the configured value k, which also affects the overall size of the sketch. Accuracy of this quantile sketch is always with respect to the normalized rank. A k of 256 produces a normalized rank error of less than 1%. For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetically sorted array of input values at normalized ranks of 0.49 and 0.51, with a confidence of about 99%.

The approximate error (epsilon) values listed in the second row of the header in the table below can be computed using the function double DoubleSketch.getNormalizedRankError(int k). The values in this table match very closely with empirical measurements up to k = 1024 at the 99% confidence level. Beyond k = 1024, the estimated error is somewhat speculative, but should be reasonable guidance. These error values simultaneously apply to all half-open intervals (-Infinity, Q] and closed intervals [Q1, Q2].

The following graphs illustrate the ability of the Quantiles DoublesSketch to characterize value distributions.

1024 (n) values (trueUnsortedValues) were generated from Random's nextGaussian(). These values were then sorted (trueSortedValues) and assigned their true normalized ranks (trueRanks) from 0 to (n-1)/n.
A DoublesSketch (sketch) was then created with K = 32 and updated with the trueUnsortedValues.
sketch.getCDF(trueSortedValues) produced an ordered array of estimated ranks (estimatedRanks) . The estimatedRanks (red) were then plotted against the trueRanks (black). The upper bound error was plotted in blue and the lower bound error was plotted in green.
The error of the estimation from the sketch is called “Epsilon” and is always with respect to the rank axis. It was plotted as a visual bar on the graph to illustrate its size.
The size of this sketch, if stored, would be about 1832 bytes.

A getQuantiles(trueRanks) produced an ordered array estimatedSortedValues, which correspond to the trueRanks. Plotting the estimatedSortedValues against the trueSortedValues produces the inverse CDF plot as follows:

The absolute rank error vs the trueRanks produced the following graph.

All of the above plots were generated from one trial, and is not a test of the error bounds.

The following plot illustrates the 99th percentile of observed maximum normalized rank error of DoublesSketch with k=128 in 1000 trials at each stream length. The code to reproduce this measurement is available in the DataSketches/characterization repository. Note that these measurements are not directly comparable to the values in the table above as this graph plots the error for only the half-open intervals (-Infinity, Q], which is relevant to simple queries such as getRank(value).