layout: doc_page

Original Quantiles Sketch

Quantiles Sketches Accuracy and Size

Please review the Quantiles Definitions.

The accuracy of a quantile sketch is a function of the configured value k, which also affects the overall size of the sketch.

Accuracy for quantiles sketches is specified and measured with respect to the rank only, not the values.

Absolute vs Relative Error

The Quantiles/DoublesSketch and the KLL Sketch have absolute error. For example, a specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could extract it from the set) should be between getQuantile(0.49) and getQuantile(0.51). This same 1% error applied at a rank of 0.95 means that the true value should be between getQuantile(0.94) and getQuantile(0.96). In other words, the error is a fixed +/- epsilon for the entire range of rank values.

The ReqSketch, however, has relative rank error and the user can choose which end of the rank domain should have high accuracy. Refer to the sketch documentation for more information.

Quantiles Sketches and Data Independence

A sketch is an implementation of a streaming algorithm. By definition, a sketch has only one chance to examine each item of the stream. It is this property that makes a sketch a streaming algorithm and useful for real-time analysis of very large streams that may be impractical to actually store.

We also assume that the sketch knows nothing about the input data stream: its length, the range of the values or how the values are distributed. If the authors of a particular algorithm require the user to know any of the above attributes of the input data stream in order to “tune” the algorithm, the algorithm is not data independent.

The only thing the user needs to know is how to extract the values from the stream so that they can be fed into the sketch. It is reasonable that the user knows the type of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.

Accuracy Information for the org.apache.datasketches.quantiles Sketch Package

A k of 256 produces a normalized rank error of less than 1%. For example, the median value returned from getQuantile(0.5) will be between the actual values from the hypothetically sorted array of input values at normalized ranks of 0.49 and 0.51, with a confidence of about 99%.

The approximate error (epsilon) values listed in the second row of the header in the table below can be computed using the function double DoubleSketch.getNormalizedRankError(int k). The values in this table match very closely with empirical measurements up to k = 1024 at the 99% confidence level. Beyond k = 1024, the estimated error is somewhat speculative, but should be reasonable guidance. These error values simultaneously apply to all half-open intervals (-Infinity, Q] and closed intervals [Q1, Q2].

The following graphs illustrate the ability of the Quantiles DoublesSketch to characterize value distributions.

  • 1024 (n) values (trueUnsortedValues) were generated from Random's nextGaussian(). These values were then sorted (trueSortedValues) and assigned their true normalized ranks (trueRanks) from 0 to (n-1)/n.

  • A DoublesSketch (sketch) was then created with K = 32 and updated with the trueUnsortedValues.

  • sketch.getCDF(trueSortedValues) produced an ordered array of estimated ranks (estimatedRanks) . The estimatedRanks (red) were then plotted against the trueRanks (black). The upper bound error was plotted in blue and the lower bound error was plotted in green.

  • The error of the estimation from the sketch is called “Epsilon” and is always with respect to the rank axis. It was plotted as a visual bar on the graph to illustrate its size.

  • The size of this sketch, if stored, would be about 1832 bytes.

  • A getQuantiles(trueRanks) produced an ordered array estimatedSortedValues, which correspond to the trueRanks. Plotting the estimatedSortedValues against the trueSortedValues produces the inverse CDF plot as follows:

The absolute rank error vs the trueRanks produced the following graph.

All of the above plots were generated from one trial, and is not a test of the error bounds.

The following plot illustrates the 99th percentile of observed maximum normalized rank error of DoublesSketch with k=128 in 1000 trials at each stream length. The code to reproduce this measurement is available in the DataSketches/characterization repository. Note that these measurements are not directly comparable to the values in the table above as this graph plots the error for only the half-open intervals (-Infinity, Q], which is relevant to simple queries such as getRank(value).