layout: doc_page

ReqSketch Accuracy with Random Shuffled Streams

This set of tests characterize the accuracy (or more precisely the rank error) of the ReqSketch using random shuffled streams. The goal of this suite of tests is to understand how the rank error of the sketch behaves across all ranks. All of these tests are run with the same stream length. The two major parameters that are varied are the sketch's K, which affects size and accuracy of the sketch, and the HRA / LRA parameter, which selects the region of highest accuracy is the high ranks close to 1.0 High Ranks Accuracy (HRA), or the low ranks close to zero Low Ranks Accuracy (LRA).

These tests also confirm that the a priori prediction of the error bounds are reasonable and relatively conservative. The computation of these bounds are based on empirical measurements derived from tests such as these and are subject to some tuning as we understand the sketch's error behavior over a wider selection of streams.

For those that are interested in the actual code that run these tests can examine the following links.

  • Code: The code used to generate these characterization studies.
  • Config: The human readable and editable configuration file that instructs the above code with the specific properties used to run the test. These configuration properties are different for each of the following plots and summarized below with each plot.

Test Design

  • Stream Length (SL): 2^20
  • Stream Values: Natural numbers, ℕ1, from 1 to SL, expressed as 32-bit floats.
  • Y-axis: The absolute error of the sketch getRank(value) method.
  • X-axis: The normalized rank [0.0, 1.0]
  • Plot Points (PP): 100. Equally spaced points along the X-axis starting at 0.0 and ending at 1.0.
  • Trial:
    • Randomly Shuffles the input stream and presents it to the sketch.
    • Execute estRanks[] = getRanks(trueValues[]) from the sketch where the array of trueValues[] are the 100 equally spaced values along the x-axis that exactly correspond to the above Plot Points. The estRanks[] is the array of the sketch's estimates of the ranks of the given trueValues.
    • Computes the absolute error: estRanks[i] - trueRanks[i] for each Plot Point and submits this error to a corresponding array of standard qantile sketches, with one quantile sketch assigned to each Plot Point.
  • Trials:
    • 2^12 trials are executed. This produces a distribution of error at each Plot Point held by the error quantile skeches.
    • Note: With these tests randomization is due to two sources, the shuffling of the stream and the internal random process of the sketch itself.
    • Extract 7 quantiles from the error quantile sketches at each Plot Point. These error quantiles correspond to the standard normal distribuion at the median, +/- 1SD, +/- 2SD and +/- 3SD, where SD stands for Standard Deviation from the mean.
  • Plotting:
    • Each of the error quantiles are connected by lines to form contours of the error distribution where the area between the +/- 1SD contours is the size of the 68% confidence interval; between the +/- 2SD coutours is the 95.4% confidence interval; and between the +/- 3SD contours is the 99.7% confidence interval.
    • In addition to the error contours. 6 dashed contours (with colors corresponding to the error contours) represent the a priori estimates of the error at each of the +/- standard deviations computed from the sketch's getRankLowerBound(double, int) and getRankUpperBound(double, int) methods.

Common Configuration for the following plots

  • SL=2^20: StreamLength
  • Eq Spaced: Equally spaced Plot Points (PP)
  • PP=100: Number of plot points on the x-axis
  • LgT=12: Number of trials = 2^12
  • Shuffled: Random shuffle of the input stream for each trial

Results

Plot 1: K=12, HRA, LT

Specific Configuration

  • K=12: the sketch sizing & accuracy parameter
  • HRA: High Rank Accuracy
  • Crit=LT: Comparison criterion: LT = Less-Than
  • Retained Items: 2015
  • Serialization Bytes: 8676
  • Run Time: 14:01
  • Unique Error Value Count Per Plot Point: Min=459, Max=3055

Plot 2: K=12, LRA, LE

Specific Configuration

  • K=12: the sketch sizing & accuracy parameter
  • LRA: Low Rank Accuracy
  • Crit=LE: Comparison criterion: LE = Less-Than or Equal
  • Retained Items: 2015
  • Serialization Bytes: 8676
  • Run Time: 14:52
  • Unique Error Value Count Per Plot Point: Min=447, Max=3106

Plot 1 and 2, Final Compactor Profile & Size

Because these two tests are run with the same exact sketch configuration and the same input stream length they both have the same final compactor profile, retained items and Serialization Bytes.

Lg WtItemsNominal SizeSection SizeNum SectionsNum Compactions
019219242485812
118819242436251
218419242416596
32051924247509
41881924243185
51401446121340
6173144612590
7138144612261
8187144612108
913814461235
1094968613
113696864
12128721231
1324721230
  • Retained Items: 2015
  • Serialization Bytes: 8676

Plot 3: K=50, HRA, LT

Specific Configuration

  • K=50: the sketch sizing & accuracy parameter
  • HRA: High Rank Accuracy
  • Crit=LT: Comparison criterion: LT = Less-Than
  • Retained Items: 6903
  • Serialization Bytes: 28144
  • Run Time: 8:38
  • Unique Error Value Count Per Plot Point: Min=156, Max=2465

Plot 4: K=50, LRA, LE

Specific Configuration

  • K=50: the sketch sizing & accuracy parameter
  • LRA: Low Rank Accuracy
  • Crit=LE: Comparison criterion: LE = Less-Than or Equal
  • Retained Items: 6903
  • Serialization Bytes: 28144
  • Run Time: 27:20
  • Unique Error Value Count Per Plot Point: Min=167, Max=2475

Plot 4B: K=50, LRA, LE, Using Theoretical Bounds

  • K=50: the sketch sizing & accuracy parameter
  • LRA: Low Rank Accuracy
  • Crit=LE: Comparison criterion: LE = Less-Than or Equal

Plot 3 and 4, Final Compactor Profile & Size

Because these two tests are run with the same exact sketch configuration and the same input stream length they both have the same final compactor profile, retained items and Serialization Bytes.

Lg WtItemsNominal SizeSection SizeNum SectionsNum Compactions
0786864182416288
191986418246755
284686418242979
353157624121308
46135762412578
55295762412246
65285762412106
7781576241233
862143236613
93644323664
102503005031
111353005030
  • Retained Items: 6903
  • Serialization Bytes: 28144