layout: doc_page

Druid Approximate Histogram Study

The goal of this article is to compare the accuracy performance of the Druid built-in Approximate Histogram to an exact, brute-force computation using actual data extracted from one of our back-end servers.

Please get familiar with the Definitions for quantiles.

Versions

  • Druid ApproximateHistogram.java Nov 6, 2018

The Paper

The implementation of this Approximate Histogram was based on the February, 2010 paper by Yael Ben-Haim and Elad Tom-Tov of the IBM Haifa Research Lab. However, it should be noted that the authors of the paper specifially mention two caveats:

  • “The second category, to which our algorithm belongs, consists of heuristics that work well empirically and demand low amounts of space, but lack any rigorous accuracy analysis.” Page 856, section 2.4.1.
  • “...when the data distribution is highly skewed, the accuracy of [our] on-line histogram decays ...[so] practitioners may prefer to resort to guaranteed accuracy algorithms.” Page 857, first paragraph.

Unfortunately, a lot of the data that we tend to analyze is highly skewed so these caveats don't bode well for this study. Nonetheless, we thought it would be useful to test it anyway.

The Input Data

All we know about this stream is that it consists of consecutive strings of numeric values separated by a line-feed. Stored as a file, its size is about 2GB.

Creating the “Gold Standard” of Truth

In order to measure the accuracy of the Approximate Histogram, we performed an exact, brute-force analysis of StreamA and computed the correct values for 101 quantiles (0, .01, .02, ... .99, 1.00). Examining these percentile values allows us to understand the actual distribution of values. From this we can then compute appropriate split-points from which we can compute and plot a reasonable looking PMF histogram. This turned out to be a exponential power series.

The Results

The green dots in the above plot represents the Gold Standard cumulative distribution (CDF) of ranks to quantile values. The black circles in the upper right corner of the plot represent the values returned from the getQuantiles(float[]) function.

The plot reveals a dramatic failure of the Approximate Histogram. Below rank = 0.89, the returned array had all zeros and it missed the very top of the distribution above rank = 0.99.

The next plot is the Histogram generated from the values returned by the Histogram class generated by the toHistogram(float[]) function.

Compared to the true histogram (green bars) the histogram produced by the AH algorithm is highly distorted. Note that the green spike at the low end is caused by zeros in the raw data, which is a common occurrence in a lot of data we collect. To make the distortion easier to visualize I plotted the same data using lines instead of bars in the following:

Note that the Approximate Histogram algorithm gives the user no warning that these serious distortions might be occurring, it just fails silently.
Users would be well-advised to not use this algorithm for serious analysis.