The goal of this article is to compare the accuracy performance of the Druid built-in Approximate Histogram to an exact, brute-force computation using actual data extracted from one of our back-end servers.
Please get familiar with the Definitions for quantiles.
The implementation of this Approximate Histogram was based on the February, 2010 paper by Yael Ben-Haim and Elad Tom-Tov of the IBM Haifa Research Lab. However, it should be noted that the authors of the paper specifially mention two caveats:
Unfortunately, a lot of the data that we tend to analyze is highly skewed so these caveats don't bode well for this study. Nonetheless, we thought it would be useful to test it anyway.
All we know about this stream is that it consists of consecutive strings of numeric values separated by a line-feed. Stored as a file, its size is about 2GB.
In order to measure the accuracy of the Approximate Histogram, we performed an exact, brute-force analysis of StreamA and computed the correct values for 101 quantiles (0, .01, .02, ... .99, 1.00). Examining these percentile values allows us to understand the actual distribution of values. From this we can then compute appropriate split-points from which we can compute and plot a reasonable looking PMF histogram. This turned out to be a exponential power series.
The green dots in the above plot represents the Gold Standard cumulative distribution (CDF) of ranks to quantile values. The black circles in the upper right corner of the plot represent the values returned from the getQuantiles(float[]) function.
The plot reveals a dramatic failure of the Approximate Histogram. Below rank = 0.89, the returned array had all zeros and it missed the very top of the distribution above rank = 0.99.
The next plot is the Histogram generated from the values returned by the Histogram class generated by the toHistogram(float[]) function.
Compared to the true histogram (green bars) the histogram produced by the AH algorithm is highly distorted. Note that the green spike at the low end is caused by zeros in the raw data, which is a common occurrence in a lot of data we collect. To make the distortion easier to visualize I plotted the same data using lines instead of bars in the following:
Note that the Approximate Histogram algorithm gives the user no warning that these serious distortions might be occurring, it just fails silently.
Users would be well-advised to not use this algorithm for serious analysis.