---
layout: doc_page
---

Druid Approximate Histogram Study

The goal of this article is to compare the accuracy of the Druid built-in Approximate Histogram against an exact, brute-force computation using actual data extracted from one of our back-end servers.

Please get familiar with the Definitions for quantiles.

Versions

  • Druid ApproximateHistogram.java Nov 6, 2018

The Paper

The implementation of this Approximate Histogram was based on the February 2010 paper by Yael Ben-Haim and Elad Tom-Tov of the IBM Haifa Research Lab. However, it should be noted that the authors of the paper specifically mention two caveats:

  • “The second category, to which our algorithm belongs, consists of heuristics that work well empirically and demand low amounts of space, but lack any rigorous accuracy analysis.” Page 856, section 2.4.1.
  • “...when the data distribution is highly skewed, the accuracy of [our] on-line histogram decays ...[so] practitioners may prefer to resort to guaranteed accuracy algorithms.” Page 857, first paragraph.

Unfortunately, a lot of the data that we tend to analyze is highly skewed so these caveats don't bode well for this study. Nonetheless, we thought it would be useful to test it anyway.
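For context, the core idea of the paper's on-line histogram can be sketched as follows. This is our own minimal illustration of the bin-merging heuristic, not Druid's implementation: keep at most a fixed number of (centroid, count) bins, and when a new value overflows the budget, merge the two closest adjacent bins into their weighted centroid.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the Ben-Haim/Tom-Tov streaming histogram idea
// (illustrative only; class and method names are our own).
public class StreamingHistogram {
    private final int maxBins;
    // Each bin is a {centroid, count} pair, kept sorted by centroid.
    private final List<double[]> bins = new ArrayList<>();

    public StreamingHistogram(int maxBins) { this.maxBins = maxBins; }

    public void offer(double v) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < v) i++;
        bins.add(i, new double[] {v, 1});
        if (bins.size() > maxBins) mergeClosest();
    }

    // Merge the adjacent pair of bins whose centroids are closest.
    private void mergeClosest() {
        int best = 0;
        double bestGap = Double.MAX_VALUE;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) { bestGap = gap; best = i; }
        }
        double[] a = bins.get(best), b = bins.remove(best + 1);
        double count = a[1] + b[1];
        a[0] = (a[0] * a[1] + b[0] * b[1]) / count; // weighted centroid
        a[1] = count;
    }

    public int binCount() { return bins.size(); }

    public double totalCount() {
        double s = 0;
        for (double[] b : bins) s += b[1];
        return s;
    }
}
```

Because the merge step collapses nearby values into a single centroid, long sparse tails of a skewed distribution are quickly absorbed into a few coarse bins, which is consistent with the authors' second caveat.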

The Input Data

The data file used for this evaluation, streamA.txt, is real data extracted from one of our back-end servers. It represents one hour of time-spent data. The data in this file has a smooth, well-behaved value distribution with a wide dynamic range. It is a text file consisting of consecutive strings of numeric values separated by line-feeds. Its size is about 2GB.

Creating the “Gold Standard” of Truth

In order to measure the accuracy of the Approximate Histogram, we performed an exact, brute-force analysis of streamA.txt.

  • 1st scan: Scan the file to determine the number of items, n, in the file and the type of the numeric values in the data (they are all integers).
  • Create a temporary primitive integer array of size n.
  • 2nd scan: Extract the values from the file and load them into the integer array.
  • Sort the array.
  • 3rd scan: Extract 101 quantile values at the ranks (0, .01, .02, ... .99, 1.00).
  • This allows us to plot the rank-to-quantile cumulative distribution, from which we choose appropriate split-point values that produce a good-looking PMF plot. These turned out to follow an exponential power series. Choosing the split points requires some judgment and could be automated, but we did not bother to do so.
  • 4th scan: Extract the ranks at the chosen split-point values and denormalize each rank by multiplying by n. The difference between adjacent denormalized ranks is the mass between the respective split points.
  • This allows us to plot the PMF.

This whole process is executed by the Exact Analysis profiler (bottom of page).
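The quantile- and mass-extraction steps above can be sketched as follows. This is our own minimal illustration over a small in-memory array, not the actual Exact Analysis profiler code; the nearest-rank rule and all names here are our assumptions.

```java
import java.util.Arrays;

// Sketch of the brute-force analysis: given sorted data, extract
// exact quantiles at normalized ranks, and masses between split points.
public class ExactAnalysis {
    // Exact quantile at each normalized rank (nearest-rank rule).
    static int[] quantiles(int[] sorted, double[] ranks) {
        int n = sorted.length;
        int[] out = new int[ranks.length];
        for (int i = 0; i < ranks.length; i++) {
            int idx = (int) Math.min(Math.floor(ranks[i] * n), n - 1);
            out[i] = sorted[idx];
        }
        return out;
    }

    // Mass (item count) between consecutive split points, computed as the
    // difference of the denormalized ranks at each split point.
    static long[] masses(int[] sorted, int[] splitPoints) {
        long[] out = new long[splitPoints.length + 1];
        long prev = 0;
        for (int i = 0; i < splitPoints.length; i++) {
            long r = lowerBound(sorted, splitPoints[i]); // # items below split point
            out[i] = r - prev;
            prev = r;
        }
        out[splitPoints.length] = sorted.length - prev;
        return out;
    }

    // Index of the first element >= key in a sorted array.
    private static long lowerBound(int[] a, int key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 9, 3, 7, 2, 8, 4, 6, 0}; // stand-in for streamA.txt values
        Arrays.sort(data);
        System.out.println(Arrays.toString(quantiles(data, new double[] {0.0, 0.5, 1.0}))); // [0, 5, 9]
        System.out.println(Arrays.toString(masses(data, new int[] {3, 7}))); // [3, 4, 3]
    }
}
```

The real profiler streams the 2GB file in multiple scans as described above; the logic per value is the same.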

The Results

The green dots in the above plot represent the Gold Standard cumulative distribution (CDF) of ranks to quantile values. The black circles in the upper right corner of the plot represent the values returned from the Approximate Histogram getQuantiles(float[]) function.

The plot reveals a dramatic failure of the Approximate Histogram: below rank = 0.89, the returned array contained only zeros.

The next plot is the histogram generated from the values of the Histogram object returned by the toHistogram(float[]) function.

Compared to the true histogram (green bars), the histogram produced by the AH algorithm is highly distorted. Note that the green spike at the low end is caused by zeros in the raw data, which is a common occurrence in a lot of the data we collect. To make the distortion easier to visualize, the same data are plotted using lines instead of bars in the following:

Note that the Approximate Histogram tool gives the user no warning that these serious distortions might be occurring; it just fails silently.

Users would be well-advised to not use this tool for serious analysis.

The Evaluation Source

The following are used to create the plots above.

Run the above profilers as Java applications, supplying the config file as the single argument. The program will check whether the data file has been unzipped and, if not, will unzip it for you.