Add Druid Approximate Histogram Study

commit: f948f0e33e81283b72210b37cb2e238efafae17f [log] [tgz]
author: lrhodes <lrhodes@yahoo-inc.com> Thu Jan 31 18:32:52 2019 -0800
committer: lrhodes <lrhodes@yahoo-inc.com> Thu Jan 31 18:32:52 2019 -0800
tree: 451a76d99c36b121ca05491682925a6ba465ed9b
parent: 8b5c9e43f130d3f7744280ab3edde8b691b4c75e [diff]
diff --git a/.gitignore b/.gitignore
index 35239a9..c7456c0 100644
--- a/.gitignore
+++ b/.gitignore

@@ -38,3 +38,4 @@
 *.old
 .paper/
 tmp/
+/local/

diff --git a/docs/Quantiles/Definitions.md b/docs/Quantiles/Definitions.md
new file mode 100644
index 0000000..143e4e1
--- /dev/null
+++ b/docs/Quantiles/Definitions.md

@@ -0,0 +1,119 @@
+---
+layout: doc_page
+---
+
+# Definitions for Quantiles Studies
+
+The goal of this article is to compare the accuracy performance of the Druid built-in Approximate Histogram to an exact, brute-force computation using actual data extracted from one of our back-end servers. 
+
+
+## Hypothetically Sorted Stream (HSS) of Values
+Consider an input stream of *n* values. Now imagine if this input stream was sorted from the smallest to largest value. This is what we call the *hypothetical sorted stream* (HSS) of input values. We call it *hypothetical* because we never actually perform this sort (it would be very expensive) so we ask you to imagine it. 
+
+## Rank and Mass
+We define the *natural rank* of a value in this HSS as its zero-based index that ranges from 0 to *n-1*. If we divide the natural rank by *n*, it becomes the *normalized or fractional rank*, and would range from 0 to *(n-1)/n* and be in the interval [0, 1). In our documentation when we refer to *rank* we are referring to the *normalized or fractional rank* and not the *natural rank*.
+
+This *normalized rank* is closely associated with the concept of *mass*. If there are 1 million values in the HSS, then the value associated with the rank 0.5 represents the median value, or the center-point of the HSS *mass*, where half of the values (500,000) are below the median and half are above. 
+
+## getQuantile(rank) and getRank(value)
+We now have two hypothetical streams of values, a HSS of values and the stream of ranks where each value in the HSS is assigned a unique rank. 
+The definiton of *quantile* is the *value* associated with a particular *rank*. So the function *getQuantile(r)* returns the value associated rank *r*.
+Similarly, the function *getRank(v)* returns the rank associated with the value *v*.
+
+If a particular value <i>v</i> is repeated several times in the HSS, there is obviously a range of ranks that are assigned to each of the identical <i>v</i> values. This results in asymmetry between *v = getQuantile(r)* and *r = getRank(v)*. So how do we answer the following two queries when there are duplicate values in the HSS?? 
+
+* What rank is returned from the query <i>getRank(v)</i>? 
+* What value is returned from the query <i>getQuantile(r)</i>? 
+
+It turns out that there are 3 different rules for *getRank(v)* that you will find in the literature when *v* has duplicates in the HSS:
+
+* <i>min rank rule</i>: return the smallest rank associated with the range of duplicate values <i>v</i>.
+* <i>max rank rule</i>: return the largest rank associated with the range of duplicate values <i>v</i>.
+* <i>mid rank rule</i>: return the rank closest to the midpoint of the range of duplicate values <i>v</i>.
+
+In our analysis and algorithms we use the <i>min rank rule</i>.
+
+If the value provided to *getRank(v)* does not exist in the HSS, the rule is to return the rank of the next largest value greater than *v*.
+This is the rank that would include *v* as if it existed.
+
+For example: given the following HSS and associated ranks:
+
+```
+HSS
+   Val  Rank
+  1.00  0.00
+  2.00  0.11
+  3.00  0.22
+  5.00  0.33
+  5.00  0.44
+  5.00  0.56
+  7.00  0.67
+  8.00  0.78
+  9.00  0.89
+```
+
+The *getRank(v)* will return the following:
+
+```
+getRank()
+   Val  Rank
+  0.00  0.00
+  0.50  0.00
+  1.00  0.00
+  1.50  0.11
+  2.00  0.11
+  2.50  0.22
+  3.00  0.22
+  3.50  0.33
+  4.00  0.33
+  4.50  0.33
+  5.00  0.33
+  5.50  0.67
+  6.00  0.67
+  6.50  0.67
+  7.00  0.67
+  7.50  0.78
+  8.00  0.78
+  8.50  0.89
+  9.00  0.89
+```
+
+For the *getQuantile(r)* query:
+
+* <i>Strictly less-than quantile rule</i>: return the value associated with the largest rank that is strictly less-than *r*.
+
+This can be observed in the following:
+
+```
+getQuantile()
+  Rank   Val
+  0.00  1.00
+  0.10  1.00
+  0.20  2.00
+  0.30  3.00
+  0.40  5.00
+  0.50  5.00
+  0.60  5.00
+  0.70  7.00
+  0.80  8.00
+  0.90  9.00
+  1.00  9.00
+```
+
+### Accuracy
+Accuracy for quantiles sketches is measured with respect to the *rank* only.  Not the values.  
+
+So a specified accuracy of 1% at the median (rank = 0.50) means that the true value (if you could compute it from the HSS) should be 
+between *getQuantile(0.49)* and *getQuantile(0.51)*. Note that this is an absolute rank error and not a relative rank error. 
+Measured at the 10th percentile means that the true value (if you could compute it from the HSS) should be 
+between *getQuantile(0.09)* and *getQuantile(0.11)*. 
+
+### The Sketch and Data Independence
+A *sketch* is an implementation of a *streaming algorithm*. By definition, a sketch has only one chance to examine each item of the stream.  It is this property that makes a sketch a *streaming* algorithm and useful for real-time analysis of very large streams that may be impractical to actually store. 
+
+We also assume that the sketch knows nothing about the input data stream: its length, the range of the values or how the values are distributed. 
+If the authors of a particular algorithm require the user to know any of the above attributes of the input data stream in order to "tune" the algorithm, the algorithm is not a sketch as it would require multiple passes through the input data in order to produce correct results.
+
+The only thing the sketch user needs to know is how to extract the values from the stream so that they can be fed into the sketch. It is also reasonable that the user knows the *type* of values in the stream: e.g., are they alphanumeric strings, numeric strings, or numeric primitives. These properties may determine the type of sketch to use as well as how to extract the appropriate quantities to feed into the sketch.  
+
+

diff --git a/docs/Quantiles/DruidApproxHistogramStudy.md b/docs/Quantiles/DruidApproxHistogramStudy.md
new file mode 100644
index 0000000..39b7616
--- /dev/null
+++ b/docs/Quantiles/DruidApproxHistogramStudy.md

@@ -0,0 +1,58 @@
+---
+layout: doc_page
+---
+
+# Druid Approximate Histogram Study
+
+The goal of this article is to compare the accuracy performance of the Druid built-in Approximate Histogram to an exact, brute-force computation using actual data extracted from one of our back-end servers. 
+
+Please get familiar with the [Definitions]({{site.docs_dir}}/Quantiles/Definitions.html) for quantiles.
+
+## Versions
+
+* Druid <a href="https://github.com/apache/incubator-druid/blob/master/extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogram.java">ApproximateHistogram.java Nov 6, 2018</a>
+
+
+## The Paper
+
+The implementation of this Approximate Histogram was based on the February, 2010 
+[paper](http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf) 
+by Yael Ben-Haim and Elad Tom-Tov of the IBM Haifa Research Lab. However, it should be noted that the authors of the paper specifially mention two caveats:
+
+* "The second category, to which our algorithm belongs, consists of heuristics that work well empirically and demand low amounts of space, but lack any rigorous accuracy analysis." Page 856, section 2.4.1.
+* "...when the data distribution is highly skewed, the accuracy of [our] on-line histogram decays ...[so] practitioners may prefer to resort to guaranteed accuracy algorithms." Page 857, first paragraph.
+
+Unfortunately, a lot of the data that we tend to analyze is highly skewed so these caveats don't bode well for this study.  Nonetheless, we thought it would be useful to test it anyway.
+
+## The Input Data = StreamA
+All we know about this stream is that it consists of consecutive strings of numeric values separated by a line-feed. Stored as a file, its size is about 2GB.
+
+
+## Creating the "Gold Standard" of <i>Truth</i> 
+In order to measure the accuracy of the Approximate Histogram, we performed an exact, brute-force analysis of StreamA and computed the correct values for 101 quantiles (0, .01, .02, ... .99, 1.00). 
+Examining these percentile values allows us to understand the actual distribution of values.  From this we can then compute appropriate split-points from which we can compute and plot a reasonable looking PMF histogram. This turned out to be a exponential power series.
+
+## The Results
+
+<img class="doc-img-full" src="{{site.docs_img_dir}}/quantiles/DruidAH_StreamA_CDF.png" alt="Druid Approx Hist CDF of ranks to quantiles" />  
+
+The green dots in the above plot represents the Gold Standard cumulative distribution (CDF) of ranks to quantile values. The black circles in the upper right corner of the plot represent the values returned from the *getQuantiles(float[])* function. 
+
+The plot reveals a dramatic failure of the Approximate Histogram. Below rank = 0.89, the returned array had all zeros and it missed the very top of the distribution above rank = 0.99. 
+
+The next plot is the Histogram generated from the values returned by the *Histogram* class generated by the *toHistogram(float[])* function.
+
+<img class="doc-img-full" src="{{site.docs_img_dir}}/quantiles/DruidAH_StreamA_PMF1.png" alt="Druid Approx Hist PMF of values to counts" />
+
+Compared to the true histogram (green bars) the histogram produced by the AH algorithm is highly distorted. Note that the green spike at the low end is caused by zeros in the raw data, which is a common occurrence in a lot of data we collect.  To make the distortion easier to visualize I plotted the same data using lines instead of bars in the following:
+
+<img class="doc-img-full" src="{{site.docs_img_dir}}/quantiles/DruidAH_StreamA_PMF2.png" alt="Druid Approx Hist PMF of values to counts" />
+
+Note that the Approximate Histogram algorithm gives the user no warning that these serious distortions might be occurring, it just fails silently.  
+Users would be well-advised to not use this algorithm for serious analysis.
+
+
+
+
+
+ 

diff --git a/docs/img/quantiles/DSQsketch_StreamA_CDF.png b/docs/img/quantiles/DSQsketch_StreamA_CDF.png
new file mode 100644
index 0000000..bda1c07
--- /dev/null
+++ b/docs/img/quantiles/DSQsketch_StreamA_CDF.png
Binary files differ

diff --git a/docs/img/quantiles/DSQsketch_StreamA_PMF.png b/docs/img/quantiles/DSQsketch_StreamA_PMF.png
new file mode 100644
index 0000000..bc6faff
--- /dev/null
+++ b/docs/img/quantiles/DSQsketch_StreamA_PMF.png
Binary files differ

diff --git a/docs/img/quantiles/DruidAH_StreamA_CDF.png b/docs/img/quantiles/DruidAH_StreamA_CDF.png
new file mode 100644
index 0000000..341c8a8
--- /dev/null
+++ b/docs/img/quantiles/DruidAH_StreamA_CDF.png
Binary files differ

diff --git a/docs/img/quantiles/DruidAH_StreamA_PMF1.png b/docs/img/quantiles/DruidAH_StreamA_PMF1.png
new file mode 100644
index 0000000..f07336b
--- /dev/null
+++ b/docs/img/quantiles/DruidAH_StreamA_PMF1.png
Binary files differ

diff --git a/docs/img/quantiles/DruidAH_StreamA_PMF2.png b/docs/img/quantiles/DruidAH_StreamA_PMF2.png
new file mode 100644
index 0000000..a583e51
--- /dev/null
+++ b/docs/img/quantiles/DruidAH_StreamA_PMF2.png
Binary files differ

diff --git a/docs/img/quantiles/MSketch_StreamA_CDF.png b/docs/img/quantiles/MSketch_StreamA_CDF.png
new file mode 100644
index 0000000..10dfd37
--- /dev/null
+++ b/docs/img/quantiles/MSketch_StreamA_CDF.png
Binary files differ

diff --git a/docs/toc.md b/docs/toc.md
index 4afe97e..c10bd8d 100644
--- a/docs/toc.md
+++ b/docs/toc.md

@@ -56,11 +56,12 @@
   <li><a href="{{site.docs_dir}}/Quantiles/QuantilesHiveUDFs.html">Quantiles Sketch Hive UDFs</a></li>
   <li><a href="{{site.docs_dir}}/Quantiles/KLLSketch.html">New KLL sketch and comparison with DoublesSketch</a></li>
   <li><a href="{{site.docs_dir}}/Quantiles/KllSketchVsTDigest.html">KLL sketch vs t-digest</a></li>
+  <li><a href="{{site.docs_dir}}/Quantiles/KllSketchVsTDigest.html">KLL sketch vs t-digest</a></li>
 
 <h3><a data-toggle="collapse" class="menu collapsed" href="#collapse_quantilesTheory">Quantiles Sketch Theory</a></h3>
 <div class="collapse" id="collapse_quantilesTheory">
   <li><a href="{{site.docs_pdf_dir}}/Quantiles_KLL.pdf">Optimal Quantile Approximation in Streams</a></li>
-  <li><a href="{{site.docs_dir}}/Quantiles/QuantilesReferences.html">Quantiles References</a></li>
+  <li><a href="{{site.docs_dir}}/Quantiles/DruidApproxHistogramStudy.html">Druid Approximate Histogram Study</a></li>
 </div>
 </div>
commit	f948f0e33e81283b72210b37cb2e238efafae17f	[log] [tgz]
author	lrhodes <lrhodes@yahoo-inc.com>	Thu Jan 31 18:32:52 2019 -0800
committer	lrhodes <lrhodes@yahoo-inc.com>	Thu Jan 31 18:32:52 2019 -0800
tree	451a76d99c36b121ca05491682925a6ba465ed9b
parent	8b5c9e43f130d3f7744280ab3edde8b691b4c75e [diff]