| <!DOCTYPE HTML> |
| <!-- NewPage --> |
| <html lang="en"> |
| <head> |
| <!-- Generated by javadoc --> |
| <title>org.apache.datasketches.hll (datasketches-java 4.0.0 API)</title> |
| <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> |
| <link rel="stylesheet" type="text/css" href="../../../../stylesheet.css" title="Style"> |
| <link rel="stylesheet" type="text/css" href="../../../../jquery/jquery-ui.min.css" title="Style"> |
| <link rel="stylesheet" type="text/css" href="../../../../jquery-ui.overrides.css" title="Style"> |
| <script type="text/javascript" src="../../../../script.js"></script> |
| <script type="text/javascript" src="../../../../jquery/jszip/dist/jszip.min.js"></script> |
| <script type="text/javascript" src="../../../../jquery/jszip-utils/dist/jszip-utils.min.js"></script> |
| <!--[if IE]> |
| <script type="text/javascript" src="../../../../jquery/jszip-utils/dist/jszip-utils-ie.min.js"></script> |
| <![endif]--> |
| <script type="text/javascript" src="../../../../jquery/jquery-3.6.0.min.js"></script> |
| <script type="text/javascript" src="../../../../jquery/jquery-ui.min.js"></script> |
| </head> |
| <body> |
| <script type="text/javascript"><!-- |
| try { |
| if (location.href.indexOf('is-external=true') == -1) { |
| parent.document.title="org.apache.datasketches.hll (datasketches-java 4.0.0 API)"; |
| } |
| } |
| catch(err) { |
| } |
| //--> |
| var pathtoroot = "../../../../"; |
| var useModuleDirectories = true; |
| loadScripts(document, 'script');</script> |
| <noscript> |
| <div>JavaScript is disabled on your browser.</div> |
| </noscript> |
| <header role="banner"> |
| <nav role="navigation"> |
| <div class="fixedNav"> |
| <!-- ========= START OF TOP NAVBAR ======= --> |
| <div class="topNav"><a id="navbar.top"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.top" title="Skip navigation links">Skip navigation links</a></div> |
| <a id="navbar.top.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../index.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList" id="allclasses_navbar_top"> |
| <li><a href="../../../../allclasses.html">All Classes</a></li> |
| </ul> |
| <ul class="navListSearch"> |
| <li><label for="search">SEARCH:</label> |
| <input type="text" id="search" value="search" disabled="disabled"> |
| <input type="reset" id="reset" value="reset" disabled="disabled"> |
| </li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_top"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| <noscript> |
| <div>JavaScript is disabled on your browser.</div> |
| </noscript> |
| </div> |
| <a id="skip.navbar.top"> |
| <!-- --> |
| </a></div> |
| <!-- ========= END OF TOP NAVBAR ========= --> |
| </div> |
| <div class="navPadding"> </div> |
| <script type="text/javascript"><!-- |
| $('.navPadding').css('padding-top', $('.fixedNav').css("height")); |
| //--> |
| </script> |
| </nav> |
| </header> |
| <main role="main"> |
| <div class="header"> |
| <h1 title="Package" class="title">Package org.apache.datasketches.hll</h1> |
| </div> |
| <div class="contentContainer"> |
| <section><a id="package.description"> |
| <!-- --> |
| </a> |
| <div class="block"><h1>The DataSketches™ HLL sketch family package</h1> |
| <a href="HllSketch.html" title="class in org.apache.datasketches.hll"><code>HllSketch</code></a> and <a href="Union.html" title="class in org.apache.datasketches.hll"><code>Union</code></a> |
| are the public facing classes of this high performance implementation of Phillipe Flajolet's |
| HyperLogLog algorithm[1] but with significantly improved error behavior and important features that can be |
| essential for large production systems that must handle massive data. |
| |
| <h2>Key Features of the DataSketches™ HLL Sketch and its companion Union</h2> |
| |
| <h3>Advanced Estimation Algorithms for Optimum Accuracy</h3> |
| |
| <h4>Zero error at low cardinalities</h4> |
| The HLL sketch leverages highly compact arrays and hash tables to keep exact counts until the transition to |
| dense mode is required for space reasons. The result is perfect accuracy for very low cardinalities. |
| |
| <p>Accuracy for very small streams can be important because Big Data is often fragmented into millions of smaller |
| streams (or segments) that inevitably are power-law distributed in size. If you are sketching all these fragments, |
| as a general rule, more than 80% of your sketches will be very small, 20% will be much larger, and only a few very |
| large in cardinality. |
| |
| <h4>HIP / Martingale Estimator</h4> |
| When obtaining a cardinality estimate, the sketch automatically determines if it was the result of the capture of |
| a single stream, or if was the result of certain qualifying union operations. If this is the case the sketch will |
| take advantage of Edith Cohen's Historical Inverse Probability (HIP) estimation algorithm[2], which was |
| also independently developed by Daniel Ting as the Martingale estimation algorithm[3]. |
| This will result in a 20% improvement in accuracy over the standard Flajolet estimator. |
| If it is not a single stream or if the specific union operation did not qualify, |
| the estimator will default to the Composite Estimator. |
| |
| <h4>Composite Estimator</h4> |
| This advanced estimator is a blend of several algorithms including new algorithms developed by Kevin Lang for his |
| Compressed Probabilistic Counting (CPC) sketch[4]. These algorithms provide near optimal estimation accuracy |
| for cases that don't qualify for HIP / Martingale estimation. |
| |
| <p>As a result of all of this work on accuracy, one will get a very smooth curve of the underlying accuracy of the |
| sketch once the statistical randomness is removed through multiple trials. This can be observed in the |
| following graph.</p> |
| |
| <p><img src="doc-files/HLL_HIP_K12T20U20.png" width="500" alt="HLL Accuracy">[6]</p> |
| |
| <p>The above graph has 7 curves. At y = 0, is the median line that hugs the x-axis so closely that it can't be seen. |
| The two curves, just above and just below the x-axis, correspond to +/- 1 standard deviation (SD) of error. |
| The distance between either one of this pair and the x-axis is also known as the Relative Standard Error (RSE). |
| This type of graph for illustrating sketch error we call a "pitchfork plot".</p> |
| |
| <p>The next two curves above and below correspond to +/- 2 SD, and |
| the top-most and bottom-most curves correspond to +/- 3 SD. |
| The chart grid lines are set at +/- multiples of Relative Standard Error (RSE) that correspond to +/- 1,2,3 SD. |
| Below the cardinality of about 512 there is no error at all. This is the point where this particular |
| sketch transitions from sparse to dense (or estimation) mode.</p> |
| |
| <h3>Three HLL Types</h3> |
| This HLL implementation offers three different types of HLL sketch, each with different |
| trade-offs with accuracy, space and performance. These types are selected with the |
| <a href="TgtHllType.html" title="enum in org.apache.datasketches.hll"><code>TgtHllType</code></a> parameter. |
| |
| <p>In terms of accuracy, all three types, for the same <i>lgConfigK</i>, have the same error |
| distribution as a function of cardinality.</p> |
| |
| <p>The configuration parameter <i>lgConfigK</i> is the log-base-2 of <i>K</i>, |
| where <i>K</i> is the number of buckets or slots for the sketch. <i>lgConfigK</i> impacts both accuracy and |
| the size of the sketch in memory and when stored.</p> |
| |
| <h4>HLL 8</h4> |
| This uses an 8-bit byte per HLL bucket. It is generally the |
| fastest in terms of update time but has the largest storage footprint of about <i>K</i> bytes. |
| |
| <h4>HLL 6</h4> |
| This uses a 6-bit field per HLL bucket. It is the generally the next fastest |
| in terms of update time with a storage footprint of about <i>3/4 * K</i> bytes. |
| |
| <h4>HLL 4</h4> |
| This uses a 4-bit field per HLL bucket and for large counts may require |
| the use of a small internal auxiliary array for storing statistical exceptions, which are rare. |
| For the values of <i>lgConfigK > 13</i> (<i>K</i> = 8192), |
| this additional array adds about 3% to the overall storage. It is generally the slowest in |
| terms of update time, but has the smallest storage footprint of about <i>K/2 * 1.03</i> bytes. |
| |
| <h3>Off-Heap Operation</h3> |
| This HLL sketch also offers the capability of operating off-heap. Given a <i>WritableMemory[5]</i> object |
| created by the user, the sketch will perform all of its updates and internal phase transitions |
| in that object, which can actually reside either on-heap or off-heap based on how it was |
| configured. In large systems that must update and union many millions of sketches, having the |
| sketch operate off-heap avoids the serialization and deserialization costs of moving sketches from heap to |
| off-heap and back, and reduces the need for garbage collection. |
| |
| <h3>Merging sketches with different configured <i>lgConfigK</i></h3> |
| This enables a user to union a HLL sketch that was configured with, say, <i>lgConfigK = 12</i> |
| with another loaded HLL sketch that was configured with, say, <i>lgConfigK = 14</i>. |
| |
| <p>Why is this important? Suppose you have been building a history of sketches of your customer's |
| data that go back a full year (or 5 or 10!) that were all configured with <i>lgConfigK = 12</i>. Because sketches |
| are so much smaller than the raw data it is possible that the raw data was discarded keeping only the sketches. |
| Even if you have the raw data, it might be very expensive and time consuming to reload and rebuild all your |
| sketches with a larger more accurate size, say, <i>lgConfigK = 14</i>. |
| This capability enables you to merge last year's data with this year's data built with larger sketches and still |
| have meaningful results.</p> |
| |
| <p>In other words, you can change your mind about what size sketch you need for your application at any time and |
| will not lose access to the data contained in your older historical sketches.</p> |
| |
| <p>This capability does come with a caveat: The resulting accuracy of the merged sketch will be the accuracy of the |
| smaller of the two sketches. Without this capability, you would either be stuck with the configuration you first |
| chose forever, or you would have to rebuild all your sketches from scratch, or worse, not be able to recover your |
| historical data.</p> |
| |
| <h3>Multi-language, multi-platform.</h3> |
| The binary structures for our sketch serializations are language and platform independent. |
| This means it is possible to generate an HLL sketch on a C++ Windows platform and it can be used on a |
| Java or Python Unix platform. |
| |
| <p>[1] Philippe Flajolet, et al, <a href="https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf"> |
| <i>HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.</i></a> |
| DMTCS proc. <b>AH</b>, 2007, 127-146. |
| |
| <p>[2] Edith Cohen, <a href="https://arxiv.org/pdf/1306.3284.pdf"> |
| <i>All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis.</i></a> |
| PODS'14, June 22-27, Snowbird, UT, USA. |
| |
| <p>[3] Daniel Ting, |
| <a href="https://research.facebook.com/publications/streamed-approximate-counting-of-distinct-elements"> |
| <i>Streamed Approximate Counting of Distinct Elements, Beating Optimal Batch Methods.</i></a> |
| KDD'14 August 24, 2014 New York, New York USA. |
| |
| <p>[4] Kevin Lang, |
| <a href="https://arxiv.org/abs/1708.06839"> |
| <i>Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.</i></a> |
| arXiv 1708.06839, August 22, 2017, Yahoo Research. |
| |
| <p>[5] Memory Component, |
| <a href="https://datasketches.apache.org/docs/Memory/MemoryComponent.html"> |
| <i>DataSketches Memory Component</i></a> |
| |
| <p>[6] MacBook Pro 2.3 GHz 8-Core Intel Core i9</div> |
| <dl> |
| <dt><span class="simpleTagLabel">Author:</span></dt> |
| <dd>Lee Rhodes, Kevin Lang</dd> |
| <dt><span class="seeLabel">See Also:</span></dt> |
| <dd><a href="../cpc/CpcSketch.html" title="class in org.apache.datasketches.cpc"><code>CpcSketch</code></a></dd> |
| </dl> |
| </section> |
| <ul class="blockList"> |
| <li class="blockList"> |
| <table class="typeSummary"> |
| <caption><span>Class Summary</span><span class="tabEnd"> </span></caption> |
| <tr> |
| <th class="colFirst" scope="col">Class</th> |
| <th class="colLast" scope="col">Description</th> |
| </tr> |
| <tbody> |
| <tr class="altColor"> |
| <th class="colFirst" scope="row"><a href="HllSketch.html" title="class in org.apache.datasketches.hll">HllSketch</a></th> |
| <td class="colLast"> |
| <div class="block">The HllSketch is actually a collection of compact implementations of Phillipe Flajolet’s HyperLogLog (HLL) |
| sketch but with significantly improved error behavior and excellent speed performance.</div> |
| </td> |
| </tr> |
| <tr class="rowColor"> |
| <th class="colFirst" scope="row"><a href="Union.html" title="class in org.apache.datasketches.hll">Union</a></th> |
| <td class="colLast"> |
| <div class="block">This performs union operations for all HllSketches.</div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </li> |
| <li class="blockList"> |
| <table class="typeSummary"> |
| <caption><span>Enum Summary</span><span class="tabEnd"> </span></caption> |
| <tr> |
| <th class="colFirst" scope="col">Enum</th> |
| <th class="colLast" scope="col">Description</th> |
| </tr> |
| <tbody> |
| <tr class="altColor"> |
| <th class="colFirst" scope="row"><a href="TgtHllType.html" title="enum in org.apache.datasketches.hll">TgtHllType</a></th> |
| <td class="colLast"> |
| <div class="block">Specifies the target type of HLL sketch to be created.</div> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </li> |
| </ul> |
| </div> |
| </main> |
| <footer role="contentinfo"> |
| <nav role="navigation"> |
| <!-- ======= START OF BOTTOM NAVBAR ====== --> |
| <div class="bottomNav"><a id="navbar.bottom"> |
| <!-- --> |
| </a> |
| <div class="skipNav"><a href="#skip.navbar.bottom" title="Skip navigation links">Skip navigation links</a></div> |
| <a id="navbar.bottom.firstrow"> |
| <!-- --> |
| </a> |
| <ul class="navList" title="Navigation"> |
| <li><a href="../../../../index.html">Overview</a></li> |
| <li class="navBarCell1Rev">Package</li> |
| <li>Class</li> |
| <li><a href="package-use.html">Use</a></li> |
| <li><a href="package-tree.html">Tree</a></li> |
| <li><a href="../../../../deprecated-list.html">Deprecated</a></li> |
| <li><a href="../../../../index-all.html">Index</a></li> |
| <li><a href="../../../../help-doc.html">Help</a></li> |
| </ul> |
| </div> |
| <div class="subNav"> |
| <ul class="navList" id="allclasses_navbar_bottom"> |
| <li><a href="../../../../allclasses.html">All Classes</a></li> |
| </ul> |
| <div> |
| <script type="text/javascript"><!-- |
| allClassesLink = document.getElementById("allclasses_navbar_bottom"); |
| if(window==top) { |
| allClassesLink.style.display = "block"; |
| } |
| else { |
| allClassesLink.style.display = "none"; |
| } |
| //--> |
| </script> |
| <noscript> |
| <div>JavaScript is disabled on your browser.</div> |
| </noscript> |
| </div> |
| <a id="skip.navbar.bottom"> |
| <!-- --> |
| </a></div> |
| <!-- ======== END OF BOTTOM NAVBAR ======= --> |
| </nav> |
| <p class="legalCopy"><small>Copyright © 2015–2022 <a href="https://www.apache.org/">The Apache Software Foundation</a>. All rights reserved.</small></p> |
| </footer> |
| </body> |
| </html> |