blob: 404b25b4dce5b4d91b7cdd19568f2db41ace9c96 [file] [log] [blame]
<!DOCTYPE html>
<!-- Start _layouts/doc_page.html-->
<html lang="en">
<head>
<!-- Start _include/site_head.html -->
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="">
<meta name="author" content="datasketches">
<title>DataSketches | </title>
<link rel="shortcut icon" href="/img/favicon.png">
<!-- original source: https://maxcdn.bootstrapcdn.com/font-awesome/4.1.0/css/font-awesome.min.css -->
<link rel="stylesheet" href="/css/font-awesome.min.css">
<!-- original source: https://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/css/bootstrap.min.css -->
<link rel="stylesheet" href="/css/bootstrap.min.css">
<link rel="stylesheet" href="/css/fonts.css" type="text/css">
<link rel="stylesheet" href="/css/main.css">
<link rel="stylesheet" href="/css/header.css">
<link rel="stylesheet" href="/css/footer.css">
<link rel="stylesheet" href="/css/syntax.css">
<link rel="stylesheet" href="/css/docs.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]},showMathMenu:false,showMathMenuMSIE:false,showProcessingMessages:false});
</script>
<!-- original source: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMX_HTML-full -->
<script type="text/javascript" src="/js/MathJax.js?config=TeX-AMS_HTML"></script>
<!-- original source: https://code.jquery.com/jquery.min.js -->
<script src="/js/jquery.min.js"></script>
<!-- original source: https://maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js -->
<script src="/js/bootstrap.min.js"></script> <!-- 3.2.0-->
<!-- End _include/site_head.html -->
</head>
<body>
<!-- Start _include/nav_bar.html -->
<div class="navbar navbar-inverse navbar-static-top ds-nav">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a href="/" style="padding-top: 0px; padding-bottom: 0px;">
<span class="ds-small-h-logo"></span></a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav navbar-right">
<li>
<a href="/docs/Background/TheChallenge.html">
<span class="fa fa-info-circle"></span> DOCUMENTATION</a>
</li>
<li>
<a href="/docs/Community/Downloads.html">
<span class="fa fa-download"></span> DOWNLOAD</a>
</li>
<!--
<li>
<a href="/docs/Architecture/Components.html">
<span class="fa fa-github"></span> GITHUB</a>
</li>
-->
<li>
<a href="/docs/Community/Research.html">
<span class="fa fa-paper-plane"></span> RESEARCH</a>
</li>
<li>
<a href="/docs/Community/index.html" style="padding-top: 0; padding-bottom: 0;">
<img class="ds-small-man" src="/img/datasketches-ManWhite.svg"/>COMMUNITY</a>
</li>
<li>
<ul class="nav navbar-nav navbar-right ds-nav">
<li class="dropdown ds-nav" >
<a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false" style="padding-top: 0; padding-bottom: 0;"><img class="apache-logo" src="/img/feather.svg"/>Apache <span class="caret"></span></a>
<ul class="dropdown-menu ds-nav">
<li><a href="https://www.apache.org/" target="_blank">Foundation</a></li>
<li><a href="https://www.apache.org/events/current-event" target="_blank">Events</a></li>
<li><a href="https://www.apache.org/licenses/" target="_blank">License</a></li>
<li><a href="https://privacy.apache.org/policies/privacy-policy-public.html" target="_blank">Privacy Policy</a></li>
<li><a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a></li>
<li><a href="https://www.apache.org/security/" target="_blank">Security</a></li>
<li><a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
<!-- End _include/nav_bar.html -->
<!-- Start _include/javadocs.html -->
<div class="ds-header">
<div class="container">
<h4>API Snapshots:
<a href="https://apache.github.io/datasketches-java/4.2.0/">Java Core</a>,
<a href="https://apache.github.io/datasketches-cpp/5.0.0/">C++ Core</a>,
<a href="https://apache.github.io/datasketches-python/main/">Python</a>,
<a href="https://apache.github.io/datasketches-memory/master/">Memory</a>,
<a href="/api/pig/snapshot/apidocs/index.html">Pig</a>,
<a href="/api/hive/snapshot/apidocs/index.html">Hive</a>,
</h4>
</div>
</div>
<!-- End _include/javadocs.html -->
<div class="container">
<div class="row">
<!-- Start ToC Block -->
<div class="col-md-3">
<div class="searchbox" style="position:relative">
<gcse:searchbox-only></gcse:searchbox-only>
</div>
<!-- Start _includes/toc.html -->
<!-- Computer Generated File, Do Not Edit! -->
<link rel="stylesheet" href="/css/toc.css">
<div id="toc" class="nav toc hidden-print">
<p id="background">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_background">Background</a>
</p>
<div class="collapse" id="collapse_background">
<li><a href="/docs/Background/TheChallenge.html">•The Challenge</a></li>
<li><a href="/docs/Background/SketchOrigins.html">•Sketch Origins</a></li>
<li><a href="/docs/Background/SketchElements.html">•Sketch Elements</a></li>
<li><a href="/docs/Background/Presentations.html">•Presentations</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/DataSketches_deck.pdf">•Overview Slide Deck</a></li>
</div>
<p id="architecture-and-design">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_architecture_and_design">Architecture And Design</a>
</p>
<div class="collapse" id="collapse_architecture_and_design">
<li><a href="/docs/Architecture/MajorSketchFamilies.html">•The Major Sketch Families</a></li>
<li><a href="/docs/Architecture/LargeScale.html">•Large Scale Computing</a></li>
<li><a href="/docs/Architecture/KeyFeatures.html">•Key Features</a></li>
<li><a href="/docs/Architecture/SketchFeaturesMatrix.html">•Sketch Features Matrix</a></li>
<li><a href="/docs/Architecture/Components.html">•Components</a></li>
<li><a href="/docs/Architecture/SketchesByComponent.html">•Sketches by Component</a></li>
<li><a href="/docs/Architecture/SketchCriteria.html">•Sketch Criteria</a></li>
<p id="memory-component">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_memory_component">Memory Component</a>
</p>
<div class="collapse" id="collapse_memory_component">
<li><a href="/docs/Memory/MemoryComponent.html">•Memory Component</a></li>
<li><a href="/docs/Memory/MemoryPerformance.html">•Memory Component Performance</a></li>
</div>
<li><a href="/docs/Architecture/OrderSensitivity.html">•Notes on Order Sensitivity</a></li>
<li><a href="/docs/Architecture/Concurrency.html">•Notes on Concurrency</a></li>
</div>
<p id="sketch-families">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sketch_families">Sketch Families</a>
</p>
<div class="collapse" id="collapse_sketch_families">
<p id="distinct-counting">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_distinct_counting">Distinct Counting</a>
</p>
<div class="collapse" id="collapse_distinct_counting">
<li><a href="/docs/DistinctCountFeaturesMatrix.html">•Features Matrix</a></li>
<li><a href="/docs/DistinctCountMeritComparisons.html">•Figures-of-Merit Comparison</a></li>
<p id="cpc-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_cpc_sketches">CPC Sketches</a>
</p>
<div class="collapse" id="collapse_cpc_sketches">
<li><a href="/docs/CPC/CPC.html">•CPC Sketch</a></li>
<li><a href="/docs/CPC/CpcPerformance.html">•CPC Sketch Performance</a></li>
<p id="cpc-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_cpc_examples">CPC Examples</a>
</p>
<div class="collapse" id="collapse_cpc_examples">
<li><a href="/docs/CPC/CpcJavaExample.html">•CPC Sketch Java Example</a></li>
<li><a href="/docs/CPC/CpcCppExample.html">•CPC Sketch C++ Example</a></li>
<li><a href="/docs/CPC/CpcPigExample.html">•CPC Sketch Pig UDFs</a></li>
<li><a href="/docs/CPC/CpcHiveExample.html">•CPC Sketch Hive UDFs</a></li>
</div>
</div>
<p id="hyperloglog-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hyperloglog_sketches">HyperLogLog Sketches</a>
</p>
<div class="collapse" id="collapse_hyperloglog_sketches">
<li><a href="/docs/HLL/HLL.html">•HLL Sketch</a></li>
<li><a href="/docs/HLL/HllMap.html">•HLL Map Sketch</a></li>
<p id="hll-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hll_examples">HLL Examples</a>
</p>
<div class="collapse" id="collapse_hll_examples">
<li><a href="/docs/HLL/HllJavaExample.html">•HLL Sketch Java Example</a></li>
<li><a href="/docs/HLL/HllCppExample.html">•HLL Sketch C++ Example</a></li>
<li><a href="/docs/HLL/HllPigUDFs.html">•HLL Sketch Pig UDFs</a></li>
<li><a href="/docs/HLL/HllHiveUDFs.html">•HLL Sketch Hive UDFs</a></li>
</div>
<p id="hll-studies">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_hll_studies">HLL Studies</a>
</p>
<div class="collapse" id="collapse_hll_studies">
<li><a href="/docs/HLL/HllPerformance.html">•HLL Sketch Performance</a></li>
<li><a href="/docs/HLL/Hll_vs_CS_Hllpp.html">•HLL vs Clearspring HLL++</a></li>
<li><a href="/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html">•HLL Sketch vs Druid HyperLogLogCollector</a></li>
</div>
</div>
<p id="theta-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_sketches">Theta Sketches</a>
</p>
<div class="collapse" id="collapse_theta_sketches">
<li><a href="/docs/Theta/ThetaSketchFramework.html">•Theta Sketch Framework</a></li>
<p id="theta-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_examples">Theta Examples</a>
</p>
<div class="collapse" id="collapse_theta_examples">
<li><a href="/docs/Theta/ConcurrentThetaSketch.html">•Concurrent Theta Sketch</a></li>
<li><a href="/docs/Theta/ThetaJavaExample.html">•Theta Sketch Java Example</a></li>
<li><a href="/docs/Theta/ThetaSparkExample.html">•Theta Sketch Spark Example</a></li>
<li><a href="/docs/Theta/ThetaPigUDFs.html">•Theta Sketch Pig UDFs</a></li>
<li><a href="/docs/Theta/ThetaHiveUDFs.html">•Theta Sketch Hive UDFs</a></li>
</div>
<p id="kmv-tutorial">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_kmv_tutorial">KMV Tutorial</a>
</p>
<div class="collapse" id="collapse_kmv_tutorial">
<li><a href="/docs/Theta/InverseEstimate.html">•The Inverse Estimate</a></li>
<li><a href="/docs/Theta/KMVempty.html">•Empty Sketch</a></li>
<li><a href="/docs/Theta/KMVfirstEst.html">•First Estimator</a></li>
<li><a href="/docs/Theta/KMVbetterEst.html">•Better Estimator</a></li>
<li><a href="/docs/Theta/KMVrejection.html">•Rejection Rules</a></li>
<li><a href="/docs/Theta/KMVupdateVkth.html">•Update V(kth) Rule</a></li>
</div>
<p id="set-operations-and-p-sampling">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_set_operations_and_p-sampling">Set Operations and P-sampling</a>
</p>
<div class="collapse" id="collapse_set_operations_and_p-sampling">
<li><a href="/docs/Theta/ThetaSketchSetOps.html">•Set Operations</a></li>
<li><a href="/docs/Theta/ThetaSetOpsCornerCases.html">•Model & Test Set Operations</a></li>
<li><a href="/docs/Theta/ThetaPSampling.html"><i>p</i>-Sampling</a></li>
</div>
<p id="accuracy">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_accuracy">Accuracy</a>
</p>
<div class="collapse" id="collapse_accuracy">
<li><a href="/docs/Theta/ThetaAccuracy.html">•Basic Accuracy</a></li>
<li><a href="/docs/Theta/ThetaAccuracyPlots.html">•Accuracy Plots</a></li>
<li><a href="/docs/Theta/ThetaErrorTable.html">•Relative Error Table</a></li>
<li><a href="/docs/Theta/ThetaSketchSetOpsAccuracy.html">•SetOp Accuracy</a></li>
<li><a href="/docs/Theta/AccuracyOfDifferentKUnions.html">•Unions With Different k</a></li>
</div>
<p id="size">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_size">Size</a>
</p>
<div class="collapse" id="collapse_size">
<li><a href="/docs/Theta/ThetaSize.html">•Theta Sketch Size</a></li>
</div>
<p id="speed">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_speed">Speed</a>
</p>
<div class="collapse" id="collapse_speed">
<li><a href="/docs/Theta/ThetaUpdateSpeed.html">•Update Speed</a></li>
<li><a href="/docs/Theta/ThetaMergeSpeed.html">•Merge Speed</a></li>
</div>
<p id="theta-sketch-theory">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_theta_sketch_theory">Theta Sketch Theory</a>
</p>
<div class="collapse" id="collapse_theta_sketch_theory">
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchFramework.pdf">•Theta Sketch Framework (PDF)</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/ThetaSketchEquations.pdf">•Theta Sketch Equations (PDF)</a></li>
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/DataSketches.pdf">•DataSketches (PDF)</a></li>
<li><a href="/docs/Theta/ThetaConfidenceIntervals.html">•Confidence Intervals Notes</a></li>
<li><a href="/docs/Theta/ThetaMergingAlgorithm.html">•Merging Algorithm Notes</a></li>
<li><a href="/docs/Theta/ThetaReferences.html">•Theta References</a></li>
</div>
</div>
<p id="tuple-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_tuple_sketches">Tuple Sketches</a>
</p>
<div class="collapse" id="collapse_tuple_sketches">
<li><a href="/docs/Tuple/TupleOverview.html">•Tuple Overview</a></li>
<p id="tuple-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_tuple_examples">Tuple Examples</a>
</p>
<div class="collapse" id="collapse_tuple_examples">
<li><a href="/docs/Tuple/TupleJavaExample.html">•Tuple Java Example</a></li>
<li><a href="/docs/Tuple/TupleEngagementExample.html">•Tuple Engagement Example</a></li>
<li><a href="/docs/Tuple/TuplePigUDFs.html">•Tuple Pig UDFs</a></li>
<li><a href="/docs/Tuple/TupleHiveUDFs.html">•Tuple Hive UDFs</a></li>
</div>
</div>
</div>
<p id="most-frequent">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_most_frequent">Most Frequent</a>
</p>
<div class="collapse" id="collapse_most_frequent">
<li><a href="/docs/Frequency/FrequencySketchesOverview.html">•Frequency Sketches Overview</a></li>
<p id="frequent-item-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_frequent_item_sketches">Frequent Item Sketches</a>
</p>
<div class="collapse" id="collapse_frequent_item_sketches">
<li><a href="/docs/Frequency/FrequentItemsOverview.html">•Frequent Items Overview</a></li>
<li><a href="/docs/Frequency/FrequentItemsErrorTable.html">•Frequent Items Error Table</a></li>
<li><a href="/docs/Frequency/FrequentItemsReferences.html">•Frequent Items References</a></li>
<li><a href="/docs/Frequency/FrequentItemsPerformance.html">•Frequent Items Performance</a></li>
<p id="most-frequent-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_most_frequent_examples">Most Frequent Examples</a>
</p>
<div class="collapse" id="collapse_most_frequent_examples">
<li><a href="/docs/Frequency/FrequentItemsJavaExample.html">•Frequent Items Java Example</a></li>
<li><a href="/docs/Frequency/FrequentItemsCppExample.html">•Frequent Items C++ Example</a></li>
<li><a href="/docs/Frequency/FrequentItemsPigUDFs.html">•Frequent Items Pig UDFs</a></li>
<li><a href="/docs/Frequency/FrequentItemsHiveUDFs.html">•Frequent Items Hive UDFs</a></li>
</div>
</div>
<p id="frequent-distinct-sketches">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_frequent_distinct_sketches">Frequent Distinct Sketches</a>
</p>
<div class="collapse" id="collapse_frequent_distinct_sketches">
<li><a href="/docs/Frequency/FrequentDistinctTuplesSketch.html">•Frequent Distinct Tuples Sketch</a></li>
</div>
</div>
<p id="quantiles-and-histograms">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_and_histograms">Quantiles And Histograms</a>
</p>
<div class="collapse" id="collapse_quantiles_and_histograms">
<li><a href="/docs/Quantiles/SketchingQuantilesAndRanksTutorial.html">•Quantiles and Ranks Tutorial</a></li>
<li><a href="/docs/Quantiles/QuantilesOverview.html">•Quantiles Overview</a></li>
<li><a href="/docs/KLL/KLLSketch.html">•KLL Floats sketch</a></li>
<li><a href="/docs/KLL/KLLAccuracyAndSize.html">•KLL Sketch Accuracy and Size</a></li>
<li><a href="/docs/REQ/ReqSketch.html">•REQ Floats sketch</a></li>
<li><a href="/docs/Quantiles/OrigQuantilesSketch.html">•Original QuantilesSketch</a></li>
<p id="quantiles-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_examples">Quantiles Examples</a>
</p>
<div class="collapse" id="collapse_quantiles_examples">
<li><a href="/docs/Quantiles/QuantilesJavaExample.html">•Quantiles Sketch Java Example</a></li>
<li><a href="/docs/KLL/KLLCppExample.html">•KLL Quantiles Sketch C++ Example</a></li>
<li><a href="/docs/Quantiles/QuantilesPigUDFs.html">•Quantiles Sketch Pig UDFs</a></li>
<li><a href="/docs/Quantiles/QuantilesHiveUDFs.html">•Quantiles Sketch Hive UDFs</a></li>
</div>
<p id="quantiles-studies">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_studies">Quantiles Studies</a>
</p>
<div class="collapse" id="collapse_quantiles_studies">
<li><a href="/docs/QuantilesStudies/DruidApproxHistogramStudy.html">•Druid Approximate Histogram</a></li>
<li><a href="/docs/QuantilesStudies/MomentsSketchStudy.html">•Moments Sketch Study</a></li>
<li><a href="/docs/QuantilesStudies/QuantilesStreamAStudy.html">•Quantiles StreamA Study</a></li>
<li><a href="/docs/QuantilesStudies/ExactQuantiles.html">•Exact Quantiles for Studies</a></li>
</div>
<p id="quantiles-sketch-theory">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_quantiles_sketch_theory">Quantiles Sketch Theory</a>
</p>
<div class="collapse" id="collapse_quantiles_sketch_theory">
<li><a href="https://github.com/apache/datasketches-website/tree/master/docs/pdf/Quantiles_KLL.pdf">•Optimal Quantile Approximation in Streams</a></li>
<li><a href="/docs/Quantiles/QuantilesReferences.html">•Quantiles References</a></li>
</div>
</div>
<p id="sampling">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sampling">Sampling</a>
</p>
<div class="collapse" id="collapse_sampling">
<li><a href="/docs/Sampling/ReservoirSampling.html">•Reservoir Sampling</a></li>
<li><a href="/docs/Sampling/ReservoirSamplingPerformance.html">•Reservoir Sampling Performance</a></li>
<li><a href="/docs/Sampling/VarOptSampling.html">•VarOpt Sampling</a></li>
<p id="sampling-examples">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_sampling_examples">Sampling Examples</a>
</p>
<div class="collapse" id="collapse_sampling_examples">
<li><a href="/docs/Sampling/ReservoirSamplingJava.html">•Reservoir Sampling Java Example</a></li>
<li><a href="/docs/Sampling/ReservoirSamplingPigUDFs.html">•Reservoir Sampling Pig UDFs</a></li>
<li><a href="/docs/Sampling/VarOptSamplingJava.html">•VarOpt Sampling Java Example</a></li>
<li><a href="/docs/Sampling/VarOptPigUDFs.html">•VarOpt Sampling Pig UDFs</a></li>
</div>
</div>
</div>
<p id="system-integrations">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_system_integrations">System Integrations</a>
</p>
<div class="collapse" id="collapse_system_integrations">
<li><a href="/docs/SystemIntegrations/ApacheDruidIntegration.html">•Using Sketches in ApacheDruid</a></li>
<li><a href="/docs/SystemIntegrations/ApacheHiveIntegration.html">•Using Sketches in Apache Hive</a></li>
<li><a href="/docs/SystemIntegrations/ApachePigIntegration.html">•Using Sketches in Apache Pig</a></li>
<li><a href="/docs/SystemIntegrations/PostgreSQLIntegration.html">•Using Sketches in PostgreSQL</a></li>
</div>
<p id="community">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_community">Community</a>
</p>
<div class="collapse" id="collapse_community">
<li><a href="/docs/Community/index.html">•Community</a></li>
<li><a href="/docs/Community/Downloads.html">•Downloads</a></li>
<li><a href="/docs/Community/NewCommitterProcess.html">•Committer Process</a></li>
<li><a href="/docs/Community/ReleaseProcessForCppComponents.html">•Release Process For CPP Components</a></li>
<li><a href="/docs/Community/ReleaseProcessForJavaComponents.html">•Release Process For Java Components</a></li>
<li><a href="/docs/Community/Transitioning.html">•Transitioning from prior GitHub Site</a></li>
</div>
<p id="research">
<a data-toggle="collapse" class="menu collapsed" href="#collapse_research">Research</a>
</p>
<div class="collapse" id="collapse_research">
<li><a href="/docs/Community/Research.html">•Research</a></li>
</div>
</div>
<!-- End _includes/toc.html -->
<!-- Start _includes/tocScript.html -->
<script>
(function () {
var findLineItem = function (path) {
return document.querySelector(`#toc [href="${path}"]`);
};
function findNavItem(path) {
return document.querySelector(`.nav [href="${path}"]`);
}
var highlighLineItem = function (element) {
element.classList.add('highlight');
};
var checkHasClass = function (element, className) {
return element.className.split(' ').find(function (item) { return item === className || '' })
}
var findAllCollapseParents = function (element) {
var collapseMenus = [];
var elementPointer = element;
while (elementPointer !== document.body) {
if (checkHasClass(elementPointer, 'collapse')) {
collapseMenus.push(elementPointer);
}
elementPointer = elementPointer.parentElement
}
return collapseMenus
};
var openMenuItem = function (element) {
// $(element).collapse('show') would start a transition, adding `in` class instead.
element.classList.add('in');
};
var openAllFromList = function (elementList) {
elementList.forEach(openMenuItem);
};
var highlightAndOpenMenu = function () {
// Highlight & expand nav item in the TOC
var currentLineItem = findLineItem(document.location.pathname);
highlighLineItem(currentLineItem);
openAllFromList(findAllCollapseParents(currentLineItem));
// Highlight nav item in top navigation
highlighLineItem(findNavItem(document.location.pathname));
};
$(highlightAndOpenMenu);
}());
</script>
<!-- End _includes/tocScript.html -->
</div>
<!-- End ToC Block -->
<div class="col-md-9 doc-content">
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<h1 id="hllsketch-vs-clearspring-hll-sketch"><em>HllSketch</em> vs Clearspring <em>HLL++</em> Sketch</h1>
<p>The DataSketches HyperLogLog <em>HllSketch</em>[1][2] implemented in this library has been highly optimized for speed, accuracy and size. The goal of this paper is to do an objective comparison of the <em>HllSketch</em> versus the popular Clearspring Technologies’ <em>HyperLogLogPlus</em>[3] implementation, which is based on Google’s HyperLogLog++ paper[4]. These tests were performed on the <em>HllSketch</em> release version 0.10.1, and on the <em>HyperLogLogPlus</em> version 2.9.5.</p>
<h2 id="hllsketch-vs-hyperloglogplus-error-behavior"><em>HllSketch</em> vs. <em>HyperLogLogPlus</em> Error Behavior</h2>
<h3 id="high-level-error-summary">High-Level Error Summary</h3>
<p>The <em>HllSketch</em> sketch has superior error properties compared to the <em>HyperLogLogPlus</em> sketch. This can be easily seen from the following side-by-side comparison. The <em>HllSketch</em> error curves are on the left.
The <em>HyperLogLogPlus</em> error curves are on the right.
<img class="doc-img-full" src="/docs/img/hll/HllVsHllppAcc.png" alt="/hll/HllVsHllppAcc.png" /></p>
<p>If the image is too small to read, right-click on the image and open it in a separate window.</p>
<blockquote>
<p>NOTE: These sketches were configured with <em>K = 2^21</em>, which is a VERY large sketch in order to be able to zoom in on the error detail in the low range, which is where most actual usage will occur.</p>
</blockquote>
<p>These plots are what we call <em>pitch-fork</em> plots. The X-axis is the number of unique counts presented to the sketch. The Y-axis is the Root-Mean-Squared of Relative Error (RMS-RE)[5] plotted as +/- percent where values above zero represent an overestimate and values below zero represent an underestimate.</p>
<p>The colored curves represent different quantile contours of the empirical error distribution. The orange and green curves are the contours corresponding to +/- one standard deviation from the mean error, and which define the 68% confidence bounds. The red and blue curves are the contours at +/- two standard deviations and define the 95.4% confidence bounds. The brown and purple curves are the contours at +/- three standard deviations and define the 99.7% confidence bounds. The mean (gray) and median (black) overlap each other and hug the axis close to zero.</p>
<h3 id="introduction-to-detailed-accuracy-measurements">Introduction to Detailed Accuracy Measurements</h3>
<p>Measuring the error properties of these stochastic algorithms is tricky and requires a great deal of thought into the design of the program that measures it. Getting smooth-looking plots requires many tens of thousands of trials, which even with fast CPUs requires a lot of time. The code used to produce the data for the plots in this paper can be found on the <a href="https://github.com/apache/datasketches-characterization">characterization repository</a></p>
<p>For accuracy purposes, the <em>HllSketch</em> sketch is configured with one parameter, <em>Log_2(K)</em> which we abbreviate as <em>LgK</em>. This defines the number of bins of the final HyperLogLog-Array (HLL-Array)[6] mode, and defines the bounds[7] on the accuracy of the sketch as well as its ultimate size. Thus, specifying a <em>LgK = 12</em>, means that the final HyperLogLog mode of the sketch will have <em>k = 2<sup>12</sup> = 4096</em> bins. A sketch with <em>LgK = 21</em> will ultimately have <em>k =2,097,152</em> bins, which is a very large sketch.</p>
<p>Similarly, the <em>HyperLogLogPlus</em> sketch has a parameter <em>p</em>, which acts the same as <em>LgK</em>.</p>
<p>An important property of a well-implemented HyperLogLog sketch is the error per bit of storage space for all values of <em>n</em>, where <em>n</em> is the number of uniques presented to the sketch. It would be wasteful to allocate the full HLL-Array of bins when the sketch has been presented with only a few values. As a result, typical implementations of these sketches have a <em>sparse</em> mode, where the early counts are encoded and cached. When the number of cached sparse values reaches a certain threshold (depending on the implementation), the sparse representations are flushed to the full HLL-Array. After this transition all new arriving input values are directly offered to the HLL-Array. The HLL-Array is effectively fixed in size thereafter no matter how large <em>n</em> gets. Thus, the sparse mode of operation allows the sketch to maintain a low error per stored-bit ratio for all values of <em>n</em>. We will compute this Measure of Merit for both sketches.</p>
<p>Both the <em>HllSketch</em> and the <em>HyperLogLogPlus</em> sketches have a sparse mode. The sparse mode is automatic for the <em>HllSketch</em> sketch, but the <em>HyperLogLogPlus</em> sketch requires a second configuration parameter <em>sp</em>. This is a number with similar properties to <em>p</em>. It is an exponent of 2 and it determines the error properties of the sparse mode. The documentation states that <em>sp</em> can be zero, in which case the sparse mode is disabled, otherwise, <em>p &lt;= sp &lt;= 32</em>.
This will become more clear when we see some of the plots.</p>
<p>As stated above, the X-axis is <em>n</em>, the true number of uniques presented to the sketch. The range of values on this X-axis is from 1 to 2<sup>24</sup>. There are 16 <em>trial-points</em> between each power of 2 along the X-axis (except at the very left end of the axis where there are not 16 distinct integers between the powers of 2). At each trial-point, 2<sup>16</sup> = 65536 trials are executed. The number of trials per trial-point is noted in the chart title as <em>LgT=16</em>. Each trial feeds a new sketch configured with either <em>LgK=21</em> or <em>p=21</em>, which is 2<sup>21</sup> = 2,097,152 bins. This is noted in the chart title as <em>LgK=21</em> or <em>p=21</em>. The program ensures that no two trials use the same uniques.</p>
<p>The Y-axis is a measure of the <em>Relative Error</em> of the sketch, where</p>
<p><em>RE</em> = Relative Error = <em>Measured</em> / <em>Truth</em> - 1.0</p>
<p>Since these sketches are stochastic, each trial will produce a probabilistic estimate of what the true value of <em>n</em> is. If the estimate is larger than <em>n</em> it is an overestimate and the resulting RE will be positive. If the estimate is an underestimate, the RE will be negative.</p>
<p>It is not practical to plot billions of RE values on a single chart. What we plot instead are the quantiles at chosen Fractional Ranks (FR) of the error distribution at each trial-point. When the quantiles at the same FR are connected by lines they form contours of the error distribution. For example, <em>FR = 0.5</em> defines the median of the distribution. In these plots the median is black and mostly hidden behind the mean, which is gray, both of which hug the X-axis close to zero.</p>
<p>At each X-axis trial-point the specified number of trials are executed. For each trial the number of uniques defined by the trial-point are <em>offered</em> (or <em>updated</em>) to the sketch under test. At the end of each trial the cardinality estimate is retrieved from the sketch and stored. At the end of the set of trials at the trial-point the statistics of the distribution of estimated cardinality values are computed.</p>
<p>From the cardinality estimate data of each trial-point, six different FR values have been chosen in addition to the median (0.5). These values correspond to the FR values of a Standard Normal Distribution at +/- 1, 2, and 3 Standard Deviations from the mean. These FR values can be computed from the CDF of the Normal Distribution as</p>
<p><em>FR = (1 + erf( s/√2 ))/2</em>, where <em>s = standard deviation</em>.</p>
<p>The translation from +/- <em>s</em> to fractional ranks is as follows:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Std Dev (s)</th>
<th style="text-align: left">Fractional Rank (FR)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">+3</td>
<td style="text-align: left">0.998650102</td>
</tr>
<tr>
<td style="text-align: center">+2</td>
<td style="text-align: left">0.977249868</td>
</tr>
<tr>
<td style="text-align: center">+1</td>
<td style="text-align: left">0.841344746</td>
</tr>
<tr>
<td style="text-align: center">0</td>
<td style="text-align: left">0.5</td>
</tr>
<tr>
<td style="text-align: center">-1</td>
<td style="text-align: left">0.158655254</td>
</tr>
<tr>
<td style="text-align: center">-2</td>
<td style="text-align: left">0.022750132</td>
</tr>
<tr>
<td style="text-align: center">-3</td>
<td style="text-align: left">0.001349898</td>
</tr>
</tbody>
</table>
<p>With sufficiently large values of <em>n</em> and <em>k</em>, the error distribution will approach the Normal Distribution due to the Central Limit Theorem. The resulting quantile curves using the above FR values allows us to associate the error distribution of the sketch with conventional confidence intervals commonly used in statistics.</p>
<p>For example, the area between the orange and the green curves corresponds to +/- 1 Standard Deviations or (0.841 - .158) = 68.3% confidence. The area between the red and the blue curves corresponds to +/- 2 Standard Deviations or (0.977 - 0.023) = 95.4% confidence. Similarly, the area between the brown and the purple curves corresponds to +/- 3 Standard Deviations or (0.998 - 0.001) = 99.7% confidence. This implies that out of 65,536 trials, about 196 (0.3%) of the estimates will be outside the brown and purple curves.</p>
<p>The mathematical theory[8] behind the HyperLogLog sketch predicts a <em>Relative Standard Error</em> (RSE) defined as</p>
<p><em>RSE<sub>HLL</sub> = F / √k</em>.</p>
<p>From Flajolet’s paper and using his estimator, <em>F = √(3 ln(2) - 1) ≈ 1.03896</em>.</p>
<p>When calculating the error from the trial set of estimates we compute the <em>Root Mean Square</em> of the <em>Relative Error</em> or <em>RMS-RE</em>. The RMS is related to the RSE as follows:</p>
<p><em>RSE<sup>2</sup> = σ<sup>2</sup> = 1/n Σ(x<sub>i</sub> - μ)<sup>2</sup> = Σ((x<sub>i</sub>)<sup>2</sup>) - μ<sup>2</sup></em>, where <em>x<sub>i</sub> = RE<sub>i</sub></em>.</p>
<p><em>RMS<sup>2</sup> = Σ((x<sub>i</sub>)<sup>2</sup>)</em></p>
<p>This means that the RSE can be computed from the RMS via the <em>mean</em> and visa-versa:</p>
<p><em>RMS<sup>2</sup> = RSE<sup>2</sup> + μ<sup>2</sup></em></p>
<p>If the mean is indeed zero then <em>RMS = RSE</em>.</p>
<p>The reason we use the RMS calculation is that it takes any bias into account, while a RSE calculation would not.</p>
<p>We chose a very large sketch for these experiments where <em>LgK = p = 21</em> so that the behavior of the sparse mode and the final HLL-Array mode can be easily observed.</p>
<p>This means that from the theory, a sketch of this size using Flajolet’s estimator will have an RSE of</p>
<p><em>RSE = 1.03896 / √(2<sup>21</sup>) ≈ .0007174 = 717.4 ppm</em></p>
<p>Any HLL sketch implementation that relies on the Flajolet HLL estimator will not be able to do better than this.</p>
<h3 id="hllsketch-measured-error"><em>HllSketch</em> Measured Error</h3>
<p>With the discussion of how these plots are generated, the plot of the accuracy of <em>HllSketch</em> sketch is as follows:
<img class="doc-img-full" src="/docs/img/hll/HllK21T16U24.png" alt="HllK21T16U24.png" /></p>
<p>For the right-hand portion of the plot where the sketch is in HLL-Array mode, instead of the Flajolet estimator, the <em>HllSketch</em> uses the newer <em>Historical Inverse Probability</em> (HIP)
estimator[9] where <em>F = 0.8326</em> resulting in an RSE = 575 ppm. This represents a 20% improvement in error over the Flajolet estimator.</p>
<p>The gridlines in this plot are at set to be multiples of the predicted RSE of the sketch.
Note that each of the curves appears to be approaching their respective gridline.
The green curve is approaching the first positive gridline, and the orange curve is approaching the first negative gridline, etc.
These curves will actually asymptote to these gridlines, but we would have to extend the X-axis out to nearly 100 million uniques to see it.
Because this would have taken weeks to compute, we won’t show this effect here. It would be more obvious with much smaller sketches.</p>
<p>(To generate this plot required &gt;10<sup>12</sup> updates and took 9 hours to complete.)</p>
<h4 id="the-measured-error-of-the-hllsketch-sparse-region">The Measured Error of the <em>HllSketch</em> Sparse Region</h4>
<p>Starting from the left, the error appears to be zero until the value 693 where the -3 StdDev curve (Q(.00135)) drops suddenly to about 0.14% and then gradually recovers. This is a normal phenomenon caused by quantization error as a result of counting discrete events, and can be explained as follows.</p>
<p>At the trial-point 693 there are exactly 693 unique values presented to a single sketch and then its cardinality estimate is retrieved.
This is repeated for every trial for 65,536 trials. The input values presented to the sketch are all unique, however, the internal representation of these input values has a finite precision of about 26 bits. The probability of two different input values producing the same 26-bit value, thus a collision, is determined by the Birthday Paradox, which in this case would predict a collision roughly once in <em>√(2<sup>26</sup>) = 8192</em> uniques. When this occurs, the sparse mode of the sketch rejects the input unique value as a duplicate when it actually is not. Thus, the sketch undercounts by one. Without collisions all the 65,536 values would be the correct value of 693. When collisions occur, some small fraction of these values become 692. When more than 0.135% of the total trials have have experienced one or more collisions, the quantile value at the Q(0.00135) contour will suddently jump from 693 to 692. This difference of one out of 693 is 0.144% which is the error value at that point. This is a perfectly normal occurrence for any stochastic counting process with a finite precision.</p>
<p>Moving to the right from this first downward spike reveals that this same quantization phenomenon occurs eventually on the the Q(.02275) contour and then later on the Q(.15866) contour.
The impact is smaller for these higher contours because the unique counts are significantly higher and the impact of the addition of one more collision is proportionally less.</p>
<p>Continuing to move to the right in the unique count range of about 32K to about 350K we observe that the 7 contours become parallel and smoother.
To explain what is going on in this region we need to zoom in.
<img class="doc-img-full" src="/docs/img/hll/HllK21T16U24_closeup.png" alt="HllK21T16U24_closeup.png" /></p>
<p>For this zoomed in plot each gridline is again multiples of the RMS-RE (or RSE-RE), but here the factor <em>F = 0.408</em> due to discoveries of new classes of estimators[10].
With the precision of 2<sup>26</sup>, the predicted RSE is <em>0.408 / 2<sup>13</sup> = 49.8 ppm</em> or about 50 ppm.
And as you can see the 7 quantile contours nicely approach their predicted asymptotes.</p>
<p>All of this demonstrates that the sketch is behaving as it should and matches the mathematical predictions.</p>
<h3 id="hyperloglogplus-sketch-measured-error"><em>HyperLogLogPlus</em> Sketch Measured Error</h3>
<p>Let’s see how the <em>HyperLogLogPlus</em> sketch behaves under the same test conditions. Here <em>LgK = p = 21</em> and <em>sp = 25</em>.</p>
<p>There is one caveat: Because the <em>HyperLogLogPlus</em> sketch is so slow, We had to reduce the number of trials from 65K to 16K per trial-point and it still took over 20 hours to produce the following graph:
<img class="doc-img-full" src="/docs/img/hll/HllppK21T14.png" alt="HllppK21T14.png" /></p>
<p>Look closely at the Y-axis scale, for this plot the Y-axis ranges from -0.5% to +0.5%. Compare the scale for the <em>HllSketch</em> plot where the Y-axis ranges from -0.1725% to +0.1725%!
The gridlines are spaced at an RSE of 717 ppm while the <em>HllSketch</em> sketch RSE and gridlines are spaced at 575 ppm. However, something is clearly amiss with the <em>HyperLogLogPlus</em> internal estimator which causes the estimates to zoom up exponentially to a high peak before finally settling down to the predicted quantile contours.</p>
<p>The plot below is the close-up of the sparse region of the <em>HyperLogLogPlus</em>. The sparse mode is indeed behaving with a precision of 25 bits, specified by <em>sp</em>.
Here the predicted <em>RSE = 0.707 / √2<sup>25</sup> = 122 ppm</em>, which is 2.2 times larger than that of the <em>HllSketch</em> sketch at 49.8 ppm.
The factor <em>0.707 = 1/√2</em> comes from the use of the “bitmap” estimator[10].
<img class="doc-img-full" src="/docs/img/hll/HllppK21T14_closeup.png" alt="HllppK21T14_closeup.png" /></p>
<p>The <em>HyperLogLogPlus</em> documentation claims that <em>sp</em> can be set as large as 32. However any value larger than 25 causes dramatic failure in estimation.
The following demonstrates what happens with a much smaller sketch of <em>LgK = p = 14</em> and <em>sp = 26</em>.
<img class="doc-img-full" src="/docs/img/hll/HllppK14T13SP26.png" alt="HllppK14T13SP26.png" /></p>
<p>The sketch fails when attempting to transition from sparse mode to normal HLL mode at about 3/4 K = 12288 uniques.
The error dives to - 35% when a sketch of this size should have an RSE of 0.8%.
The sketch provides no warning to the user that this is happening!</p>
<h3 id="a-measure-of-merit-error-for-a-given-size">A <em>Measure of Merit</em>: Error for a Given Size</h3>
<p>As described earlier, <em>RSE<sub>HLL</sub> = F / √k</em>.
If at every trial-point along the X-axis we multiply the measured RSE by the square-root of the serialized sketch size, we will have a <em>Measure of Merit</em> of the error efficiency of the sketch. For HLL-type sketches and large <em>n</em> this value should be asymptotic to a constant. In HLL-Array mode the space the sketch consumes is constant and the relative error will be a constant as well. Ideally, as the sketch grows as a function of <em>n</em>, the Measure of Merit in sparse mode and in HLL-Array mode will never be larger than its asymptotic value for large <em>n</em>.</p>
<p>This next plot computes this Measure of Merit for both the <em>HllSketch</em> sketch and the <em>HyperLogLogPlus</em> sketch.
<img class="doc-img-full" src="/docs/img/hll/HllVsHllppMerit.png" alt="HllVsHllppMerit.png" /></p>
<p>Observe that the <em>HllSketch</em> sketch is lower (i.e, better merit score) than the <em>HyperLogLogPlus</em> sketch except for the region that is roughly from 10% of <em>k</em> to about <em>3k/4</em>, where the the <em>HyperLogLogPlus</em> sketch is better.
This is because the designers of the <em>HyperLogLogPlus</em> sketch chose to do compression of the sparse data array every time a new value is entered into the sketch, which needs to be decompressed when an estimate is requested.</p>
<p>The advantage of compression is that it allows the switch from sparse to normal HLL-Array mode to be deferred to a larger fraction of <em>k</em>. But this decision comes with a severe penalty in the speed performance of the sketch. Opting out of using sparse mode will achieve higher speed performance, but at the cost of very poor Measure of Merit performance for small <em>n</em>.</p>
<p>Above <em>3k/4</em>, the <em>HyperLogLogPlus</em> sketch is not only considerably worse, but it fails the objective of always being less than the asymptotic value.</p>
<h2 id="hllsketch-vs-hyperloglogplus-update-speed-behavior"><em>HllSketch</em> vs. <em>HyperLogLogPlus</em> Update Speed Behavior</h2>
<p>Fortunately, the Update Speed performance is much easier to explain and show.</p>
<p>For these tests, at each trial point the sketch under test is presented with the number of uniques at that trial-point and the total time for the trial is measured, then divided by the number of uniques. This produces a measure of the time required per update. Also, we use a more moderate sized sketch of <em>LgK = p = 12</em> that is more typical of what might be used in practice. The <em>HllSketch</em> has 3 different types (HLL_4, HLL_6, and HLL_8) representing the respective sizes of the HLL-Array bins in bits. The speed behavior of all three of these types is very similar.</p>
<p>This first plot compares the resulting update speed.
Note the Y-axis scale is 250 nanoseconds.
<img class="doc-img-full" src="/docs/img/hll/upspeed/HllVsHllpUpdateSpeed.png" alt="HllVsHllpUpdateSpeed.png" /></p>
<p>The top black curve is the update speed performance of the <em>HyperLogLogPlus</em> sketch, which asymptotes at about 105 nanoseconds.
The lower curves are the update speed performance of the <em>HllSketch</em>, of which the HLL_8 and HLL_4 types asymptote to 10.5 nanoseconds.</p>
<p>This can be seen from a plot of just the <em>HllSketch</em> speed performance curves below.
Note the the Y-axis scale is now 50 nanoseconds.
<img class="doc-img-full" src="/docs/img/hll/upspeed/HllUpdateSpeed.png" alt="HllUpdateSpeed.png" />
The <em>HllSketch</em> is 2 orders-of-magnitude faster than the <em>HyperLogLogPlus</em> sketch.</p>
<h2 id="hllsketch-vs-hyperloglogplus-serialize-deserialize-speed-behavior"><em>HllSketch</em> vs. <em>HyperLogLogPlus</em> Serialize /Deserialize Speed Behavior</h2>
<p>The serialization and deserialization (ser-de) speed of the <em>HyperLogLogPlus</em> sketch is shown in the following plot.
Note that the Y-axis scale is 100,000 nanoseconds.
<img class="doc-img-full" src="/docs/img/hll/serde/HllpSerDeP12SP25T14.png" alt="HllpSerDeP12SP25T14.png" /></p>
<p>This next plot is the ser-de speed of the <em>HllSketch</em> in compact HLL_4 mode.
Note that the Y-axis scale is now 10,000 nanoseconds.
<img class="doc-img-full" src="/docs/img/hll/serde/Hll4SerDeK12T14.png" alt="Hll4SerDeK12T14.png" /></p>
<p>The <em>HllSketch</em> provides multiple options for serialization and deserialization.
The following ser-de is performed in Updatable form (Compact = false) which is much faster and requires a little more space.
Note that the Y-axis scale is now 1,000 nanoseconds.
<img class="doc-img-full" src="/docs/img/hll/serde/Hll8SerDeK12T16CompactF.png" alt="Hll8SerDeK12T16CompactF.png" /></p>
<p>In addition the <em>HllSketch</em> can operate fully <em>off-heap</em>.
In this mode, there is effectively no serialization, since the sketch can be updated off-heap.
The “deserialization”, when required, is just a wrapper class that contains a pointer to the off-heap data structure.
This deserialization is quite fast as shown in this next plot.
Note that the Y-axis scale is now 100 nanoseconds. Some of the peaks in these plots are due to GC pauses by the JVM.
<img class="doc-img-full" src="/docs/img/hll/serde/Hll8SerDeK12T16CompactFWrapT.png" alt="Hll8SerDeK12T16CompactFWrapT.png" /></p>
<p>Presenting the various <em>HllSketch</em> deserialization modes along with the <em>HyperLogLogPlus</em> sketch deserialization produces the following plot.
<img class="doc-img-full" src="/docs/img/hll/serde/HllVsHllpSummary.png" alt="HllVsHllpSummary.png" /></p>
<p>Depending on the chosen configuration, the <em>HllSketch</em> can be from one to almost three orders-of-magnitude faster than the <em>HyperLogLogPlus</em> sketch.</p>
<hr />
<ul>
<li>[1] <a href="https://github.com/apache/datasketches-java/tree/master/src/main/java/org/apache/datasketches/hll">DataSketches HllSketch GitHub</a></li>
<li>[2] <a href="/api/java/snapshot/apidocs/index.html">DataSketches HllSketch JavaDocs</a></li>
<li>[3] <a href="https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java">HyperLogLogPlus GitHub</a></li>
<li>[4] <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf">Google: HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm</a></li>
<li>[5] The Root-Mean-Square of the Relative Error (RMS-RE) is sensitive to bias of the mean if there is any. However, if the bias is zero RMS-RE will produce the same results as the theoretical Relative Standard Error (RSE) of the stochastic process.</li>
<li>[6] In this context “HLL-Array” refers to the stochastic process defined by Flajolet[8], where log-2(K) bits of the incoming hash select a bin in an array of size K, and the number of leading zeros plus one of the remaining hash bits define the value stored in the array. Both sketches described in this paper also implement a <em>sparse</em> mode process for low-valued counts that is distinct from the HLL-Array process.</li>
<li>[7] The word “bounds” is used to define a quantile contour of the empirical error distribution. A symmetrical pair of these bounds about the median then can be used to define a confidence interval.</li>
<li>[8] Philippe Flajolet, E ́ric Fusy, Olivier Gandouet, and Fre ́de ́ric Meunier. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In <em>DMTCS Conference on Analysis of Algorithms</em>, pages 137–156, 2007.</li>
<li>[9] Edith Cohen, All-Distances Sketches, Revisited: HIP Estimators for Massive Graphs Analysis, <em>ACM PODS 2014</em>.</li>
<li>[10] Kevin Lang, Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm. <a href="https://arxiv.org/abs/1708.06839">https://arxiv.org/abs/1708.06839</a>. References to the “bitmap” estimator can be found in Appendix C.</li>
</ul>
</div> <!-- End content -->
</div> <!-- End row -->
</div> <!-- End Container -->
<!-- Start _include/page_footer.html -->
<footer class="ds-footer">
<div class="container">
<div class="text-center">
<p>
<div>Copyright © 2024 <a href="https://www.apache.org">Apache Software Foundation</a>,
Licensed under the Apache License, Version 2.0. All Rights Reserved.
| <a href="https://privacy.apache.org/policies/privacy-policy-public.html">Privacy Policy</a><br/>
Apache DataSketches, Apache, the Apache feather logo, and the Apache DataSketches project logos are trademarks of The Apache Software Foundation.<br/>
All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</div>
</p>
</div>
</div>
</footer>
<!-- End _include/page_footer.html -->
</body>
</html>
<!-- End _layouts/doc_page.html-->