<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | How We Scaled HyperLogLog: Three Real-World Optimizations</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="/assets/css/font-awesome-5.css">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
</head>
<body>
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
<div class="title-spacer"></div>
<img class="title-image" src="http://metamarkets.com/wp-content/uploads/2014/02/sequoia-600x400.jpg" alt="How We Scaled HyperLogLog: Three Real-World Optimizations"/>
</div>
</div>
</div>
</div>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>How We Scaled HyperLogLog: Three Real-World Optimizations</h1>
<p class="text-muted">by <span class="author text-uppercase">NELSON RAY AND FANGJIN YANG</span> · February 18, 2014</p>
<p>At Metamarkets, we specialize in converting mountains of programmatic ad data
into real-time, explorable views. Because these datasets are so large and
complex, we’re always looking for ways to maximize the speed and efficiency of
how we deliver them to our clients.  In this post, we’re going to continue our
discussion of some of the techniques we use to calculate critical metrics such
as unique users and device IDs with maximum performance and accuracy.</p>
<p>Approximation algorithms are rapidly gaining traction as the preferred way to
determine the unique number of elements in high cardinality sets. In the space
of cardinality estimation algorithms, HyperLogLog has quickly emerged as the
de facto standard. Widely discussed by <a href="http://research.google.com/pubs/pub40671.html">technology companies</a> and
<a href="http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html">popular blogs</a>, HyperLogLog trades
accuracy in data and query results for massive reductions in data storage and
vastly improved <a href="http://strataconf.com/stratany2013/public/schedule/detail/30045">system performance</a>.</p>
<p>In our <a href="http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data">previous</a> investigation of HyperLogLog, we briefly
discussed our motivations for using approximate algorithms and how we leveraged
HyperLogLog in <a href="http://druid.io/">Druid</a>, Metamarkets’ open source, distributed data
store. Since implementing and deploying HyperLogLog last year, we’ve made
several optimizations to further improve performance and reduce storage cost.
This post shares some of those optimizations. It assumes you are already
familiar with how HyperLogLog works; if you are not, there are plenty of
resources <a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf">online</a>.</p>
<h2 id="compacting-registers">Compacting Registers</h2>
<p>In our initial implementation of HLL, we allocated 8 bits of memory for each
register. Recall that each value stored in a register indicates the position of
the first ‘1’ bit of a hashed input. Given that 2^255 ≈ 10^76, a single 8-bit
register could approximate (not well, though) a cardinality close to the number
of atoms in the entire <a href="http://www.universetoday.com/36302/atoms-in-the-universe/">observable universe</a>. Martin
Traverso, et al., of <a href="https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920">Facebook’s Presto</a>, realized that this was a bit
wasteful and proposed an optimization, exploiting the fact that the registers
increment in near lockstep.</p>
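<p>As a refresher, here is a minimal sketch of a register update, assuming a
64-bit hash and 2,048 8-bit registers (the names are ours for illustration,
not Druid’s actual implementation):</p>
<pre><code>// Minimal sketch of an HLL register update (illustrative, not Druid's code).
public class SimpleHll
{
  static final int INDEX_BITS = 11;                  // 2^11 = 2,048 registers
  static final int NUM_REGISTERS = 1 &lt;&lt; INDEX_BITS;

  final byte[] registers = new byte[NUM_REGISTERS];  // 8 bits per register

  public void add(long hash)
  {
    // The low bits of the hash pick a register...
    int bucket = (int) (hash &amp; (NUM_REGISTERS - 1));
    // ...and the register keeps the max position of the first '1' bit
    // seen in the remaining bits (1-based).
    long rest = hash &gt;&gt;&gt; INDEX_BITS;
    int position = Long.numberOfTrailingZeros(rest) + 1;
    if (position &gt; registers[bucket]) {
      registers[bucket] = (byte) position;
    }
  }
}
</code></pre>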
<p>Each register is initialized to 0, so with 0 uniques there is no change in
any of the registers. Say we have 8 registers. Then with 8 * 2^10 uniques, each
register will hold a value of roughly 10. Of course, there will be some
variance, which can be calculated exactly if one were so inclined, since the
value in each register is the maximum of independent
<a href="http://en.wikipedia.org/wiki/Negative_binomial_distribution">Negative Binomial</a>(1, 0.5) draws.</p>
<p>With 4-bit registers, each register can only approximate up to 2^15 = 32,768
uniques. In fact, the reality is worse because the higher numbers cannot be
represented and are lost, impacting accuracy. Even with 2,048 registers, we
can’t do much better than ~60M, which is one or two orders of magnitude lower
than what we need.</p>
<p>Since the register values tend to increase together, the Facebook team
decided to introduce an offset counter and only store positive differences
from it in the registers. That is, if we have register values of 8, 7, and 9,
this corresponds to having an offset of 7 and register difference values of 1,
0, and 2. Given the smallish spread that we expect to see, we typically won’t
observe a difference of more than 15 among register values. So we feel
comfortable using 2,048 4-bit registers with an 8-bit offset, for 1,025 bytes
of storage instead of the 2,048 bytes needed with no offset and 8-bit
registers.</p>
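<p>Here is a minimal sketch of that layout, assuming 2,048 4-bit registers
packed two per byte plus a single 8-bit offset (the structure is illustrative;
Presto’s and Druid’s actual code differ in the details):</p>
<pre><code>// Offset-compacted registers: 1,024 bytes of packed 4-bit deltas plus a
// 1-byte shared offset = 1,025 bytes total (illustrative sketch).
public class CompactRegisters
{
  static final int NUM_REGISTERS = 2048;

  byte offset = 0;                                    // shared baseline
  final byte[] packed = new byte[NUM_REGISTERS / 2];  // two registers per byte

  int get(int bucket)
  {
    int shift = (bucket % 2) * 4;
    return offset + ((packed[bucket / 2] &gt;&gt; shift) &amp; 0xF);
  }

  void set(int bucket, int value)
  {
    // Store only the positive difference from the offset, clipped to 4 bits.
    int delta = Math.max(0, Math.min(15, value - offset));
    int shift = (bucket % 2) * 4;
    int i = bucket / 2;
    packed[i] = (byte) ((packed[i] &amp; ~(0xF &lt;&lt; shift)) | (delta &lt;&lt; shift));
  }

  // Once every register's delta is positive, the offset can be bumped and
  // all deltas shifted down, reclaiming headroom (omitted for brevity).
}
</code></pre>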
<p>In fact, others have commented on the concentrated distribution of the register
values as well. In her <a href="http://algo.inria.fr/durand/Articles/these.ps">thesis</a>, Marianne Durand suggested using
a variable bit prefix encoding. Researchers at <a href="http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf">Google</a> have had
success with difference encodings and variable length encodings.</p>
<h3 id="problem">Problem</h3>
<p>This optimization has served us well, with no appreciable loss in accuracy when
streaming many uniques into a single HLL object, because the offset increments
when all the registers get hit. Similarly, we can combine many HLL objects of
moderate size together and watch the offsets increase. However, a curious
phenomenon occurs when we try to combine many “small” HLL objects together.</p>
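<p>To see why, consider what a union looks like. Continuing the
CompactRegisters sketch above, merging two objects takes the register-wise max
of the true (offset + delta) values:</p>
<pre><code>// Union of two HLL objects (continuing the illustrative sketch above):
// take the register-wise max of the true values. Note that set() still
// clips each result to this object's offset + 15.
void union(CompactRegisters other)
{
  for (int bucket = 0; bucket &lt; NUM_REGISTERS; bucket++) {
    set(bucket, Math.max(get(bucket), other.get(bucket)));
  }
}
</code></pre>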
<p>Suppose each HLL object stores a single unique value. Then its offset will be
0, one register will have a value between 1 and 15, and the remaining registers
will be 0. No matter how many of these we combine together, our aggregate HLL
object will never be able to exceed a value of 15 in each register with a 0
offset, which is equivalent to an offset of 15 with 0’s in each register. Using
2,048 registers, this means we won’t be able to produce estimates greater than
~0.7 * 2048^2 / (2048 * 2^-15) ≈ 47M (<a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf"><em>Flajolet, et al. 2007</em></a>).</p>
<p>Not good, because this means our estimates are capped at 10^7 instead of 10^80,
irrespective of the number of true uniques. And this isn’t just some
pathological edge case: its untimely appearance in production a while ago was
no fun to fix.</p>
<h3 id="floating-max">Floating Max</h3>
<p>The root problem in the above scenario is that the high values (&gt; 15) are
being clipped, with no hope of making it into a “small” HLL object, since the
offset is 0. Although they are rare, many cumulative misses can have a
noticeably large effect. Our solution involves storing one additional pair, a
“floating max” bucket with higher resolution. Previously, a value of 20 in
bucket 94 would be clipped to 15. Now, we store (20, 94) as the floating max,
requiring at most an additional 2 bytes, bringing our total up to 1027 bytes.
With enough small HLL objects so that each position is covered by a floating
max, the combined HLL object can exceed the previous limit of 15 in each
position. It also turns out that just one floating max is sufficient to largely
fix the problem.</p>
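<p>Continuing the sketch above, the floating max is just one extra
(bucket, value) pair kept at full resolution:</p>
<pre><code>// Floating max (illustrative sketch): one (bucket, value) pair stored at
// full 8-bit resolution, so a value that would be clipped is not lost.
int maxBucket = -1;  // register index of the floating max; -1 if unset
int maxValue = 0;    // unclipped register value

void setWithFloatingMax(int bucket, int value)
{
  if (value - offset &gt; 15) {
    // This value would be clipped; remember it exactly in the floating max.
    if (value &gt; maxValue) {
      maxBucket = bucket;
      maxValue = value;
    }
  }
  set(bucket, value);  // the packed register is still clipped as before
}
</code></pre>
<p>During a union, each object’s floating max pair is fed in like any other
(bucket, value) pair, so values above the clip can survive aggregation.</p>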
<p>Let’s take a look at one measure of the accuracy of our approximations. We
simulated 1,000 runs of streaming 1B uniques into an HLL object and looked at
the proportion of cases in which we observed clipping, with the offset
approximation alone (black) and with the addition of the floating max (red).
At 1e9 uniques, the floating max reduced clipping from over 95% of runs to
roughly 15%. In other words, with the floating max, the much smaller HLL
objects agreed with full 8-bit HLL in 85% of cases, versus fewer than 5% of
cases without it.</p>
<p><img src="http://metamarkets.com/wp-content/uploads/2014/02/FJblogpost-600x560.png" alt="Clipping on Cardinality" title="Clipping on Cardinality"></p>
<p>For the cost of only 2 bytes, the floating max register allowed us to union
millions of HLL objects with minimal measurable loss in accuracy.</p>
<h2 id="sparse-and-dense-storage">Sparse and Dense Storage</h2>
<p>We first discussed the concept of representing HLL buckets in either a sparse
or dense format in our <a href="http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data">first blog post</a>. Since that time,
Google has also written a <a href="http://research.google.com/pubs/pub40671.html">great paper</a> on the matter. Data undergoes
a <a href="/blog/2011/05/20/druid-part-deux.html">summarization process</a> when it is ingested in Druid. It is
unnecessarily expensive to store raw event data and instead, Druid rolls
ingested data up to some time granularity.</p>
<p><img src="https://lh6.googleusercontent.com/O2YefUQdRdmCTXzh6xdxthD0VJY0Vq96DTXkhhPVAL_JXaJ1JuAWfFaxZDSmf9NDZgrmHS61RMFLqivacqsOw7evy1Ff73KNb1MdjoLchpCwc-YE8d9eCLiAAA" alt=""></p>
<p>In practice, we see tremendous reductions in data volume by summarizing our
<a href="http://strataconf.com/stratany2013/public/schedule/detail/30045">data</a>. For a given summarized row, we can maintain HLL objects
where each object represents the estimated number of unique elements for a
column of that row.</p>
<p>When the summarization granularity is sufficiently small, only a limited number
of unique elements may be seen for a dimension. In this case, a given HLL
object may have registers that contain no values. The HLL registers are thus
‘sparsely’ populated.</p>
<p>Our normal storage representation of HLL stores 2 register values per byte. In
the sparse representation, we instead store the explicit indexes of buckets
that have valid values in them as (index, value) pairs. When the sparse
representation exceeds the size of the normal or ‘dense’ representation (1027
bytes), we can switch to using only the dense representation. Our actual
implementation uses a heuristic to determine when this switch occurs, but the
idea is the same. In practice, many dimensions in real world data sets are of
low cardinality, and this optimization can greatly reduce storage versus only
storing the dense representation.</p>
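<p>A sketch of the switching logic, assuming 3 bytes per sparse (index, value)
pair (the exact encoding and threshold are illustrative):</p>
<pre><code>import java.util.TreeMap;

// Sparse representation (illustrative sketch): explicit (index, value)
// pairs, upgraded to the 1,027-byte dense form once the pairs would
// occupy at least as much space.
public class SparseHll
{
  static final int DENSE_SIZE_BYTES = 1027;
  static final int BYTES_PER_PAIR = 3;  // assumed: 2-byte index + 1-byte value

  final TreeMap&lt;Integer, Integer&gt; pairs = new TreeMap&lt;&gt;();

  void set(int bucket, int value)
  {
    pairs.merge(bucket, value, Math::max);  // keep the max per register
  }

  boolean shouldSwitchToDense()
  {
    return pairs.size() * BYTES_PER_PAIR &gt;= DENSE_SIZE_BYTES;
  }
}
</code></pre>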
<h2 id="faster-lookups">Faster Lookups</h2>
<p>One of the simpler optimizations that we implemented for faster cardinality
calculations was to use lookups for register values. Instead of computing the
actual register value by summing the register offset with the stored register
value, we instead perform a lookup into a precalculated map. Similarly, to
determine the number of zeros in a register value, we created a secondary
lookup table. Given the number of registers we have, the cost of storing these
lookup tables is near trivial. This problem is often known as the <a href="http://en.wikipedia.org/wiki/Hamming_weight">Hamming
Weight problem</a>.</p>
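<p>For instance, the harmonic sum at the heart of the HLL estimate can be
driven entirely by a precomputed table (a sketch under the 4-bit register
layout above; the names are ours):</p>
<pre><code>// Precomputed 2^-i for every possible (offset + register) sum, so the
// estimate avoids per-register arithmetic (illustrative sketch).
static final double[] INV_POW_2 = new double[256 + 16];
static {
  for (int i = 0; i &lt; INV_POW_2.length; i++) {
    INV_POW_2[i] = Math.pow(2.0, -i);
  }
}

static double harmonicSum(byte[] packed, int offset)
{
  double sum = 0.0;
  for (byte b : packed) {
    sum += INV_POW_2[offset + (b &amp; 0xF)];         // low nibble
    sum += INV_POW_2[offset + ((b &gt;&gt; 4) &amp; 0xF)];  // high nibble
  }
  return sum;
}
</code></pre>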
<h2 id="lessons">Lessons</h2>
<p>Many of our optimizations came out of necessity, both to provide the
interactive query latencies that Druid users have come to expect, and to keep
our storage costs reasonable. If you have any further improvements to our
optimizations, please share them with us! We strongly believe that as data sets
grow ever larger, estimation algorithms are key to keeping query times
acceptable. The approximate algorithm space remains relatively new, but it is
something we can build together.</p>
<p>For more information on Druid, please visit <a href="http://druid.io/">druid.io</a> and follow
<a href="https://twitter.com/druidio">@druidio</a>. We’d also like to thank Eric Tschetter and Xavier Léauté
for their contributions to this work. Featured image courtesy of <a href="http://donasdays.blogspot.com/2012/10/are-you-sprinter-or-long-distance-runner.html">Donna L
Martin</a>.</p>
</div>
</div>
</div>
</div>
</body>
</html>