blob: ec506a8b14ee2b7dab9fbc3424c8319c55427646 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Scaling the Druid Data Store</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
<div class="title-spacer"></div>
<img class="title-image" src="http://metamarkets.com/wp-content/uploads/2012/01/scaling2.jpg" alt="Scaling the Druid Data Store"/>
</div>
</div>
</div>
</div>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>Scaling the Druid Data Store</h1>
<p class="text-muted">by <span class="author text-uppercase">Eric Tschetter</span> · January 19, 2012</p>
<blockquote>
<p><em>“Give me a lever long enough… and I shall move the world”</em>
— Archimedes</p>
</blockquote>
<p>Parallelism is computing’s leverage, a force multiplier acting against the
weight of big data. Cloud-hosted, horizontally scalable systems have the power
to move even planetary sized data sets with speed.</p>
<p>This blog post discusses our efforts to lift one such data set, achieving a
scan rate of 26 billions records per second, with our distributed, in-memory
data store called Druid. Our main conclusions are:</p>
<p>Horizontally-scalable architectures are an ideal fit for the Cloud Our data
store’s performance scales up well to a 6TB in-memory cluster and degrades
gracefully under memory pressure The flexibility of a Cloud environment enables
pain-free tuning of cost versus performance Benchmarking our infrastructure
against a big data set in the wild provides validation of the power achievable
on a Cloud computing fabric of commodity hardware.</p>
<p>For those who are curious as to what our infrastructure powers, Metamarkets
offers a SaaS analytics solution to gaming, social, and digital media firms. A
public example is our dashboard for exploring Wikipedia edits.</p>
<h3 id="i-the-data">I) The Data</h3>
<p>We began our experiment with 6TB of uncompressed data, representing tens of
billions of fact rows, which we aimed to host and make fully explorable through
our dashboard. By way of comparison, the Wikipedia edit feed we host consists
of 6GB of uncompressed data, representing ~36 million fact rows.</p>
<p>The first hurdle to overcome with a data set of this scale is co-locating the
data with the compute power. Most of the trillions events we’ve analyzed on
our platform have been delivered over months of parallel, continuous feeds. In
rare cases, we have had to transform the data locally and sneaker-net the disks
to our data center. Pushing terabytes over a standard office uplink can take
weeks.</p>
<p>Once on the cloud, we performed some cardinality analysis to make sure we
understood the parameters of the data. There were more than a dozen
dimensions, with cardinalities ranging from tens of millions, to hundreds of
thousands, all the way down to tens. This kind of Zipfian distribution in
cardinalities is common in naturally occurring data. We then computed four
metrics for each row (consisting of counts, sums, and averages) and loaded the
data up into Druid.</p>
<p>We sharded the data into chunks and then sub-sharded those chunks by the
dimension with cardinality &gt;&gt; 1M, creating thousands of shards of roughly 8M
fact rows apiece.</p>
<h3 id="ii-the-cluster">II) The Cluster</h3>
<p>We then spun up a cluster of compute nodes to load the data up and keep it in
memory for querying. The cluster consisted of 100 nodes, each with 16 cores,
60GB of RAM, 10 GigE ethernet, and 1TB of disk space. So, collectively the
cluster comprised 1600 cores, 6TB of RAM, fast ethernet and more than enough
disk space.</p>
<p>With this first cluster, we were successful in delivering an interactive
experience on our front-end dashboard, scanning billions of records per second,
as the benchmarks below attest.</p>
<p>During the course of our testing, we also reconfigured the cluster in multiple
different ways, switching from pure in memory to using memory mapping and
pulling back the number of servers to see how performance degrades as we
changed the ratio of data served to available RAM.</p>
<h3 id="iii-the-benchmarks">III) The Benchmarks</h3>
<p>First, we’ll provide some benchmarks for our 100-node configuration on simple
aggregation queries. SQL is included to describe what the query is doing.</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>Select count(*) from _table_ where timestamp &gt;= ? and timestamp &lt; ?
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 26,610,386,635 17,740,258
15-core, 75 nodes, mmap 25,224,873,928 22,422,110
15-core, 50 nodes, mmap 20,387,152,160 27,182,870
15-core, 25 nodes, mmap 11,910,388,894 31,761,037
4-core, 131 nodes, in-memory 10,008,730,163 19,100,630
4-core, 131 nodes, mmap 10,129,695,120 19,331,479
4-core, 50 nodes, mmap 6,626,570,688 33,132,853
</code></pre></div>
<ul>
<li>The timestamp range encompasses all data.</li>
<li>15-core is a 16-core machine with 60GB RAM and 1TB of local disk. The machine was configured to only use 15
threads for processing queries.</li>
<li>4-core is a 4-core machine with 32GB RAM and 1TB of local disk.</li>
<li>in-memory means that the machine was configured to load all data up into the Java heap and have it available for querying</li>
<li>mmap means that the machine was configured to mmap the data instead of load it into the Java heap</li>
</ul>
<p><br/></p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>Select count(*), sum(metric1) from _table_ where timestamp &gt;= ? and timestamp &lt; ?
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 16,223,081,703 10,815,388
15-core, 75 nodes, mmap 9,860,968,285 8,765,305
15-core, 50 nodes, mmap 8,093,611,909 10,791,483
15-core, 25 nodes, mmap 4,126,502,352 11,004,006
4-core, 131 nodes, in-memory 5,755,274,389 10,983,348
4-core, 131 nodes, mmap 5,032,185,657 9,603,408
4-core, 50 nodes, mmap 1,720,238,609 8,601,193
Select count(*), sum(metric1), sum(metric2), sum(metric3), sum(metric4)
where timestamp &gt;= ? and timestamp &lt; ?
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 7,591,604,822 5,061,070
15-core, 75 nodes, mmap 4,319,179,995 3,839,271
15-core, 50 nodes, mmap 3,406,554,102 4,542,072
15-core, 25 nodes, mmap 1,826,451,888 4,870,538
4-core, 131 nodes, in-memory 1,936,648,601 3,695,894
4-core, 131 nodes, mmap 2,210,367,152 4,218,258
4-core, 50 nodes, mmap 1,002,291,562 5,011,458
</code></pre></div>
<p>The first query is just a count and we see the best performance out of our
system with it, achieving scan rates of 33M rows/second/core. At first glance
it looks like fewer nodes might actually be outperforming more nodes in the
rows/sec/core metric, but that’s just because 100 nodes is overprovisioned for
the data set. Druid’s concurrency model is based on shards, one thread will
scan one shard. If a node has 15 cores, for example, and handles a query that
requires scanning 16 shards, if we assume each shard takes 1 second to process
the total time to finish the query will be 2 seconds (1 second for the first 15
shards and 1 second for the 16th shard), decreasing the global scan rate
because there are actually a number of cores that are idle.</p>
<p>As we move on to include more aggregations we see performance degrade. This is
because of the column-oriented storage format Druid employs. For the <code>count *</code>
queries, it only has to check the timestamp column to satisfy the where clause.
As we add metrics, it has to also load those metric values and scan over them,
increasing the amount of memory scanned. Next, we’ll do a top 100 query on our
high cardinality dimension:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>Select high_card_dimension, count(*) AS cnt from _table_ where timestamp &gt;= ?
and timestamp &lt; ? group by high_card_dimension order by cnt limit 100;
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 10,241,183,745 6,827,456
15-core, 75 nodes, mmap 4,891,097,559 4,347,642
15-core, 50 nodes, mmap 3,616,707,511 4,822,277
15-core, 25 nodes, mmap 1,665,053,263 4,440,142
4-core, 131 nodes, in-memoy 4,388,159,569 8,374,350
4-core, 131 nodes, mmap 2,444,344,232 4,664,779
4-core, 50 nodes, mmap 1,215,737,558 6,078,688
Select high_card_dimension, count(*), sum(metric1) AS cnt from _table_
where timestamp &gt;= ? and timestamp &lt; ? group by high_card_dimension order by
cnt limit 100;
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 7,309,984,688 4,873,323
15-core, 75 nodes, mmap 3,333,628,777 2,963,226
15-core, 50 nodes, mmap 2,555,300,237 3,407,067
15-core, 25 nodes, mmap 1,384,674,717 3,692,466
4-core, 131 nodes, in-memory 3,237,907,984 6,179,214
4-core, 131 nodes, mmap 1,740,481,380 3,321,529
4-core, 50 nodes, mmap 863,170,420 4,315,852
Select high_card_dimension, count(*), sum(imetric1), sum(metric2),
sum(metric3), sum(metric4) AS cnt from _table_ where timestamp &gt;= ? and
timestamp &lt; ? group by high_card_dimension order by cnt limit 100;
cluster cluster scan rate (rows/sec) core scan rate
15-core, 100 nodes, in-memory 4,064,424,274 2,709,616
15-core, 75 nodes, mmap 2,014,067,386 1,790,282
15-core, 50 nodes, mmap 1,499,452,617 1,999,270
15-core, 25 nodes, mmap 810,143,518 2,160,383
4-core, 131 nodes, in-memory 1,670,214,695 3,187,433
4-core, 131 nodes, mmap 1,116,635,690 2,130,984
4-core, 50 nodes, mmap 531,389,163 2,656,946
</code></pre></div>
<p>Here we see the superior performance of the in-memory representation when doing
top lists versus when doing simple time-based aggregations. This is an
implementation detail, but it’s largely because of the differences in accessing
simple in-memory pointers, versus scanning and seeking through a flattened data
structure (even though it is already largely paged into memory).</p>
<h3 id="iv-conclusions">IV) Conclusions</h3>
<p>Our conclusions are three-fold. First, we demonstrate that is possible to
provide real-time, fully interactive exploration of 6TB of data with a
distributed, cloud-hosted commodity hardware.</p>
<p>Second, we highlight the flexibility offered by the cloud. Letting us stick to
our core engineering competencies and having someone else deal with the
overhead of running an actual data center is huge. The fact that we were able
to spin up 100 machines, run our benchmarks, kill 25, wait a bit, run
benchmarks, kill another 25, wait a bit, run benchmarks, rinse and repeat was
just awesome.</p>
<p>Finally, designing an architecture that horizontally scales for performance
opens up a set of nobs of cost versus performance. If we can tolerate response
times of 10 seconds instead of 1 second, we can pay less for our processing.
If we can tolerate response times of 1 minute, we pay even less. Conversely,
if we need answers in milliseconds, this is achievable at a higher price point.</p>
<h3 id="v-using-druid">V) Using Druid</h3>
<p>We currently offer Druid as a hosted service, but are exploring steps to open
up the platform to a developer community. If you would like to explore either
using our hosted service or being part of a developer community, please drop us
a note.</p>
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="Download via Apache" href="https://www.apache.org/dyn/closer.cgi?path=/incubator/druid/0.16.1-incubating/apache-druid-0.16.1-incubating-bin.tar.gz" target="_blank"><span class="fas fa-feather"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/incubator-druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2019 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
</body>
</html>