blob: 038e5e5d32e2fdd8a69f86fad0af1c963fd91943 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Introducing Druid: Real-Time Analytics at a Billion Rows Per Second</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
<!-- Start page_header include -->
<script src="//"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<div class="header-dropdown-menu">
<a href="" target="_blank">Foundation</a>
<a href="" target="_blank">Events</a>
<a href="" target="_blank">License</a>
<a href="" target="_blank">Thanks</a>
<a href="" target="_blank">Security</a>
<a href="" target="_blank">Sponsorship</a>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
function hideMenu() {
$(window).resize(function() {
if ($(window).width() >= 840) {
else {
<!-- Stop page_header include -->
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
<div class="title-spacer"></div>
<img class="title-image" src="" alt="Introducing Druid: Real-Time Analytics at a Billion Rows Per Second"/>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>Introducing Druid: Real-Time Analytics at a Billion Rows Per Second</h1>
<p class="text-muted">by <span class="author text-uppercase">Eric Tschetter</span> · April 30, 2011</p>
<p>Here at Metamarkets we have developed a web-based analytics console that
supports drill-downs and roll-ups of high dimensional data sets – comprising
billions of events – in real-time. This is the first of two blog posts
introducing Druid, the data store that powers our console. Over the last twelve
months, we tried and failed to achieve scale and speed with relational databases
(Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did
something crazy: we rolled our own database. Druid is the distributed, in-memory
OLAP data store that resulted.</p>
<p><strong>The Challenge: Fast Roll-Ups Over Big Data</strong></p>
<p>To frame our discussion, let’s begin with an illustration of what our raw impression event logs look
like, containing many dimensions and two metrics (click and price).</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>timestamp publisher advertiser gender country dimensions click price
2011-01-01T01:01:35Z Male USA 0 0.65
2011-01-01T01:03:63Z Male USA 0 0.62
2011-01-01T01:04:51Z Male USA 1 0.45
2011-01-01T01:00:00Z Female UK 0 0.87
2011-01-01T02:00:00Z Female UK 0 0.99
2011-01-01T02:00:00Z Female UK 1 1.53
<p>We call this our <em>alpha</em> data set. We perform a first-level aggregation operation over a selected set of
dimensions, equivalent to (in pseudocode):</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>GROUP BY timestamp, publisher, advertiser, gender, country
:: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)
<p>to yield a compacted version:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span> timestamp publisher advertiser gender country impressions clicks revenue
2011-01-01T01:00:00Z Male USA 1800 25 15.70
2011-01-01T01:00:00Z Male USA 2912 42 29.18
2011-01-01T02:00:00Z Male UK 1953 17 17.31
2011-01-01T02:00:00Z Male UK 3194 170 34.01
<p>This is our <em>beta</em> data set, filtered for five selected dimensions and compacted. In the limit, as we group
by all available dimensions, the size of this aggregated beta converges to the original <em>alpha</em>. In practice,
it is dramatically smaller (often by a factor of 100). Our <em>beta</em> data comprises three distinct parts:</p>
<p><strong>Timestamp column</strong>: We treat timestamp separately because all of our queries
center around the time axis. Timestamps are faceted by varying granularities
(hourly, in the example above).</p>
<p><strong>Dimension columns</strong>: Here we have four dimensions of publisher, advertiser,
gender, and country. They each represent an axis of the data that we’ve chosen
to slice across.</p>
<p><strong>Metric columns</strong>: These are impressions, clicks and revenue. These represent
values, usually numeric, which are derived from an aggregation operation – such
as count, sum, and mean (we also run variance and higher moment calculations).
For example, in the first row, the revenue metric of 15.70 is the sum of 1800
event-level prices.</p>
<p>Our goal is to rapidly compute drill-downs and roll-ups over this data set. We
want to answer questions like “How many impressions from males were on” and “What is the average cost to advertise to women at” But we have a hard requirement to meet: we want queries
over any arbitrary combination of dimensions at sub-second latencies.</p>
<p>Performance of such a system is dependent on the size of our beta set, and
there are two ways that this becomes large: (i) when we include additional
dimensions, and (ii) when we include a dimension whose cardinality is large.
Using our example, for every hour’s worth of data we calculate the maximum
number of rows as:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>number_of_publishers * number_of_advertisers * number_of_genders * number of countries
<p>If we have 10 publishers, 50 advertisers, 2 genders, and 120 countries, that
would yield a maximum of 120,000 rows. If there had been 1,000,000 possible
publishers, it would become a maximum of 12 billion rows. If we add 10 more
dimensions of cardinality 10, then it becomes a maximum of 1.2 quadrillion (1.2
x 10^15) rows.</p>
<p>Luckily for us, these data sets are generally sparse, as dimension values are
not conditionally independent (few Kazakhstanis visit, for
example). Thus the combinatorial explosion is far less than the theoretical
worst-case. Nonetheless, as a rule, more dimensions and more cardinality
dramatically inflate the size of the data set.</p>
<p><strong>Failed Solution I: Dynamic Roll-Ups with a RDBMS</strong></p>
<p>Our stated goals of aggregation and drill-down are well suited to a classical
relational architecture. So about a year ago, we fired up a RDBMS instance
(actually, the Greenplum Community Edition, running on an m1.large EC2 box),
and began loading our data into it. It worked and we were able to build the
initial version of our console on this system. However, we had two problems:</p>
<li><p>We stored the data in a star schema, which meant that there was operational
overhead maintaining dimension and fact tables.</p></li>
<li><p>Whenever we needed to do a full table scan, for things like global counts,
the queries ran slow. For example, naive benchmarks showed scanning 33
million rows took 3 seconds.</p></li>
<p>We initially just decided to eat the operational overhead of (1) because that’s
how these systems work and we benefited from having the database to do our
storage and computation. But, (2) was painful. We started materializing all
dimensional roll-ups of a certain depth, and began routing queries to these
pre-aggregated tables. We also implemented a caching layer in front of our
<p>This approach generally worked and is, I believe, a fairly common strategy in
the space. Except, when things weren’t in the cache and a query couldn’t be
mapped to a pre-aggregated table, we were back to full scans and slow
performance. We tried indexing our way out of it, but given that we are
allowing arbitrary combinations of dimensions, we couldn’t really take
advantage of composite indexes. Additionally, index merge strategies are not
always implemented, or only implemented for bitmap indexes, depending on the
flavor of RDBMS.</p>
<p>We also benchmarked plain Postgres, MySQL, and InfoBright, but did not observe
dramatically better performance. Seeing no path ahead for our relational
database, we turned to one of those new-fangled, massively scalable NOSQL
<p><strong>Failed Solution II: Pre-compute the World in NoSQL</strong></p>
<p>We used a data storage schema very similar to Twitter’s
<a href="">Rainbird</a>.</p>
<p>In short, we took all of our data and pre-computed aggregates for every
combination of dimensions. At query time we need only locate the specific
pre-computed aggregate and and return it: an O(1) key-value lookup. This made
things fast and worked wonderfully when we had a six dimension beta data set.
But when we added five more dimensions – giving us 11 dimensions total – the
time to pre-compute all aggregates became unmanageably large (such that we
never waited more than 24 hours required to see it finish).</p>
<p>So we decided to limit the depth that we aggregated to. By only pre-computing
aggregates of five dimensions or less, we were able to limit some of the
exponential expansion of the data. The data became manageable again, meaning it
only took about 4 hours on 15 machines to compute the expansion of a 500k beta
rows into the full multi-billion entry output data set.</p>
<p>Then we added three more dimensions, bringing us up to 14. This turned into 9
hours on 25 machines. We realized that we were <a href="">doing it
<p>Lesson learned: massively scalable counter systems like rainbird are intended
for high cardinality data sets with pre-defined hierarchical drill-downs. But
they break down when supporting arbitrary drill downs across all dimensions.</p>
<p><strong>Introducing Druid: A Distributed, In-Memory OLAP Store</strong></p>
<p>Stepping back from our two failures, let’s examine why these systems failed to
scale for our needs:</p>
<li>Relational Database Architectures
<li>Full table scans were slow, regardless of the storage engine used</li>
<li>Maintaining proper dimension tables, indexes and aggregate tables was painful</li>
<li>Parallelization of queries was not always supported or non-trivial</li>
<li>Massive NOSQL With Pre-Computation
<li>Supporting high dimensional OLAP requires pre-computing an exponentially large amount of data</li>
<p>Looking at the problems with these solutions, it looks like the first,
RDBMS-style architecture has a simpler issue to tackle: namely, how to scan
tables fast? When we were looking at our 500k row data set, someone remarked,
“Dude, I can store that in memory”. That was the answer.</p>
<p>Keeping everything in memory provides fast scans, but it does introduce a new
problem: machine memory is limited. The corollary thus was: distribute the data
over multiple machines. Thus, our requirements were:</p>
<li>Ability to load up, store, and query data sets in memory</li>
<li>Parallelized architecture that allows us to add more machines in order to relieve memory pressure</li>
<p>And then we threw in a couple more that seemed like good ideas:</p>
<li>Parallelized queries to speed up full scan processing</li>
<li>No dimensional tables to manage</li>
<p>These are the requirements we used to implement Druid. The system makes a
number of simplifying assumptions that fit our use case (namely that all
analytics are time-based) and integrates access to real-time and historical
data for a configurable amount of time into the past.</p>
<p>The <a href="">next
will go into the architecture of Druid, how queries work and how the system can
scale out to handle query hotspots and high cardinality data sets. For now, we
leave you with a benchmark:</p>
<li>Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.</li>
<p><a href="">CONTINUE TO PART II…</a></p>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
<div class="text-center">
<a title="Join the user group" href="!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="" target="_blank"><span class="fab fa-github"></span></a>
<div class="text-center license">
Copyright © 2020 <a href="" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
<script async src=""></script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
<script src="//"></script>
<script src="//"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->