blob: 0c7a7661f7040d096f35558bccd0f2dc3f7f6b81 [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Real Real-Time. For Realz.</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
<div class="title-spacer"></div>
<img class="title-image" src="http://metamarkets.com/wp-content/uploads/2013/05/Clocks.jpg" alt="Real Real-Time. For Realz."/>
</div>
</div>
</div>
</div>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>Real Real-Time. For Realz.</h1>
<p class="text-muted">by <span class="author text-uppercase">Eric Tschetter</span> · May 10, 2013</p>
<p><em>Danny Yuan, Cloud System Architect at Netflix, and I recently co-presented at
the Strata Conference in Santa Clara. <a href="http://www.youtube.com/watch?v=Dlqj34l2upk">The
presentation</a> discussed how Netflix
engineers leverage <a href="http://metamarkets.com/product/technology/">Druid</a>,
Metamarkets’ open-source, distributed, real-time, analytical data store, to
ingest 150,000 events per second (billions per day), equating to about 500MB/s
of data at peak (terabytes per hour) while still maintaining real-time,
exploratory querying capabilities. Before and after the presentation, we had
some interesting chats with conference attendees. One common theme from those
discussions was curiosity around the definition of “real-time” in the real
world and how Netflix could possibly achieve it at those volumes. This post is
a summary of the learnings from those conversations and a response to some of
those questions.</em></p>
<h3 id="what-is-real-time">What is Real-time?</h3>
<p>Real-time has become a heavily overloaded term so it is important to properly
define it. I will limit our discussion of the term to its usage in the data
space as it takes on different meanings in other arenas. In the data space, it
is now commonly used to refer to one of two kinds of latency: query latency and
data ingestion latency.</p>
<p>Query latency is the rate of return of queries. It assumes a static data set
and refers to the speed at which you can ask questions of that data set. Right
now, the vast majority of “real-time” systems are co-opting the word real-time
to refer to “fast query latency.” I do not agree with this definition of
“real-time” and prefer “interactive queries,” but it is the most prevalent use
of real-time and thus is worth noting.</p>
<p>Data ingestion latency is the amount of time it takes for an event to be
reflected in your query results. An example of this would be the amount of time
it takes from when someone visits your website to when you can run a query that
tells you about that person’s activity on your site. When that latency is close
to a few seconds, you feel like you are seeing what is going on right now or
that you are seeing things in “real-time.” This is what I believe most people
assume when they hear about “real-time data.” However, rapid data ingestion
latency is the lesser used definition due to of the lack of infrastructure to
support it at scale (tens of billions of events/terabytes of data per day),
while the infrastructure to support fast query latencies is easier to create
and readily available.</p>
<h3 id="what-s-considered-real-time">What’s Considered Real-Time?</h3>
<p>Okay, now that we have a definition of real-time and that definition depends on
latency, there’s the remaining question of which latencies are good enough to
earn the “real-time” moniker. The truth is that it’s up to interpretation. The
key point is that the people who see the output of the queries feel like they
are looking at what is going on “right now.” I don’t have any
scientifically-driven methods of understanding where this boundary is, but I do
have experience from interacting with customers at Metamarkets.</p>
<p>Conclusions first, descriptions second. To be considered real-time, query
latency must be below 5 seconds and data ingestion latency must be below 15
seconds.</p>
<h3 id="why-druid">Why Druid?</h3>
<p>Of course, in my infinite bias, I’m going to tell you about how Druid is able
to handle data ingestion latencies in the sub-15 second range. If I didn’t tell
you about that, then the blog post would be quite pointless. If you are
interested in how Druid is able to handle the query latency side of the
endeavor, please <a href="http://www.youtube.com/watch?v=eCbXoGSyHbg">watch the video</a>
from my October talk at Strata NY. I will continue with a discussion of the
data ingestion side of the story.</p>
<h3 id="how-does-druid-do-it">How does Druid do it?</h3>
<p>If you want to deeply understand Druid, then a great place to start is its
<a href="http://static.druid.io/docs/druid.pdf">whitepaper</a>
but we will provide a brief overview here of how the real-time ingestion piece
achieves its goals. Druid handles real-time data ingestion by having a separate
node type: the descriptively-named “real-time” node. Real-time nodes
encapsulate the functionality to ingest and query data streams. Therefore, data
indexed via these nodes is immediately available for querying. Typically, for
data durability purposes, a message bus such as
<a href="http://kafka.apache.org/">Kafka</a> sits between the event creation point and the
real-time node.</p>
<p>The purpose of the message bus is to act as a buffer for incoming events. In an
event stream, the message bus maintains offsets indicating the point a
real-time node has read up to. Then, the real-time nodes can update these
offsets periodically.</p>
<p>Real-time nodes pull data in from the message bus and buffer it in indexes that
do not hit disk. To minimize the impact of losing a node, the nodes will
persist their indexes to disk either periodically or after some maximum size
threshold is reached. After each persist, a real-time node updates the message
bus, informing it of everything it has consumed so far (this is done by
“committing the offset” in Kafka). If a real-time node fails and recovers, it
can simply reload any indexes that were persisted to disk and continue reading
the message bus from the point the last offset was committed.</p>
<p>Real-time nodes expose a consolidated view of the current and updated buffer
and of all of the indexes persisted to disk. This allows for a consistent view
of the data on the query side, while still allowing us to incrementally append
data. On a periodic basis, the nodes will schedule a background task that takes
all of the persisted indexes of a data source, merges them together to build a
segment and uploads it to deep storage. It then signals for the historical
compute nodes to begin serving the segment. Once the compute nodes load up the
data and start serving requests against it, the real-time node no longer needs
to maintain its older data. The real-time nodes then clean up the older segment
of data and begin work on their new segment(s). The intricate and ongoing
sequence of ingest, persist, merge, and handoff is completely fluid. The people
querying the system are unaware of what is going on behind the scenes and they
simply have a system that works.</p>
<h3 id="tl-dr-but-yet-you-somehow-made-it-to-the-end-of-the-post">TL;DR, but yet you somehow made it to the end of the post:</h3>
<p>A deep understanding of the problem, specifically the end-user’s expectations
and how that will affect their interactions, is key to designing a
technological solution to a problem. When dealing with transparency and
analytical needs for large quantities of data, the big questions around user
experience that must be answered are how soon data needs to be available and
how quickly queries need to return.</p>
<p>Hopefully this blog helped clarify the considerations around these two key
components and how infrastructure can be developed to handle it.</p>
<p>Lastly, the shameless plug for Druid: you should use Druid.</p>
<p>Druid is open source, you can download it and run it on your own infrastructure
for your own problems. If you are interested in learning more about Druid or
trying it out, the code is available on
<a href="https://github.com/metamx/druid">GitHub</a> and our <a href="https://github.com/metamx/druid/wiki">wiki with documentation is
available here</a>. Finally, to complete
the link soup at the bottom of our post,
<a href="http://www.youtube.com/watch?v=eCbXoGSyHbg">here is our introductory
presentation at Strata</a> and <a href="http://www.youtube.com/watch?v=Dlqj34l2upk">here
is our most recent Strata talk</a> with Danny about real-time in Santa Clara.</p>
<p><a href="http://www.flickr.com/photos/74586726@N00/4176786834/">Clocks photograph by Image Club Graphics via Sean
Turvey</a></p>
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="Download via Apache" href="https://www.apache.org/dyn/closer.cgi?path=/incubator/druid/0.16.1-incubating/apache-druid-0.16.1-incubating-bin.tar.gz" target="_blank"><span class="fas fa-feather"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/incubator-druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2019 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
</body>
</html>