<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Introducing Druid</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="/assets/css/font-awesome-5.css">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
<div class="title-spacer"></div>
<img class="title-image" src="http://metamarkets.com/wp-content/uploads/2012/10/Druid.jpg" alt="Introducing Druid"/>
</div>
</div>
</div>
</div>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>Introducing Druid</h1>
<p class="text-muted">by <span class="author text-uppercase">Eric Tschetter</span> · October 24, 2012</p>
<p>In <a href="http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/">April 2011</a>,
we introduced Druid, our distributed, real-time data store. Today I am
extremely proud to announce that we are releasing the Druid data store to the
community as an open source project. To mark this special occasion, I wanted to
recap why we built Druid, and why we believe there is broader utility for Druid
beyond <a href="http://metamarkets.com/2012/metamarkets-open-sources-druid/metamarkets.com/product">Metamarkets&#39; analytical SaaS offering</a>.</p>
<p>When we started to build Metamarkets’ analytics solution, we tried several
commercially available data stores. Our requirements were driven by our online
advertising customers, whose data volumes often exceed hundreds of billions of
events per month. They need highly interactive queries on the latest data, as
well as the ability to filter arbitrarily across any dimension, in data sets
that contain 30 dimensions or more. For example, a typical query
might be “find me how many advertisements were seen by female executives, aged
35 to 44, from the US, UK, and Canada, reading sports blogs on weekends.”</p>
<p>First, we went the traditional database route. Companies have successfully used
data warehouses to manage the cost and overhead of querying historical data,
and the architecture aligned with our goals of data aggregation and drill down.
For our data volumes (reaching billions of records), however, we found that the
scan rates were not fast enough to support our interactive dashboard, and
caching could not reliably speed up queries because of the arbitrary
drill-downs we needed to support. In addition, because RDBMS data updates are
inherently batch-oriented, updates made via inserts lock rows and block the
queries that read them.</p>
<p>Next, we investigated a NoSQL architecture. To support our use case of allowing
users to drill down on arbitrary dimensions, we pre-computed dimensional
aggregations and wrote them into a NoSQL key-value store. While this approach
provided fast query times, pre-aggregations required hours of processing time
for just millions of records (on a ~10-node Hadoop cluster). More
problematically, as the number of dimensions increased, the aggregation and
processing time increased exponentially, exceeding 24 hours. Beyond its cost,
this processing time created unacceptably high latency between when events
occurred and when they became available for querying, ruling out the real-time
visibility into their data that our customers wanted.</p>
<p>We thus decided to build Druid, to better meet the needs of analytics workloads
requiring fast, real-time access to data at scale.</p>
<p>Druid’s key features are:</p>
<ul>
<li><p><strong>Distributed architecture.</strong> Swappable read-only data segments using an MVCC
swapping protocol. Per-segment replication relieves load on hot segments.
Supports both in-memory and memory-mapped versions.</p></li>
<li><p><strong>Real-time ingestion.</strong> Real-time ingestion coupled with broker servers to
query across real-time and historical data. Automated migration of real-time to
historical as it ages.</p></li>
<li><p><strong>Column-oriented for speed.</strong> Data is laid out in columns so that scans are
limited to the columns a query actually touches. Compression decreases the
overall data footprint.</p></li>
<li><p><strong>Fast filtering.</strong> Bitmap indices with CONCISE compression.</p></li>
<li><p><strong>Operational simplicity.</strong> Fault tolerant due to replication. Supports
rolling deployments and restarts. Allows simple scale up and scale down – just
add or remove nodes.</p></li>
</ul>
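<p>To illustrate the bitmap-index idea behind fast filtering, here is a minimal
sketch in Python. It is not Druid code: the sample rows, the column names, and
the helpers <code>build_bitmap_index</code> and <code>matching_row_ids</code> are
hypothetical, and the bitmaps are plain uncompressed integers rather than
CONCISE-compressed sets. It only shows how a Boolean filter across several
dimensions reduces to bitwise AND and OR over per-value bitmaps.</p>

```python
# Illustrative sketch only: uncompressed bitmap indexes held in Python ints.
# The rows, column names, and helper functions are hypothetical, not Druid code.

ROWS = [
    {"gender": "female", "country": "US", "category": "sports"},
    {"gender": "male",   "country": "UK", "category": "sports"},
    {"gender": "female", "country": "CA", "category": "news"},
    {"gender": "female", "country": "UK", "category": "sports"},
    {"gender": "female", "country": "DE", "category": "sports"},
]


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_row_ids(bitmap):
    """Return the row ids whose bits are set in `bitmap`, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


gender = build_bitmap_index(ROWS, "gender")
country = build_bitmap_index(ROWS, "country")
category = build_bitmap_index(ROWS, "category")

# Filter: gender = female AND country IN (US, UK, CA) AND category = sports
in_countries = country.get("US", 0) | country.get("UK", 0) | country.get("CA", 0)
selected = gender.get("female", 0) & in_countries & category.get("sports", 0)

print(matching_row_ids(selected))  # prints [0, 3]
```

<p>Because each filter clause becomes a bitmap lookup, adding another dimension
to the filter costs one more bitwise operation instead of another pass over
the rows.</p>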
<p>From a query perspective, Druid supports arbitrary Boolean filters as well as
Group By, time series roll-ups, aggregation functions and regular expression
searches.</p>
<p>In terms of performance, Druid scans 33M rows per second per core and can
ingest up to 10K incoming records per second per node. We have
horizontally scaled Druid to support <a href="http://metamarkets.com/2012/scaling-druid/">scan speeds of 26B records per
second</a>.</p>
<p>Now that more people have hands-on experience with Hadoop, there is a
broadening realization that while it is ideal for batch processing of large
data volumes, tools for real-time data queries are lacking. Hence there is
growing interest in tools like Google’s Dremel and PowerDrill, as evidenced by
the new Apache Drill project. We believe that Druid addresses a gap in the
existing big data ecosystem for a real-time analytical data store, and we are
excited to make it available to the open source community.</p>
<p>Metamarkets has engaged with multiple large internet properties like Netflix,
providing early access to the code for evaluation purposes. Netflix is
assessing Druid for operational monitoring of real-time metrics across their
streaming media business.</p>
<p>Sudhir Tonse, Manager, Cloud Platform Infrastructure at Netflix, says, “Netflix manages
billions of streaming events each day, so we need a highly scalable data store
for operational reporting. We are so far impressed with the speed and
scalability of Druid, and are continuing to evaluate it for providing critical
real-time transparency into our operational metrics.”</p>
<p>Metamarkets anticipates that open sourcing Druid will also help other
organizations solve their real-time data analysis and processing needs. We are
excited to see how the open source community benefits from using Druid in their
own applications, and hopeful that Druid improves through their feedback and
usage.</p>
<p>Druid is available for download on GitHub at <a href="https://github.com/metamx/druid">https://github.com/metamx/druid</a>,
and more information can be found on the <a href="http://metamarkets.com/druid">Druid project
website</a>.</p>
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
</body>
</html>