blob: e8fd6af5235e5540428c3f23a12888f720610e0d [file] [log] [blame]
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="Apache Druid">
<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
<meta name="author" content="Apache Software Foundation">
<title>Druid | Understanding Druid Via Twitter Data</title>
<link rel="alternate" type="application/atom+xml" href="/feed">
<link rel="shortcut icon" href="/img/favicon.png">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">
<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
<link rel="stylesheet" href="/css/base.css?v=1.1">
<link rel="stylesheet" href="/css/header.css?v=1.1">
<link rel="stylesheet" href="/css/footer.css?v=1.1">
<link rel="stylesheet" href="/css/syntax.css?v=1.1">
<link rel="stylesheet" href="/css/docs.css?v=1.1">
<script>
(function() {
var cx = '000162378814775985090:molvbm0vggm';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
</head>
<body>
<!-- Start page_header include -->
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
<div class="top-navigator">
<div class="container">
<div class="left-cont">
<a class="logo" href="/"><span class="druid-logo"></span></a>
</div>
<div class="right-cont">
<ul class="links">
<li class=""><a href="/technology">Technology</a></li>
<li class=""><a href="/use-cases">Use Cases</a></li>
<li class=""><a href="/druid-powered">Powered By</a></li>
<li class=""><a href="/docs/latest/design/">Docs</a></li>
<li class=""><a href="/community/">Community</a></li>
<li class="header-dropdown">
<a>Apache</a>
<div class="header-dropdown-menu">
<a href="https://www.apache.org/" target="_blank">Foundation</a>
<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
<a href="https://www.apache.org/licenses/" target="_blank">License</a>
<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
<a href="https://www.apache.org/security/" target="_blank">Security</a>
<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
</div>
</li>
<li class=" button-link"><a href="/downloads.html">Download</a></li>
</ul>
</div>
</div>
<div class="action-button menu-icon">
<span class="fa fa-bars"></span> MENU
</div>
<div class="action-button menu-icon-close">
<span class="fa fa-times"></span> MENU
</div>
</div>
<script type="text/javascript">
var $menu = $('.right-cont');
var $menuIcon = $('.menu-icon');
var $menuIconClose = $('.menu-icon-close');
function showMenu() {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeIn(100);
}
$menuIcon.click(showMenu);
function hideMenu() {
$menu.fadeOut(100);
$menuIconClose.fadeOut(100);
$menuIcon.fadeIn(100);
}
$menuIconClose.click(hideMenu);
$(window).resize(function() {
if ($(window).width() >= 840) {
$menu.fadeIn(100);
$menuIcon.fadeOut(100);
$menuIconClose.fadeOut(100);
}
else {
$menu.fadeOut(100);
$menuIcon.fadeIn(100);
$menuIconClose.fadeOut(100);
}
});
</script>
<!-- Stop page_header include -->
<link rel="stylesheet" href="/css/blogs.css">
<div class="blog druid-header">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="title-image-wrap">
</div>
</div>
</div>
</div>
<div class="container blog">
<div class="row">
<div class="col-md-8 col-md-offset-2">
<div class="blog-entry">
<h1>Understanding Druid Via Twitter Data</h1>
<p class="text-muted">by <span class="author text-uppercase">Russell Jurney</span> · August 6, 2013</p>
<p>Druid is a rockin&#39; exploratory analytical data store capable of offering interactive query of big data in realtime - as data is ingested. Druid drives 10&#39;s of billions of events per day for the <a href="http://www.metamarkets.com">Metamarkets</a> platform, and Metamarkets is committed to building Druid in open source.</p>
<h2 id="about-druid">About Druid</h2>
<p>Thanks for taking an interest in Druid. This tutorial will help clarify some core Druid concepts. We will go through one of the Real-time examples and issue some basic Druid queries. The data source we&#39;ll be working with is the <a href="https://dev.twitter.com/docs/streaming-apis/streams/public">Twitter spritzer stream</a>. If you are ready to explore Druid, brave its challenges, and maybe learn a thing or two, read on!</p>
<h2 id="setting-up">Setting Up</h2>
<p>There are two ways to setup Druid: download a tarball, or build it from source.</p>
<h3 id="download-a-tarball">Download a Tarball</h3>
<p>We&#39;ve built a tarball that contains everything you&#39;ll need. You&#39;ll find it <a href="http://static.druid.io/artifacts/releases/druid-services-0.5.6-SNAPSHOT-bin.tar.gz">here</a>.
Download this to a directory of your choosing.</p>
<p>You can extract the awesomeness within by issuing:
<pre>tar -zxvf druid-services-0.5.6-SNAPSHOT-bin.tar.gz</pre></p>
<p>If you cd into the directory:
<pre>cd druid-services-0.5.6-SNAPSHOT</pre></p>
<p>You should see a bunch of files:
* run_example_server.sh
* run_example_client.sh
* run_ec2.sh
* LICENSE, config, examples, lib directories</p>
<h3 id="clone-and-build-from-source">Clone and Build from Source</h3>
<p>The other way to setup Druid is from source via git. To do so, run these commands:</p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>git clone git@github.com:metamx/druid.git
cd druid
./build.sh
</code></pre></div>
<p>You should see a bunch of files: </p>
<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>DruidCorporateCLA.pdf README common examples indexer pom.xml server
DruidIndividualCLA.pdf build.sh doc group_by.body install publications services
LICENSE client eclipse_formatting.xml index-common merger realtime
</code></pre></div>
<p>You can find the example executables in the examples/bin directory:
* run_example_server.sh
* run_example_client.sh</p>
<h2 id="running-example-scripts">Running Example Scripts</h2>
<p>Let&#39;s start doing stuff. You can start a Druid Realtime node by issuing:
<pre>./run_example_server.sh</pre></p>
<p>Select &quot;twitter&quot;. </p>
<p>You&#39;ll need to register a new application with the twitter API, which only takes a minute. Go to <a href="https://dev.twitter.com/apps/new">https://dev.twitter.com/apps/new</a> and fill out the form and submit. Don&#39;t worry, the home page and callback url can be anything. This will generate keys for the Twitter example application. Take note of the values for consumer key/secret and access token/secret.</p>
<p>Enter your credentials when prompted.</p>
<p>Once the node starts up you will see a bunch of logs about setting up properties and connecting to the data source. If you see crazy exceptions, you probably typed in your login information incorrectly. If the server started properly you will see a message like this that repeats periodically:</p>
<pre><code>
2013-05-17 23:04:59,793 INFO [chief-twitterstream] druid.examples.twitter.TwitterSpritzerFirehoseFactory - nextRow() has returned 1,000 InputRows
</code></pre>
<p>These messages indicate you are ingesting events. The Druid real time-node ingests events in an in-memory buffer. Periodically, these events will be persisted to disk. Persisting to disk generates a whole bunch of logs:
<pre><code>
2013-05-17 23:06:40,918 INFO [chief-twitterstream] com.metamx.druid.realtime.plumber.RealtimePlumberSchool - Submitting persist runnable for dataSource[twitterstream]
2013-05-17 23:06:40,920 INFO [twitterstream-incremental-persist] com.metamx.druid.realtime.plumber.RealtimePlumberSchool - DataSource[twitterstream], Interval[2013-05-17T23:00:00.000Z/2013-05-18T00:00:00.000Z], persisting Hydrant[FireHydrant{index=com.metamx.druid.index.v1.IncrementalIndex@126212dd, queryable=com.metamx.druid.index.IncrementalIndexSegment@64c47498, count=0}]
2013-05-17 23:06:40,937 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Starting persist for interval[2013-05-17T23:00:00.000Z/2013-05-17T23:07:00.000Z], rows[4,666]
2013-05-17 23:06:41,039 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - outDir[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0/v8-tmp] completed index.drd in 11 millis.
2013-05-17 23:06:41,070 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - outDir[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0/v8-tmp] completed dim conversions in 31 millis.
2013-05-17 23:06:41,275 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.CompressedPools - Allocating new chunkEncoder[1]
2013-05-17 23:06:41,332 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - outDir[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0/v8-tmp] completed walk through of 4,666 rows in 262 millis.
2013-05-17 23:06:41,334 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Starting dimension[htags] with cardinality[634]
2013-05-17 23:06:41,381 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Completed dimension[htags] in 49 millis.
2013-05-17 23:06:41,382 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Starting dimension[lang] with cardinality[19]
2013-05-17 23:06:41,398 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Completed dimension[lang] in 17 millis.
2013-05-17 23:06:41,398 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Starting dimension[utc_offset] with cardinality[32]
2013-05-17 23:06:41,413 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - Completed dimension[utc_offset] in 15 millis.
2013-05-17 23:06:41,413 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexMerger - outDir[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0/v8-tmp] completed inverted.drd in 81 millis.
2013-05-17 23:06:41,425 INFO [twitterstream-incremental-persist] com.metamx.druid.index.v1.IndexIO$DefaultIndexIOHandler - Converting v8[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0/v8-tmp] to v9[/tmp/example/twitter_realtime/basePersist/twitterstream/2013-05-17T23:00:00.000Z_2013-05-18T00:00:00.000Z/0]
2013-05-17 23:06:41,426 INFO [twitterstream-incremental-persist]
... ETC
</code></pre></p>
<p>The logs are about building different columns, probably not the most exciting stuff (they might as well be in Vulcan) if are you learning about Druid for the first time. Nevertheless, if you are interested in the details of our real-time architecture and why we persist indexes to disk, I suggest you read our <a href="http://static.druid.io/docs/druid.pdf">White Paper</a>.</p>
<p>Okay, things are about to get real (-time). To query the real-time node you&#39;ve spun up, you can issue:
<pre>./run_example_client.sh</pre></p>
<p>Select &quot;twitter&quot; once again. This script issues GroupByQuerys to the twitter data we&#39;ve been ingesting. The query looks like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;queryType&quot;</span><span class="p">:</span> <span class="s2">&quot;groupBy&quot;</span><span class="p">,</span>
<span class="nt">&quot;dataSource&quot;</span><span class="p">:</span> <span class="s2">&quot;twitterstream&quot;</span><span class="p">,</span>
<span class="nt">&quot;granularity&quot;</span><span class="p">:</span> <span class="s2">&quot;all&quot;</span><span class="p">,</span>
<span class="nt">&quot;dimensions&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;lang&quot;</span><span class="p">,</span> <span class="s2">&quot;utc_offset&quot;</span><span class="p">],</span>
<span class="nt">&quot;aggregations&quot;</span><span class="p">:[</span>
<span class="p">{</span> <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;count&quot;</span><span class="p">,</span> <span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;rows&quot;</span><span class="p">},</span>
<span class="p">{</span> <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;doubleSum&quot;</span><span class="p">,</span> <span class="nt">&quot;fieldName&quot;</span><span class="p">:</span> <span class="s2">&quot;tweets&quot;</span><span class="p">,</span> <span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;tweets&quot;</span><span class="p">}</span>
<span class="p">],</span>
<span class="nt">&quot;filter&quot;</span><span class="p">:</span> <span class="p">{</span> <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;selector&quot;</span><span class="p">,</span> <span class="nt">&quot;dimension&quot;</span><span class="p">:</span> <span class="s2">&quot;lang&quot;</span><span class="p">,</span> <span class="nt">&quot;value&quot;</span><span class="p">:</span> <span class="s2">&quot;en&quot;</span> <span class="p">},</span>
<span class="nt">&quot;intervals&quot;</span><span class="p">:[</span><span class="s2">&quot;2012-10-01T00:00/2020-01-01T00&quot;</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>This is a <strong>groupBy</strong> query, which you may be familiar with from SQL. We are grouping, or aggregating, via the <strong>dimensions</strong> field: [&quot;lang&quot;, &quot;utc_offset&quot;]. We are <strong>filtering</strong> via the <strong>&quot;lang&quot;</strong> dimension, to only look at english tweets. Our <strong>aggregations</strong> are what we are calculating: a row count, and the sum of the tweets in our data.</p>
<p>The result looks something like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">[</span>
<span class="p">{</span>
<span class="nt">&quot;version&quot;</span><span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;utc_offset&quot;</span><span class="p">:</span> <span class="s2">&quot;-10800&quot;</span><span class="p">,</span>
<span class="nt">&quot;tweets&quot;</span><span class="p">:</span> <span class="mi">90</span><span class="p">,</span>
<span class="nt">&quot;lang&quot;</span><span class="p">:</span> <span class="s2">&quot;en&quot;</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span><span class="p">:</span> <span class="mi">81</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="p">{</span>
<span class="nt">&quot;version&quot;</span><span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span><span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span><span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;utc_offset&quot;</span><span class="p">:</span> <span class="s2">&quot;-14400&quot;</span><span class="p">,</span>
<span class="nt">&quot;tweets&quot;</span><span class="p">:</span> <span class="mi">177</span><span class="p">,</span>
<span class="nt">&quot;lang&quot;</span><span class="p">:</span> <span class="s2">&quot;en&quot;</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span><span class="p">:</span> <span class="mi">154</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="err">...</span>
</code></pre></div>
<p>This data, plotted in a time series/distribution, looks something like this:</p>
<p><img src="http://metamarkets.com/wp-content/uploads/2013/06/tweets_timezone_offset.png" alt="Timezone / Tweets Scatter Plot"></p>
<p>This groupBy query is a bit complicated and we&#39;ll return to it later. For the time being, just make sure you are getting some blocks of data back. If you are having problems, make sure you have <a href="http://curl.haxx.se/">curl</a> installed. Control+C to break out of the client script.</p>
<h2 id="querying-druid">Querying Druid</h2>
<p>In your favorite editor, create the file:
<pre>time_boundary_query.body</pre></p>
<p>Druid queries are JSON blobs which are relatively painless to create programmatically, but an absolute pain to write by hand. So anyway, we are going to create a Druid query by hand. Add the following to the file you just created:
<pre><code>{

&quot;queryType&quot; : &quot;timeBoundary&quot;,
&quot;dataSource&quot; : &quot;twitterstream&quot;
}
</code></pre></p>
<p>The <a href="https://github.com/metamx/druid/wiki/TimeBoundaryQuery">TimeBoundaryQuery</a> is one of the simplest Druid queries. To run the query, you can issue:
<pre><code> curl -X POST &#39;http://localhost:8080/druid/v2/?pretty&#39; -H &#39;content-type: application/json&#39; -d @time_boundary_query.body</code></pre></p>
<p>We get something like this JSON back:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">[</span> <span class="p">{</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:09:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;result&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;minTime&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:09:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;maxTime&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T20:50:00.000Z&quot;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="p">]</span>
</code></pre></div>
<p>That&#39;s the result. What information do you think the result is conveying?
...
If you said the result is indicating the maximum and minimum timestamps of the events ingested we&#39;ve seen thus far (summarized to a minutely granularity), you are absolutely correct. I can see you are a person legitimately interested in learning about Druid. Let&#39;s explore a bit further.</p>
<p>Return to your favorite editor and create the file:
<pre>timeseries_query.body</pre></p>
<p>We are going to make a slightly more complicated query, the <a href="https://github.com/metamx/druid/wiki/TimeseriesQuery">TimeseriesQuery</a>. Copy and paste the following into the file:
<pre><code>{
&quot;queryType&quot;:&quot;timeseries&quot;,
&quot;dataSource&quot;:&quot;twitterstream&quot;,
&quot;intervals&quot;:[&quot;2010-01-01/2020-01-01&quot;],
&quot;granularity&quot;:&quot;all&quot;,
&quot;aggregations&quot;:[
{ &quot;type&quot;: &quot;count&quot;, &quot;name&quot;: &quot;rows&quot;},
{ &quot;type&quot;: &quot;doubleSum&quot;, &quot;fieldName&quot;: &quot;tweets&quot;, &quot;name&quot;: &quot;tweets&quot;}
]
}
</code></pre></p>
<p>You are probably wondering, what are these granularities and aggregations things? What the query is doing is aggregating some metrics over some span of time.
To issue the query and get some results, run the following in your command line:
<pre><code>curl -X POST &#39;http://localhost:8080/druid/v2/?pretty&#39; -H &#39;content-type: application/json&#39; -d @timeseries_query.body</code></pre></p>
<p>Once again, you should get a JSON blob of text back with your results, that looks something like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">[</span> <span class="p">{</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:09:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;result&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mf">358562.0</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span> <span class="p">:</span> <span class="mi">272271</span>
<span class="p">}</span>
<span class="p">}</span> <span class="p">]</span>
</code></pre></div>
<p>If you issue the query again, you should notice your results updating.</p>
<p>Right now all the results you are getting back are being aggregated into a single timestamp bucket. What if we wanted to see our aggregations on a per minute basis? What field can we change in the query to accomplish this?</p>
<p>If you loudly exclaimed &quot;we can change granularity to minute&quot;, you are absolutely correct again! We can specify different granularities to bucket our results, like so:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">{</span>
<span class="nt">&quot;queryType&quot;</span><span class="p">:</span><span class="s2">&quot;timeseries&quot;</span><span class="p">,</span>
<span class="nt">&quot;dataSource&quot;</span><span class="p">:</span><span class="s2">&quot;twitterstream&quot;</span><span class="p">,</span>
<span class="nt">&quot;intervals&quot;</span><span class="p">:[</span><span class="s2">&quot;2010-01-01/2020-01-01&quot;</span><span class="p">],</span>
<span class="nt">&quot;granularity&quot;</span><span class="p">:</span><span class="s2">&quot;minute&quot;</span><span class="p">,</span>
<span class="nt">&quot;aggregations&quot;</span><span class="p">:[</span>
<span class="p">{</span> <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;count&quot;</span><span class="p">,</span> <span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;rows&quot;</span><span class="p">},</span>
<span class="p">{</span> <span class="nt">&quot;type&quot;</span><span class="p">:</span> <span class="s2">&quot;doubleSum&quot;</span><span class="p">,</span> <span class="nt">&quot;fieldName&quot;</span><span class="p">:</span> <span class="s2">&quot;tweets&quot;</span><span class="p">,</span> <span class="nt">&quot;name&quot;</span><span class="p">:</span> <span class="s2">&quot;tweets&quot;</span><span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>This gives us something like the following:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">[</span> <span class="p">{</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:09:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;result&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mf">2650.0</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span> <span class="p">:</span> <span class="mi">2120</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:10:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;result&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mf">3401.0</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span> <span class="p">:</span> <span class="mi">2609</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2013-06-10T19:11:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;result&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mf">3472.0</span><span class="p">,</span>
<span class="nt">&quot;rows&quot;</span> <span class="p">:</span> <span class="mi">2610</span>
<span class="p">}</span>
<span class="p">},</span>
<span class="err">...</span>
</code></pre></div>
<h2 id="solving-a-problem">Solving a Problem</h2>
<p>One of Druid&#39;s main powers is to provide answers to problems, so let&#39;s pose a problem. What if we wanted to know what the top hash tags are, ordered by the number tweets, where the language is english, over the last few minutes you&#39;ve been reading this tutorial? To solve this problem, we have to return to the query we introduced at the very beginning of this tutorial, the <a href="https://github.com/metamx/druid/wiki/GroupByQuery">GroupByQuery</a>. It would be nice if we could group our results by a dimension value and somehow sort those results... and it turns out we can! </p>
<p>Let&#39;s create the file:
<pre>group_by_query.body</pre>
and put the following in there:
<pre><code>{
&quot;queryType&quot;: &quot;groupBy&quot;,
&quot;dataSource&quot;: &quot;twitterstream&quot;,
&quot;granularity&quot;: &quot;all&quot;,
&quot;dimensions&quot;: [&quot;htags&quot;],
&quot;orderBy&quot;: {&quot;type&quot;:&quot;default&quot;, &quot;columns&quot;:[{&quot;dimension&quot;: &quot;tweets&quot;, &quot;direction&quot;:&quot;DESCENDING&quot;}], &quot;limit&quot;:5},
&quot;aggregations&quot;:[
{ &quot;type&quot;: &quot;longSum&quot;, &quot;fieldName&quot;: &quot;tweets&quot;, &quot;name&quot;: &quot;tweets&quot;}
],
&quot;filter&quot;: {&quot;type&quot;: &quot;selector&quot;, &quot;dimension&quot;: &quot;lang&quot;, &quot;value&quot;: &quot;en&quot; },
&quot;intervals&quot;:[&quot;2012-10-01T00:00/2020-01-01T00&quot;]
}
</code></pre></p>
<p>Woah! Our query just got a way more complicated. Now we have these <a href="https://github.com/metamx/druid/wiki/Filters">Filters</a> things and this <a href="https://github.com/metamx/druid/wiki/OrderBy">OrderBy</a> thing. Fear not, it turns out the new objects we&#39;ve introduced to our query can help define the format of our results and provide an answer to our question.</p>
<p>If you issue the query:
<pre><code>curl -X POST &#39;http://localhost:8080/druid/v2/?pretty&#39; -H &#39;content-type: application/json&#39; -d @group_by_query.body</code></pre></p>
<p>You should hopefully see an answer to our question. For my twitter stream, it looks like this:</p>
<div class="highlight"><pre><code class="language-json" data-lang="json"><span></span><span class="p">[</span> <span class="p">{</span>
<span class="nt">&quot;version&quot;</span> <span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mi">2660</span><span class="p">,</span>
<span class="nt">&quot;htags&quot;</span> <span class="p">:</span> <span class="s2">&quot;android&quot;</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;version&quot;</span> <span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mi">1944</span><span class="p">,</span>
<span class="nt">&quot;htags&quot;</span> <span class="p">:</span> <span class="s2">&quot;E3&quot;</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;version&quot;</span> <span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mi">1927</span><span class="p">,</span>
<span class="nt">&quot;htags&quot;</span> <span class="p">:</span> <span class="s2">&quot;15SueñosPendientes&quot;</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;version&quot;</span> <span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mi">1717</span><span class="p">,</span>
<span class="nt">&quot;htags&quot;</span> <span class="p">:</span> <span class="s2">&quot;ipad&quot;</span>
<span class="p">}</span>
<span class="p">},</span> <span class="p">{</span>
<span class="nt">&quot;version&quot;</span> <span class="p">:</span> <span class="s2">&quot;v1&quot;</span><span class="p">,</span>
<span class="nt">&quot;timestamp&quot;</span> <span class="p">:</span> <span class="s2">&quot;2012-10-01T00:00:00.000Z&quot;</span><span class="p">,</span>
<span class="nt">&quot;event&quot;</span> <span class="p">:</span> <span class="p">{</span>
<span class="nt">&quot;tweets&quot;</span> <span class="p">:</span> <span class="mi">1515</span><span class="p">,</span>
<span class="nt">&quot;htags&quot;</span> <span class="p">:</span> <span class="s2">&quot;IDidntTextYouBackBecause&quot;</span>
<span class="p">}</span>
<span class="p">}</span> <span class="p">]</span>
</code></pre></div>
<p>Feel free to tweak other query parameters to answer other questions you may have about the data.</p>
<h2 id="additional-information">Additional Information</h2>
<p>This tutorial is merely showcasing a small fraction of what Druid can do. If you are interested in more information about Druid, including setting up a more sophisticated Druid cluster, please read the other links in our wiki.</p>
<p>And thus concludes our journey! Hopefully you learned a thing or two about Druid real-time ingestion, querying Druid, and how Druid can be used to solve problems. If you have additional questions, feel free to post in our <a href="http://www.groups.google.com/forum/#!forum/druid-development">google groups page</a>.</p>
</div>
</div>
</div>
</div>
<!-- Start page_footer include -->
<footer class="druid-footer">
<div class="container">
<div class="text-center">
<p>
<a href="/technology">Technology</a>&ensp;·&ensp;
<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
<a href="/docs/latest">Docs</a>&ensp;·&ensp;
<a href="/community/">Community</a>&ensp;·&ensp;
<a href="/downloads.html">Download</a>&ensp;·&ensp;
<a href="/faq">FAQ</a>
</p>
</div>
<div class="text-center">
<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
<a title="Download via Apache" href="https://www.apache.org/dyn/closer.cgi?path=/incubator/druid/0.16.1-incubating/apache-druid-0.16.1-incubating-bin.tar.gz" target="_blank"><span class="fas fa-feather"></span></a>&ensp;·&ensp;
<a title="GitHub" href="https://github.com/apache/incubator-druid" target="_blank"><span class="fab fa-github"></span></a>
</div>
<div class="text-center license">
Copyright © 2019 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
</div>
</div>
</footer>
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-131010415-1');
</script>
<script>
function trackDownload(type, url) {
ga('send', 'event', 'download', type, url);
}
</script>
<script src="//code.jquery.com/jquery.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script src="/assets/js/druid.js"></script>
<!-- stop page_footer include -->
</body>
</html>