blog/2014/02/03/rdruid-and-twitterstream.html - druid-website - Git at Google

 <!DOCTYPE html>
 <html lang="en">
   <head>
     <meta charset="UTF-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <meta name="description" content="Apache Druid">
 <meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
 <meta name="author" content="Apache Software Foundation">

 <title>Druid | RDruid and Twitterstream</title>

 <link rel="alternate" type="application/atom+xml" href="/feed">
 <link rel="shortcut icon" href="/img/favicon.png">

 <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">

 <link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>

 <link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
 <link rel="stylesheet" href="/css/base.css?v=1.1">
 <link rel="stylesheet" href="/css/header.css?v=1.1">
 <link rel="stylesheet" href="/css/footer.css?v=1.1">
 <link rel="stylesheet" href="/css/syntax.css?v=1.1">
 <link rel="stylesheet" href="/css/docs.css?v=1.1">

 <script>
   (function() {
     var cx = '000162378814775985090:molvbm0vggm';
     var gcse = document.createElement('script');
     gcse.type = 'text/javascript';
     gcse.async = true;
     gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
         '//cse.google.com/cse.js?cx=' + cx;
     var s = document.getElementsByTagName('script')[0];
     s.parentNode.insertBefore(gcse, s);
   })();
 </script>


   </head>
   <body>
     <!-- Start page_header include -->
 <script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>

 <div class="top-navigator">
   <div class="container">
     <div class="left-cont">
       <a class="logo" href="/"><span class="druid-logo"></span></a>
     </div>
     <div class="right-cont">
       <ul class="links">
         <li class=""><a href="/technology">Technology</a></li>
         <li class=""><a href="/use-cases">Use Cases</a></li>
         <li class=""><a href="/druid-powered">Powered By</a></li>
         <li class=""><a href="/docs/latest/design/">Docs</a></li>
         <li class=""><a href="/community/">Community</a></li>
         <li class="header-dropdown">
           <a>Apache</a>
           <div class="header-dropdown-menu">
             <a href="https://www.apache.org/" target="_blank">Foundation</a>
             <a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
             <a href="https://www.apache.org/licenses/" target="_blank">License</a>
             <a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
             <a href="https://www.apache.org/security/" target="_blank">Security</a>
             <a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
           </div>
         </li>
         <li class=" button-link"><a href="/downloads.html">Download</a></li>
       </ul>
     </div>
   </div>
   <div class="action-button menu-icon">
     <span class="fa fa-bars"></span> MENU
   </div>
   <div class="action-button menu-icon-close">
     <span class="fa fa-times"></span> MENU
   </div>
 </div>

 <script type="text/javascript">
   var $menu = $('.right-cont');
   var $menuIcon = $('.menu-icon');
   var $menuIconClose = $('.menu-icon-close');

   function showMenu() {
     $menu.fadeIn(100);
     $menuIcon.fadeOut(100);
     $menuIconClose.fadeIn(100);
   }

   $menuIcon.click(showMenu);

   function hideMenu() {
     $menu.fadeOut(100);
     $menuIconClose.fadeOut(100);
     $menuIcon.fadeIn(100);
   }

   $menuIconClose.click(hideMenu);

   $(window).resize(function() {
     if ($(window).width() >= 840) {
       $menu.fadeIn(100);
       $menuIcon.fadeOut(100);
       $menuIconClose.fadeOut(100);
     }
     else {
       $menu.fadeOut(100);
       $menuIcon.fadeIn(100);
       $menuIconClose.fadeOut(100);
     }
   });
 </script>

 <!-- Stop page_header include -->


     <link rel="stylesheet" href="/css/blogs.css">

 <div class="blog druid-header">
   <div class="row">
     <div class="col-md-8 col-md-offset-2">
       <div class="title-image-wrap">

       </div>
     </div>
   </div>
 </div>

 <div class="container blog">
   <div class="row">
     <div class="col-md-8 col-md-offset-2">
       <div class="blog-entry">
         <h1>RDruid and Twitterstream</h1>
         <p class="text-muted">by <span class="author text-uppercase">Igal Levy</span> · February  3, 2014</p>

         <p>What if you could combine a statistical analysis language with the power of an analytics database for instant insights into realtime data? You&#39;d be able to draw conclusions from analyzing data streams at the speed of now. That&#39;s what combining the prowess of a <a href="">Druid database</a> with the power of <a href="http://www.r-project.org">R</a> can do.</p>

 <p>In this blog, we&#39;ll look at how to bring streamed realtime data into R using nothing more than a laptop, an Internet connection, and open-source applications. And we&#39;ll do it with <em>only one</em> Druid node.</p>

 <h2 id="what-youll-need">What You&#39;ll Need</h2>

 <p>You&#39;ll need to download and unpack <a href="http://static.druid.io/artifacts/releases/druid-services-0.6.52-bin.tar.gz">Druid</a>.</p>

 <p>Get the <a href="http://www.r-project.org/">R application</a> for your platform.
 We also recommend using <a href="http://www.rstudio.com/">RStudio</a> as the R IDE, which is what we used to run R.</p>

 <p>You&#39;ll also need a free Twitter account to be able to get a sample of streamed Twitter data.</p>

 <h2 id="set-up-the-twitterstream">Set Up the Twitterstream</h2>

 <p>First, register with the Twitter API. Log in at the <a href="https://dev.twitter.com/apps/new">Twitter developer&#39;s site</a> (you can use your normal Twitter credentials) and fill out the form for creating an application; use any website and callback URL to complete the form.</p>

 <p>Make note of the API credentials that are then generated. Later you&#39;ll need to enter them when prompted by the Twitter-example startup script, or save them in a <code>twitter4j.properties</code> file (nicer if you ever restart the server). If using a properties file, save it under <code>$DRUID_HOME/examples/twitter</code>. The file should contains the following (using your real keys):</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>oauth.consumerKey=&lt;yourTwitterConsumerKey&gt;
 oauth.consumerSecret=&lt;yourTwitterConsumerSecret&gt;
 oauth.accessToken=&lt;yourTwitterAccessToken&gt;
 oauth.accessTokenSecret=&lt;yourTwitterAccessTokenSecret&gt;
 </code></pre></div>
 <h2 id="start-up-the-realtime-node">Start Up the Realtime Node</h2>

 <p>From the Druid home directory, start the Druid Realtime node:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>$DRUID_HOME/run_example_server.sh
 </code></pre></div>
 <p>When prompted, you&#39;ll choose the &quot;twitter&quot; example. If you&#39;re using the properties file, the server should start right up. Otherwise, you&#39;ll have to answer the prompts with the credentials you obtained from Twitter.</p>

 <p>After the Realtime node starts successfully, you should see &quot;Connected_to_Twitter&quot; printed, as well as messages similar to the following:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>2014-01-13 19:35:59,646 INFO [chief-twitterstream] druid.examples.twitter.TwitterSpritzerFirehoseFactory - nextRow() has returned 1,000 InputRows
 </code></pre></div>
 <p>This indicates that the Druid Realtime node is ingesting tweets in realtime.</p>

 <h2 id="set-up-r">Set Up R</h2>

 <p>Install and load the following packages:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>install.packages(&quot;devtools&quot;)
 install.packages(&quot;ggplot2&quot;)

 library(&quot;devtools&quot;)

 install_github(&quot;RDruid&quot;, &quot;metamx&quot;)

 library(RDruid)
 library(ggplot2)
 </code></pre></div>
 <p>Now tell RDruid where to find the Realtime node:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid &lt;- druid.url(&quot;localhost:8083&quot;)
 </code></pre></div>
 <h2 id="querying-the-realtime-node">Querying the Realtime Node</h2>

 <p><a href="/docs/latest/Tutorial:-All-About-Queries.html">Druid queries</a> are in the format of JSON objects, but in R they&#39;ll have a different format. Let&#39;s look at this with a simple query that will give the time range of the Twitter data currently in our Druid node:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>&gt; druid.query.timeBoundary(druid, dataSource=&quot;twitterstream&quot;, intervals=interval(ymd(20140101), ymd(20141231)), verbose=&quot;true&quot;)
 </code></pre></div>
 <p>Let&#39;s break this query down:</p>

 <ul>
 <li><code>druid.query.timeBoundary</code> &ndash; The RDruid query that finds the earliest and latest timestamps on data in Druid, within a specified interval.</li>
 <li><code>druid</code> and <code>dataSource</code> &ndash; Specify the location of the Druid node and the name of the Twitter data stream.</li>
 <li><code>intervals</code> &ndash; The interval we&#39;re looking in. Our choice is more than wide enough to encompass any data we&#39;ve received from Twitter.</li>
 <li><code>verbose</code> &ndash; The response should also print the JSON object that is posted to the Realtime node, that node&#39;s HTTP response, and possibly other information besides the actual response to the query.</li>
 </ul>

 <p>By making this a verbose query, we can take a look at the JSON object that RDruid creates from our R query and will post to the Druid node:</p>

 <p>{
     &quot;dataSource&quot; : &quot;twitterstream&quot;,
     &quot;intervals&quot; : [
         &quot;2014-01-01T00:00:00.000+00:00/2014-12-31T00:00:00.000+00:00&quot;
     ],
     &quot;queryType&quot; : &quot;timeBoundary&quot;
 }</p>

 <p>This is the type of query that Druid can understand. Now let&#39;s look at the rest of the post and response:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>* Adding handle: conn: 0x7fa1eb723800
 * Adding handle: send: 0
 * Adding handle: recv: 0
 * Curl_addHandleToPipeline: length: 1
 * - Conn 2 (0x7fa1eb723800) send_pipe: 1, recv_pipe: 0
 * About to connect() to localhost port 8083 (#2)
 *   Trying ::1...
 * Connected to localhost (::1) port 8083 (#2)
 &gt; POST /druid/v2/ HTTP/1.1
 Host: localhost:8083
 Accept: */*
 Accept-Encoding: gzip
 Content-Type: application/json
 Content-Length: 151

 * upload completely sent off: 151 out of 151 bytes
 &lt; HTTP/1.1 200 OK
 &lt; Content-Type: application/x-javascript
 &lt; Transfer-Encoding: chunked
 * Server Jetty(8.1.11.v20130520) is not blacklisted
 &lt; Server: Jetty(8.1.11.v20130520)
 &lt;
 * Connection #2 to host localhost left intact
                   minTime                   maxTime
 &quot;2014-01-25 00:52:00 UTC&quot; &quot;2014-01-25 01:35:00 UTC&quot;
 </code></pre></div>
 <p>At the very end comes the response to our query, a minTime and maxTime, the boundaries to our data set.</p>

 <h3 id="more-complex-queries">More Complex Queries</h3>

 <p>Now lets look at some real Twitter data. Say we are interested in the number of tweets per language during that time period. We need to do an aggregation via a groupBy query (see RDruid help in RStudio):</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource=&quot;twitterstream&quot;,
                     interval(ymd(&quot;2014-01-01&quot;), ymd(&quot;2015-01-01&quot;)),
                     granularity=granularity(&quot;P1D&quot;),
                     aggregations = (tweets = sum(metric(&quot;tweets&quot;))),
                     dimensions = &quot;lang&quot;,
                     verbose=&quot;true&quot;)
 </code></pre></div>
 <p>We see some new arguments in this query:</p>

 <ul>
 <li><code>granularity</code> &ndash; This sets the time period for each aggregation (in ISO 8601). Since all our data is in one day and we don&#39;t care about breaking down by hour or minute, we choose per-day granularity.</li>
 <li><code>aggregations</code> &ndash; This is where we specify and name the metrics that we&#39;re interesting in summing up. We wants tweets, and it just so happens that this metric is named &quot;tweets&quot; as it&#39;s mapped from the twitter API, so we&#39;ll keep that name as the column head for our output.</li>
 <li><code>dimension</code> &ndash; Here&#39;s the actual type of data we&#39;re interesting in. Tweets are identified by language in their metadata (using ISO 639 language codes). We use the name of the dimension, &quot;lang,&quot; to slice the data along language.</li>
 </ul>

 <p>Here&#39;s the actual output:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{
     &quot;intervals&quot; : [
         &quot;2014-01-01T00:00:00.000+00:00/2015-01-01T00:00:00.000+00:00&quot;
     ],
     &quot;aggregations&quot; : [
         {
             &quot;type&quot; : &quot;doubleSum&quot;,
             &quot;name&quot; : &quot;tweets&quot;,
             &quot;fieldName&quot; : &quot;tweets&quot;
         }
     ],
     &quot;dataSource&quot; : &quot;twitterstream&quot;,
     &quot;filter&quot; : null,
     &quot;having&quot; : null,
     &quot;granularity&quot; : {
         &quot;type&quot; : &quot;period&quot;,
         &quot;period&quot; : &quot;P1D&quot;,
         &quot;origin&quot; : null,
         &quot;timeZone&quot; : null
     },
     &quot;dimensions&quot; : [
         &quot;lang&quot;
     ],
     &quot;postAggregations&quot; : null,
     &quot;limitSpec&quot; : null,
     &quot;queryType&quot; : &quot;groupBy&quot;,
     &quot;context&quot; : null
 }
 * Adding handle: conn: 0x7fa1eb767600
 * Adding handle: send: 0
 * Adding handle: recv: 0
 * Curl_addHandleToPipeline: length: 1
 * - Conn 3 (0x7fa1eb767600) send_pipe: 1, recv_pipe: 0
 * About to connect() to localhost port 8083 (#3)
 *   Trying ::1...
 * Connected to localhost (::1) port 8083 (#3)
 &gt; POST /druid/v2/ HTTP/1.1
 Host: localhost:8083
 Accept: */*
 Accept-Encoding: gzip
 Content-Type: application/json
 Content-Length: 489

 * upload completely sent off: 489 out of 489 bytes
 &lt; HTTP/1.1 200 OK
 &lt; Content-Type: application/x-javascript
 &lt; Transfer-Encoding: chunked
 * Server Jetty(8.1.11.v20130520) is not blacklisted
 &lt; Server: Jetty(8.1.11.v20130520)
 &lt;
 * Connection #3 to host localhost left intact
     timestamp tweets  lang
 1  2014-01-25   6476    ar
 2  2014-01-25      1    bg
 3  2014-01-25     22    ca
 4  2014-01-25     10    cs
 5  2014-01-25     21    da
 6  2014-01-25    311    de
 7  2014-01-25     23    el
 8  2014-01-25  74842    en
 9  2014-01-25     20 en-GB
 10 2014-01-25    690 en-gb
 11 2014-01-25  22920    es
 12 2014-01-25      2    eu
 13 2014-01-25      2    fa
 14 2014-01-25     10    fi
 15 2014-01-25     36   fil
 16 2014-01-25   1521    fr
 17 2014-01-25      9    gl
 18 2014-01-25     15    he
 19 2014-01-25      1    hi
 20 2014-01-25      5    hu
 21 2014-01-25   3809    id
 22 2014-01-25      4    in
 23 2014-01-25    256    it
 24 2014-01-25  19748    ja
 25 2014-01-25   1079    ko
 26 2014-01-25      1    ms
 27 2014-01-25     19   msa
 28 2014-01-25    243    nl
 29 2014-01-25     24    no
 30 2014-01-25    113    pl
 31 2014-01-25  12707    pt
 32 2014-01-25      3    ro
 33 2014-01-25   1606    ru
 34 2014-01-25      1    sr
 35 2014-01-25     76    sv
 36 2014-01-25    532    th
 37 2014-01-25   1415    tr
 38 2014-01-25     30    uk
 39 2014-01-25      6 xx-lc
 40 2014-01-25      1 zh-CN
 41 2014-01-25     30 zh-cn
 42 2014-01-25     34 zh-tw
 </code></pre></div>
 <p>This gives an idea of what languages dominate Twitter (at least for the given time range). For visualization, you can use a library like ggplot2. Try the <code>geom_bar</code> function to quickly produce a basic bar chart of the data. First, send the query above to a dataframe (let&#39;s call it <code>tweet_langs</code> in this example), then subset it to take languages with more than a thousand tweets:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>major_tweet_langs &lt;- subset(tweet_langs, tweets &gt; 1000)
 </code></pre></div>
 <p>Then create the chart:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>ggplot(major_tweet_langs, aes(x=lang, y=tweets)) + geom_bar(stat=&quot;identity&quot;)
 </code></pre></div>
 <p>You can refine this query with more aggregations and post aggregations (math within the results). For example, to find out how many rows in Druid the data for each of those languages takes, use:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource=&quot;twitterstream&quot;,
                     interval(ymd(&quot;2014-01-01&quot;), ymd(&quot;2015-01-01&quot;)),
                     granularity=granularity(&quot;all&quot;),
                     aggregations = list(rows = druid.count(),
                                         tweets = sum(metric(&quot;tweets&quot;))),
                     dimensions = &quot;lang&quot;,
                     verbose=&quot;true&quot;)
 </code></pre></div>
 <h2 id="metrics-and-dimensions">Metrics and Dimensions</h2>

 <p>How do you find out what metrics and dimensions are available to query? You can find the metrics in <code>$DRUID_HOME/examples/twitter/twitter_realtime.spec</code>. The dimensions are not as apparent. There&#39;s an easy way to query for them from a certain type of Druid node, but not from a Realtime node, which leaves the less-appetizing approach of digging through <a href="https://github.com/metamx/druid/blob/druid-0.5.x/examples/src/main/java/druid/examples/twitter/TwitterSpritzerFirehoseFactory.java">code</a>. To allow for further experimentation, we list some here:</p>

 <ul>
 <li>&quot;first_hashtag&quot;</li>
 <li>&quot;user_time_zone&quot;</li>
 <li>&quot;user_location&quot;</li>
 <li>&quot;is_retweet&quot;</li>
 <li>&quot;is_viral&quot;</li>
 </ul>

 <p>Some interesting analyses on current events could be done using these dimensions and metrics. For example, you could filter on specific hashtags for events that happen to be spiking at the time:</p>
 <div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource=&quot;twitterstream&quot;,
                 interval(ymd(&quot;2014-01-01&quot;), ymd(&quot;2015-01-01&quot;)),
                 granularity=granularity(&quot;P1D&quot;),
                 aggregations = (tweets = sum(metric(&quot;tweets&quot;))),
                 filter =
                     dimension(&quot;first_hashtag&quot;) %~% &quot;academyawards&quot; |
                     dimension(&quot;first_hashtag&quot;) %~% &quot;oscars&quot;,
                 dimensions   = list(&quot;first_hashtag&quot;))
 </code></pre></div>
 <p>See the <a href="https://github.com/metamx/RDruid/wiki/Examples">RDruid wiki</a> for more examples.</p>

 <p>The point to remember is that this data is being streamed into Druid and brought into R via RDruid in realtime. For example, with an R script the data could be continuously queried, updated, and analyzed.</p>

       </div>
     </div>
   </div>
 </div>


     <!-- Start page_footer include -->
 <footer class="druid-footer">
 <div class="container">
   <div class="text-center">
     <p>
     <a href="/technology">Technology</a>&ensp;·&ensp;
     <a href="/use-cases">Use Cases</a>&ensp;·&ensp;
     <a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
     <a href="/docs/latest/">Docs</a>&ensp;·&ensp;
     <a href="/community/">Community</a>&ensp;·&ensp;
     <a href="/downloads.html">Download</a>&ensp;·&ensp;
     <a href="/faq">FAQ</a>
     </p>
   </div>
   <div class="text-center">
     <a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
     <a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
     <a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
   </div>
   <div class="text-center license">
     Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
     Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
     Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
   </div>
 </div>
 </footer>

 <script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
 <script>
   window.dataLayer = window.dataLayer || [];
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
   gtag('config', 'UA-131010415-1');
 </script>
 <script>
   function trackDownload(type, url) {
     ga('send', 'event', 'download', type, url);
   }
 </script>
 <script src="//code.jquery.com/jquery.min.js"></script>
 <script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
 <script src="/assets/js/druid.js"></script>
 <!-- stop page_footer include -->


   </body>
 </html>
	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="UTF-8" />
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<meta name="description" content="Apache Druid">
	<meta name="keywords" content="druid,kafka,database,analytics,streaming,real-time,real time,apache,open source">
	<meta name="author" content="Apache Software Foundation">

	<title>Druid \| RDruid and Twitterstream</title>

	<link rel="alternate" type="application/atom+xml" href="/feed">
	<link rel="shortcut icon" href="/img/favicon.png">

	<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">

	<link href='//fonts.googleapis.com/css?family=Open+Sans+Condensed:300,700,300italic\|Open+Sans:300italic,400italic,600italic,400,300,600,700' rel='stylesheet' type='text/css'>

	<link rel="stylesheet" href="/css/bootstrap-pure.css?v=1.1">
	<link rel="stylesheet" href="/css/base.css?v=1.1">
	<link rel="stylesheet" href="/css/header.css?v=1.1">
	<link rel="stylesheet" href="/css/footer.css?v=1.1">
	<link rel="stylesheet" href="/css/syntax.css?v=1.1">
	<link rel="stylesheet" href="/css/docs.css?v=1.1">

	<script>
	(function() {
	var cx = '000162378814775985090:molvbm0vggm';
	var gcse = document.createElement('script');
	gcse.type = 'text/javascript';
	gcse.async = true;
	gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
	'//cse.google.com/cse.js?cx=' + cx;
	var s = document.getElementsByTagName('script')[0];
	s.parentNode.insertBefore(gcse, s);
	})();
	</script>


	</head>
	<body>
	<!-- Start page_header include -->
	<script src="//ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>

	<div class="top-navigator">
	<div class="container">
	<div class="left-cont">
	<a class="logo" href="/"><span class="druid-logo"></span></a>
	</div>
	<div class="right-cont">
	<ul class="links">
	<li class=""><a href="/technology">Technology</a></li>
	<li class=""><a href="/use-cases">Use Cases</a></li>
	<li class=""><a href="/druid-powered">Powered By</a></li>
	<li class=""><a href="/docs/latest/design/">Docs</a></li>
	<li class=""><a href="/community/">Community</a></li>
	<li class="header-dropdown">
	<a>Apache</a>
	<div class="header-dropdown-menu">
	<a href="https://www.apache.org/" target="_blank">Foundation</a>
	<a href="https://www.apache.org/events/current-event" target="_blank">Events</a>
	<a href="https://www.apache.org/licenses/" target="_blank">License</a>
	<a href="https://www.apache.org/foundation/thanks.html" target="_blank">Thanks</a>
	<a href="https://www.apache.org/security/" target="_blank">Security</a>
	<a href="https://www.apache.org/foundation/sponsorship.html" target="_blank">Sponsorship</a>
	</div>
	</li>
	<li class=" button-link"><a href="/downloads.html">Download</a></li>
	</ul>
	</div>
	</div>
	<div class="action-button menu-icon">
	<span class="fa fa-bars"></span> MENU
	</div>
	<div class="action-button menu-icon-close">
	<span class="fa fa-times"></span> MENU
	</div>
	</div>

	<script type="text/javascript">
	var $menu = $('.right-cont');
	var $menuIcon = $('.menu-icon');
	var $menuIconClose = $('.menu-icon-close');

	function showMenu() {
	$menu.fadeIn(100);
	$menuIcon.fadeOut(100);
	$menuIconClose.fadeIn(100);
	}

	$menuIcon.click(showMenu);

	function hideMenu() {
	$menu.fadeOut(100);
	$menuIconClose.fadeOut(100);
	$menuIcon.fadeIn(100);
	}

	$menuIconClose.click(hideMenu);

	$(window).resize(function() {
	if ($(window).width() >= 840) {
	$menu.fadeIn(100);
	$menuIcon.fadeOut(100);
	$menuIconClose.fadeOut(100);
	}
	else {
	$menu.fadeOut(100);
	$menuIcon.fadeIn(100);
	$menuIconClose.fadeOut(100);
	}
	});
	</script>

	<!-- Stop page_header include -->


	<link rel="stylesheet" href="/css/blogs.css">

	<div class="blog druid-header">
	<div class="row">
	<div class="col-md-8 col-md-offset-2">
	<div class="title-image-wrap">

	</div>
	</div>
	</div>
	</div>

	<div class="container blog">
	<div class="row">
	<div class="col-md-8 col-md-offset-2">
	<div class="blog-entry">
	<h1>RDruid and Twitterstream</h1>
	<p class="text-muted">by <span class="author text-uppercase">Igal Levy</span> · February 3, 2014</p>

	<p>What if you could combine a statistical analysis language with the power of an analytics database for instant insights into realtime data? You'd be able to draw conclusions from analyzing data streams at the speed of now. That's what combining the prowess of a <a href="">Druid database</a> with the power of <a href="http://www.r-project.org">R</a> can do.</p>

	<p>In this blog, we'll look at how to bring streamed realtime data into R using nothing more than a laptop, an Internet connection, and open-source applications. And we'll do it with <em>only one</em> Druid node.</p>

	<h2 id="what-youll-need">What You'll Need</h2>

	<p>You'll need to download and unpack <a href="http://static.druid.io/artifacts/releases/druid-services-0.6.52-bin.tar.gz">Druid</a>.</p>

	<p>Get the <a href="http://www.r-project.org/">R application</a> for your platform.
	We also recommend using <a href="http://www.rstudio.com/">RStudio</a> as the R IDE, which is what we used to run R.</p>

	<p>You'll also need a free Twitter account to be able to get a sample of streamed Twitter data.</p>

	<h2 id="set-up-the-twitterstream">Set Up the Twitterstream</h2>

	<p>First, register with the Twitter API. Log in at the <a href="https://dev.twitter.com/apps/new">Twitter developer's site</a> (you can use your normal Twitter credentials) and fill out the form for creating an application; use any website and callback URL to complete the form.</p>

	<p>Make note of the API credentials that are then generated. Later you'll need to enter them when prompted by the Twitter-example startup script, or save them in a <code>twitter4j.properties</code> file (nicer if you ever restart the server). If using a properties file, save it under <code>$DRUID_HOME/examples/twitter</code>. The file should contains the following (using your real keys):</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>oauth.consumerKey=<yourTwitterConsumerKey>
	oauth.consumerSecret=<yourTwitterConsumerSecret>
	oauth.accessToken=<yourTwitterAccessToken>
	oauth.accessTokenSecret=<yourTwitterAccessTokenSecret>
	</code></pre></div>
	<h2 id="start-up-the-realtime-node">Start Up the Realtime Node</h2>

	<p>From the Druid home directory, start the Druid Realtime node:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>$DRUID_HOME/run_example_server.sh
	</code></pre></div>
	<p>When prompted, you'll choose the "twitter" example. If you're using the properties file, the server should start right up. Otherwise, you'll have to answer the prompts with the credentials you obtained from Twitter.</p>

	<p>After the Realtime node starts successfully, you should see "Connected_to_Twitter" printed, as well as messages similar to the following:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>2014-01-13 19:35:59,646 INFO [chief-twitterstream] druid.examples.twitter.TwitterSpritzerFirehoseFactory - nextRow() has returned 1,000 InputRows
	</code></pre></div>
	<p>This indicates that the Druid Realtime node is ingesting tweets in realtime.</p>

	<h2 id="set-up-r">Set Up R</h2>

	<p>Install and load the following packages:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>install.packages("devtools")
	install.packages("ggplot2")

	library("devtools")

	install_github("RDruid", "metamx")

	library(RDruid)
	library(ggplot2)
	</code></pre></div>
	<p>Now tell RDruid where to find the Realtime node:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid <- druid.url("localhost:8083")
	</code></pre></div>
	<h2 id="querying-the-realtime-node">Querying the Realtime Node</h2>

	<p><a href="/docs/latest/Tutorial:-All-About-Queries.html">Druid queries</a> are in the format of JSON objects, but in R they'll have a different format. Let's look at this with a simple query that will give the time range of the Twitter data currently in our Druid node:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>> druid.query.timeBoundary(druid, dataSource="twitterstream", intervals=interval(ymd(20140101), ymd(20141231)), verbose="true")
	</code></pre></div>
	<p>Let's break this query down:</p>

	<ul>
	<li><code>druid.query.timeBoundary</code> – The RDruid query that finds the earliest and latest timestamps on data in Druid, within a specified interval.</li>
	<li><code>druid</code> and <code>dataSource</code> – Specify the location of the Druid node and the name of the Twitter data stream.</li>
	<li><code>intervals</code> – The interval we're looking in. Our choice is more than wide enough to encompass any data we've received from Twitter.</li>
	<li><code>verbose</code> – The response should also print the JSON object that is posted to the Realtime node, that node's HTTP response, and possibly other information besides the actual response to the query.</li>
	</ul>

	<p>By making this a verbose query, we can take a look at the JSON object that RDruid creates from our R query and will post to the Druid node:</p>

	<p>{
	"dataSource" : "twitterstream",
	"intervals" : [
	"2014-01-01T00:00:00.000+00:00/2014-12-31T00:00:00.000+00:00"
	],
	"queryType" : "timeBoundary"
	}</p>

	<p>This is the type of query that Druid can understand. Now let's look at the rest of the post and response:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>* Adding handle: conn: 0x7fa1eb723800
	* Adding handle: send: 0
	* Adding handle: recv: 0
	* Curl_addHandleToPipeline: length: 1
	* - Conn 2 (0x7fa1eb723800) send_pipe: 1, recv_pipe: 0
	* About to connect() to localhost port 8083 (#2)
	* Trying ::1...
	* Connected to localhost (::1) port 8083 (#2)
	> POST /druid/v2/ HTTP/1.1
	Host: localhost:8083
	Accept: /
	Accept-Encoding: gzip
	Content-Type: application/json
	Content-Length: 151

	* upload completely sent off: 151 out of 151 bytes
	< HTTP/1.1 200 OK
	< Content-Type: application/x-javascript
	< Transfer-Encoding: chunked
	* Server Jetty(8.1.11.v20130520) is not blacklisted
	< Server: Jetty(8.1.11.v20130520)
	<
	* Connection #2 to host localhost left intact
	minTime maxTime
	"2014-01-25 00:52:00 UTC" "2014-01-25 01:35:00 UTC"
	</code></pre></div>
	<p>At the very end comes the response to our query, a minTime and maxTime, the boundaries to our data set.</p>

	<h3 id="more-complex-queries">More Complex Queries</h3>

	<p>Now lets look at some real Twitter data. Say we are interested in the number of tweets per language during that time period. We need to do an aggregation via a groupBy query (see RDruid help in RStudio):</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource="twitterstream",
	interval(ymd("2014-01-01"), ymd("2015-01-01")),
	granularity=granularity("P1D"),
	aggregations = (tweets = sum(metric("tweets"))),
	dimensions = "lang",
	verbose="true")
	</code></pre></div>
	<p>We see some new arguments in this query:</p>

	<ul>
	<li><code>granularity</code> – This sets the time period for each aggregation (in ISO 8601). Since all our data is in one day and we don't care about breaking down by hour or minute, we choose per-day granularity.</li>
	<li><code>aggregations</code> – This is where we specify and name the metrics that we're interesting in summing up. We wants tweets, and it just so happens that this metric is named "tweets" as it's mapped from the twitter API, so we'll keep that name as the column head for our output.</li>
	<li><code>dimension</code> – Here's the actual type of data we're interesting in. Tweets are identified by language in their metadata (using ISO 639 language codes). We use the name of the dimension, "lang," to slice the data along language.</li>
	</ul>

	<p>Here's the actual output:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>{
	"intervals" : [
	"2014-01-01T00:00:00.000+00:00/2015-01-01T00:00:00.000+00:00"
	],
	"aggregations" : [
	{
	"type" : "doubleSum",
	"name" : "tweets",
	"fieldName" : "tweets"
	}
	],
	"dataSource" : "twitterstream",
	"filter" : null,
	"having" : null,
	"granularity" : {
	"type" : "period",
	"period" : "P1D",
	"origin" : null,
	"timeZone" : null
	},
	"dimensions" : [
	"lang"
	],
	"postAggregations" : null,
	"limitSpec" : null,
	"queryType" : "groupBy",
	"context" : null
	}
	* Adding handle: conn: 0x7fa1eb767600
	* Adding handle: send: 0
	* Adding handle: recv: 0
	* Curl_addHandleToPipeline: length: 1
	* - Conn 3 (0x7fa1eb767600) send_pipe: 1, recv_pipe: 0
	* About to connect() to localhost port 8083 (#3)
	* Trying ::1...
	* Connected to localhost (::1) port 8083 (#3)
	> POST /druid/v2/ HTTP/1.1
	Host: localhost:8083
	Accept: /
	Accept-Encoding: gzip
	Content-Type: application/json
	Content-Length: 489

	* upload completely sent off: 489 out of 489 bytes
	< HTTP/1.1 200 OK
	< Content-Type: application/x-javascript
	< Transfer-Encoding: chunked
	* Server Jetty(8.1.11.v20130520) is not blacklisted
	< Server: Jetty(8.1.11.v20130520)
	<
	* Connection #3 to host localhost left intact
	timestamp tweets lang
	1 2014-01-25 6476 ar
	2 2014-01-25 1 bg
	3 2014-01-25 22 ca
	4 2014-01-25 10 cs
	5 2014-01-25 21 da
	6 2014-01-25 311 de
	7 2014-01-25 23 el
	8 2014-01-25 74842 en
	9 2014-01-25 20 en-GB
	10 2014-01-25 690 en-gb
	11 2014-01-25 22920 es
	12 2014-01-25 2 eu
	13 2014-01-25 2 fa
	14 2014-01-25 10 fi
	15 2014-01-25 36 fil
	16 2014-01-25 1521 fr
	17 2014-01-25 9 gl
	18 2014-01-25 15 he
	19 2014-01-25 1 hi
	20 2014-01-25 5 hu
	21 2014-01-25 3809 id
	22 2014-01-25 4 in
	23 2014-01-25 256 it
	24 2014-01-25 19748 ja
	25 2014-01-25 1079 ko
	26 2014-01-25 1 ms
	27 2014-01-25 19 msa
	28 2014-01-25 243 nl
	29 2014-01-25 24 no
	30 2014-01-25 113 pl
	31 2014-01-25 12707 pt
	32 2014-01-25 3 ro
	33 2014-01-25 1606 ru
	34 2014-01-25 1 sr
	35 2014-01-25 76 sv
	36 2014-01-25 532 th
	37 2014-01-25 1415 tr
	38 2014-01-25 30 uk
	39 2014-01-25 6 xx-lc
	40 2014-01-25 1 zh-CN
	41 2014-01-25 30 zh-cn
	42 2014-01-25 34 zh-tw
	</code></pre></div>
	<p>This gives an idea of what languages dominate Twitter (at least for the given time range). For visualization, you can use a library like ggplot2. Try the <code>geom_bar</code> function to quickly produce a basic bar chart of the data. First, send the query above to a dataframe (let's call it <code>tweet_langs</code> in this example), then subset it to take languages with more than a thousand tweets:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>major_tweet_langs <- subset(tweet_langs, tweets > 1000)
	</code></pre></div>
	<p>Then create the chart:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>ggplot(major_tweet_langs, aes(x=lang, y=tweets)) + geom_bar(stat="identity")
	</code></pre></div>
	<p>You can refine this query with more aggregations and post aggregations (math within the results). For example, to find out how many rows in Druid the data for each of those languages takes, use:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource="twitterstream",
	interval(ymd("2014-01-01"), ymd("2015-01-01")),
	granularity=granularity("all"),
	aggregations = list(rows = druid.count(),
	tweets = sum(metric("tweets"))),
	dimensions = "lang",
	verbose="true")
	</code></pre></div>
	<h2 id="metrics-and-dimensions">Metrics and Dimensions</h2>

	<p>How do you find out what metrics and dimensions are available to query? You can find the metrics in <code>$DRUID_HOME/examples/twitter/twitter_realtime.spec</code>. The dimensions are not as apparent. There's an easy way to query for them from a certain type of Druid node, but not from a Realtime node, which leaves the less-appetizing approach of digging through <a href="https://github.com/metamx/druid/blob/druid-0.5.x/examples/src/main/java/druid/examples/twitter/TwitterSpritzerFirehoseFactory.java">code</a>. To allow for further experimentation, we list some here:</p>

	<ul>
	<li>"first_hashtag"</li>
	<li>"user_time_zone"</li>
	<li>"user_location"</li>
	<li>"is_retweet"</li>
	<li>"is_viral"</li>
	</ul>

	<p>Some interesting analyses on current events could be done using these dimensions and metrics. For example, you could filter on specific hashtags for events that happen to be spiking at the time:</p>
	<div class="highlight"><pre><code class="language-text" data-lang="text"><span></span>druid.query.groupBy(druid, dataSource="twitterstream",
	interval(ymd("2014-01-01"), ymd("2015-01-01")),
	granularity=granularity("P1D"),
	aggregations = (tweets = sum(metric("tweets"))),
	filter =
	dimension("first_hashtag") %~% "academyawards" \|
	dimension("first_hashtag") %~% "oscars",
	dimensions = list("first_hashtag"))
	</code></pre></div>
	<p>See the <a href="https://github.com/metamx/RDruid/wiki/Examples">RDruid wiki</a> for more examples.</p>

	<p>The point to remember is that this data is being streamed into Druid and brought into R via RDruid in realtime. For example, with an R script the data could be continuously queried, updated, and analyzed.</p>

	</div>
	</div>
	</div>
	</div>


	<!-- Start page_footer include -->
	<footer class="druid-footer">
	<div class="container">
	<div class="text-center">
	<p>
	<a href="/technology">Technology</a>&ensp;·&ensp;
	<a href="/use-cases">Use Cases</a>&ensp;·&ensp;
	<a href="/druid-powered">Powered by Druid</a>&ensp;·&ensp;
	<a href="/docs/latest/">Docs</a>&ensp;·&ensp;
	<a href="/community/">Community</a>&ensp;·&ensp;
	<a href="/downloads.html">Download</a>&ensp;·&ensp;
	<a href="/faq">FAQ</a>
	</p>
	</div>
	<div class="text-center">
	<a title="Join the user group" href="https://groups.google.com/forum/#!forum/druid-user" target="_blank"><span class="fa fa-comments"></span></a>&ensp;·&ensp;
	<a title="Follow Druid" href="https://twitter.com/druidio" target="_blank"><span class="fab fa-twitter"></span></a>&ensp;·&ensp;
	<a title="GitHub" href="https://github.com/apache/druid" target="_blank"><span class="fab fa-github"></span></a>
	</div>
	<div class="text-center license">
	Copyright © 2020 <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br>
	Except where otherwise noted, licensed under <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a>.<br>
	Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
	</div>
	</div>
	</footer>

	<script async src="https://www.googletagmanager.com/gtag/js?id=UA-131010415-1"></script>
	<script>
	window.dataLayer = window.dataLayer \|\| [];
	function gtag(){dataLayer.push(arguments);}
	gtag('js', new Date());
	gtag('config', 'UA-131010415-1');
	</script>
	<script>
	function trackDownload(type, url) {
	ga('send', 'event', 'download', type, url);
	}
	</script>
	<script src="//code.jquery.com/jquery.min.js"></script>
	<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
	<script src="/assets/js/druid.js"></script>
	<!-- stop page_footer include -->


	</body>
	</html>