<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
<atom:link href="" rel="self" type="application/rss+xml" />
<lastBuildDate>Tue, 03 Nov 2015 08:00:45 +0000</lastBuildDate>
<title>Dimensions Computation (Aggregate Navigator) Part 1: Intro</title>
<pubDate>Tue, 03 Nov 2015 08:00:29 +0000</pubDate>
<dc:creator><![CDATA[tim farkas]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>Introduction In the world of big data, enterprises have a common problem. They have large volumes of data flowing into their systems for which they need to observe historical trends in real-time. Consider the case of a digital advertising publisher that is receiving hundreds of thousands of click events every second. Looking at the history [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Dimensions Computation (Aggregate Navigator) Part 1: Intro</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In the world of big data, enterprises have a common problem. They have large volumes of data flowing into their systems for which they need to observe historical trends in real-time. Consider the case of a digital advertising publisher that is receiving hundreds of thousands of click events every second. Looking at the history of individual clicks and impressions doesn’t tell the publisher much about what is going on. A technique the publisher might employ is to track the total number of clicks and impressions for every second, minute, hour, and day. Such a technique might help find global trends in their systems, but may not provide enough granularity to take action on localized trends. The technique will need to be powerful enough to spot local trends. For example, the total clicks and impressions for an advertiser, a geographical area, or a combination of the two can provide some actionable insight. This process of receiving individual events, aggregating them over time, and drilling down into the data using some parameters like “advertiser” and “location” is called Dimensions Computation.</p>
<p>Dimensions Computation is a powerful mechanism that allows you to spot trends in your streaming data in real-time. In this post we’ll cover the key concepts behind Dimensions Computation and outline the process of performing Dimensions Computation. We will also show you how to use DataTorrent’s out-of-the-box enterprise operators to easily add Dimensions Computation to your application.</p>
<h2 id="the-process">The Process</h2>
<p>Let us continue with our example of an advertising publisher and look at the steps the publisher might take to ensure that large volumes of raw advertisement data are converted into meaningful historical views of their advertisement events.</p>
<h3 id="the-data">The Data</h3>
<p>Typically advertising publishers receive packets of information for each advertising event. The events that a publisher receives might look like this:</p>
<pre class="prettyprint"><code class=" hljs cs"><span class="hljs-keyword">public</span> <span class="hljs-keyword">class</span> AdEvent
{
  <span class="hljs-comment">//The name of the company that is advertising</span>
  <span class="hljs-keyword">public</span> String advertiser;
  <span class="hljs-comment">//The geographical location of the person initiating the event</span>
  <span class="hljs-keyword">public</span> String location;
  <span class="hljs-comment">//How much the advertiser was charged for the event</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">double</span> cost;
  <span class="hljs-comment">//How much revenue was generated for the advertiser</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">double</span> revenue;
  <span class="hljs-comment">//The number of impressions the advertiser received from this event</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> impressions;
  <span class="hljs-comment">//The number of clicks the advertiser received from this event</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> clicks;
  <span class="hljs-comment">//The timestamp of the event in milliseconds</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">long</span> time;
}</code></pre>
<p>The class <strong>AdEvent</strong> contains two types of data:</p>
<ul>
<li><strong>Aggregates</strong>: The data that is combined using aggregators.</li>
<li><strong>Keys</strong>: The data that is used to select aggregations at a finer granularity.</li>
</ul>
<h4 id="aggregates">Aggregates</h4>
<p>The aggregates in our <strong>AdEvent</strong> object are the pieces of data that we must combine using aggregators in order to obtain a meaningful historical view. In this case, it makes sense to combine cost, revenue, impressions, and clicks, so these are our aggregates. We won’t obtain anything useful by aggregating the location and advertiser strings in our <strong>AdEvent</strong>, so those are not considered aggregates. It’s important to note that aggregates are considered separate entities. This means that the cost field of an <strong>AdEvent</strong> cannot be combined with its revenue field; cost values can only be aggregated with other cost values, and revenue values can only be aggregated with other revenue values.</p>
<p>In summary, the aggregates in our <strong>AdEvent</strong> are:</p>
<ul>
<li>cost</li>
<li>revenue</li>
<li>impressions</li>
<li>clicks</li>
</ul>
<h4 id="keys">Keys</h4>
<p>The keys in our <strong>AdEvent</strong> object are used for selecting aggregations at a finer granularity. For example, it would make sense to look at the number of clicks for a particular advertiser, the number of clicks for a certain location, and the number of clicks for a certain location and advertiser combination. So location and advertiser are keys. Time is another key, since it is useful to look at the number of clicks received in a particular time range (for example, 12:00 pm through 1:00 pm or 12:00 pm through 12:01 pm).</p>
<p>In summary, the keys in our <strong>AdEvent</strong> are:</p>
<ul>
<li>advertiser</li>
<li>location</li>
<li>time</li>
</ul>
<h3 id="computing-the-aggregations">Computing The Aggregations</h3>
<p>When the publisher receives a new <strong>AdEvent</strong>, the event is added to running aggregations in real time. The keys and aggregates in the <strong>AdEvent</strong> are used to compute aggregations. How the aggregations are computed and the number of aggregations computed are determined by three tunable parameters:</p>
<ul>
<li><strong>Aggregators</strong></li>
<li><strong>Dimension Combinations</strong></li>
<li><strong>Time Buckets</strong></li>
</ul>
<h4 id="aggregators">Aggregators</h4>
<p>Dimensions Computation supports more than just one type of aggregation, and multiple aggregators can be used to combine incoming data at once. Some of the aggregators available out-of-the-box are:</p>
<p>As an example, suppose the publisher is not using the keys in their <strong>AdEvents</strong> and this publisher wants to perform a sum and a max aggregation.</p>
<p><strong>1.</strong> An AdEvent arrives. The AdEvent is aggregated to the Sum and Max aggregations.<br />
<img title="" src=";h=720" alt="Adding Aggregate" /><br />
<strong>2.</strong> Another AdEvent arrives. The AdEvent is aggregated to the Sum and Max aggregations.<br />
<img title="" src=";h=720" alt="Adding Aggregate" /></p>
<p>As can be seen from the example above, each <strong>AdEvent</strong> contributes to two aggregations.</p>
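<p>The keyless Sum and Max case can be sketched in a few lines of Java. This is only an illustration, not DataTorrent’s operator code: the class name <code>SumMaxSketch</code> and its <code>add</code> method are hypothetical, and the event values are made up.</p>

```java
// Hypothetical sketch: running Sum and Max aggregations with no keys.
// Not part of any DataTorrent API.
class SumMaxSketch {
    // One running value per aggregate field for the Sum aggregator.
    double sumCost;
    double sumRevenue;
    long sumImpressions;
    long sumClicks;

    // One running value per aggregate field for the Max aggregator.
    double maxCost = Double.NEGATIVE_INFINITY;
    double maxRevenue = Double.NEGATIVE_INFINITY;
    long maxImpressions = Long.MIN_VALUE;
    long maxClicks = Long.MIN_VALUE;

    // Fold one event into both aggregations. Each field is combined only
    // with the same field of earlier events (cost with cost, and so on).
    void add(double cost, double revenue, long impressions, long clicks) {
        sumCost += cost;
        sumRevenue += revenue;
        sumImpressions += impressions;
        sumClicks += clicks;

        maxCost = Math.max(maxCost, cost);
        maxRevenue = Math.max(maxRevenue, revenue);
        maxImpressions = Math.max(maxImpressions, impressions);
        maxClicks = Math.max(maxClicks, clicks);
    }

    public static void main(String[] args) {
        SumMaxSketch aggs = new SumMaxSketch();
        aggs.add(5.0, 10.0, 3, 1); // first made-up event
        aggs.add(2.0, 20.0, 7, 0); // second made-up event
        System.out.println("sum of clicks = " + aggs.sumClicks);
        System.out.println("max of impressions = " + aggs.maxImpressions);
    }
}
```

<p>Each call to <code>add</code> updates exactly two aggregations, one per aggregator, which matches the count given above.</p>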
<h4 id="dimension-combinations">Dimension Combinations</h4>
<p>Each <strong>AdEvent</strong> does not necessarily contribute to only one aggregation for each aggregator. In our advertisement example there are 4 <strong>dimension combinations</strong> over which aggregations can be computed.</p>
<ul>
<li><strong>advertiser:</strong> This dimension combination comprises just the advertiser value. This means that all the aggregates for <strong>AdEvents</strong> with a particular value for advertiser (for example, Gamestop) are aggregated.</li>
<li><strong>location:</strong> This dimension combination comprises just the location value. This means that all the aggregates for <strong>AdEvents</strong> with a particular value for location (for example, California) are aggregated.</li>
<li><strong>advertiser, location:</strong> This dimension combination comprises the advertiser and location values. This means that all the aggregates for <strong>AdEvents</strong> with the same advertiser and location combination (for example, Gamestop, California) are aggregated.</li>
<li><strong>the empty combination:</strong> This combination is a <em>global aggregation</em> because it doesn’t use any of the keys in the <strong>AdEvent</strong>. This means that all the <strong>AdEvents</strong> are aggregated.</li>
</ul>
<p>Therefore if a publisher is using the four dimension combinations enumerated above along with the sum and max aggregators, the number of aggregations being maintained would be:</p>
<p>NUM_AGGS = 2 x <em>(number of unique advertisers)</em> + 2 x <em>(number of unique locations)</em> + 2 x <em>(number of unique advertiser and location combinations)</em> + 2</p>
<p>And each individual <strong>AdEvent</strong> will contribute to <em>(number of aggregators)</em> x <em>(number of dimension combinations)</em> aggregations.</p>
<p>Here is an example of how NUM_AGGS aggregations are computed:</p>
<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
<img title="" src=";h=720" alt="Adding Aggregate" /><br />
<strong>2.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
<img title="" src=";h=720" alt="Adding Aggregate" /><br />
<strong>3.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to aggregations for each aggregator and each dimension combination.<br />
<img title="" src=";h=720" alt="Adding Aggregate" /></p>
<p>As can be seen from the example above, each <strong>AdEvent</strong> contributes to 2 x 4 = 8 aggregations, and there are 2 x 2 + 2 x 2 + 2 x 3 + 2 = 16 aggregations in total.</p>
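<p>The NUM_AGGS arithmetic can be checked with a small Java sketch. The class <code>DimensionCombinationSketch</code>, its <code>countAggregations</code> method, and the three sample events are hypothetical; the events were chosen only so that there are two unique advertisers, two unique locations, and three unique advertiser and location combinations.</p>

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: counts the aggregations maintained for the four
// dimension combinations (advertiser, location, advertiser + location,
// and the empty combination). Not part of any DataTorrent API.
class DimensionCombinationSketch {

    // Each event is given as {advertiser, location}.
    static int countAggregations(String[][] events, int numAggregators) {
        Set<String> keys = new HashSet<>();
        for (String[] event : events) {
            String advertiser = event[0];
            String location = event[1];
            keys.add("advertiser=" + advertiser);
            keys.add("location=" + location);
            keys.add("advertiser=" + advertiser + "|location=" + location);
            keys.add(""); // the empty combination (global aggregation)
        }
        // Every distinct key holds one aggregation per aggregator.
        return numAggregators * keys.size();
    }

    public static void main(String[] args) {
        String[][] events = {
            {"Gamestop", "California"},
            {"Gamestop", "Nevada"},
            {"Walmart", "California"},
        };
        // Two aggregators, for example Sum and Max.
        System.out.println(countAggregations(events, 2));
    }
}
```

<p>For these events the sketch arrives at 2 x 2 + 2 x 2 + 2 x 3 + 2 = 16, and each event touches 2 x 4 = 8 of those aggregations.</p>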
<h4 id="time-buckets">Time Buckets</h4>
<p>In addition to computing multiple aggregations for each dimension combination, aggregations can also be performed over time buckets. Time buckets are windows of time (for example, 1:00 pm through 1:01 pm) that are specified by a simple notation: 1m is one minute, 1h is one hour, 1d is one day. When aggregations are performed over time buckets, separate aggregations are maintained for each time bucket. Aggregations for a time bucket are comprised only of events with a time stamp that falls into that time bucket.</p>
<p>An example of how these time bucketed aggregations are computed is as follows:</p>
<p>Let’s say our advertisement publisher is interested in computing the Sum and Max of <strong>AdEvents</strong> for the dimension combinations comprised of <strong>advertiser</strong> and <strong>location</strong> over 1 minute and 1 hour time buckets.</p>
<p><strong>1.</strong> An <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to the aggregations for the appropriate aggregator, dimension combination and time bucket.</p>
<p><img title="" src=";h=720" alt="Adding Aggregate" /></p>
<p><strong>2.</strong> Another <strong>AdEvent</strong> arrives. The <strong>AdEvent</strong> is applied to the aggregations for the appropriate aggregator, dimension combination and time bucket.</p>
<p><img title="" src=";h=720" alt="Adding Aggregate" /></p>
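<p>One way to picture time bucketing is to round each event’s timestamp down to the start of its bucket and use that value as an extra key. The Java sketch below assumes buckets are aligned to the Unix epoch in UTC, which the post does not state; <code>TimeBucketSketch</code> and <code>bucketStart</code> are hypothetical names.</p>

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: maps an event timestamp to the start of its time
// bucket. Assumes buckets are aligned to the Unix epoch (UTC).
class TimeBucketSketch {

    // Rounds timeMillis down to a multiple of bucketMillis.
    static long bucketStart(long timeMillis, long bucketMillis) {
        return timeMillis - Math.floorMod(timeMillis, bucketMillis);
    }

    public static void main(String[] args) {
        long minute = TimeUnit.MINUTES.toMillis(1); // the 1m bucket
        long hour = TimeUnit.HOURS.toMillis(1);     // the 1h bucket

        // Two made-up timestamps: 13:00:30 and 13:01:30 on the epoch day.
        long first = 13 * hour + 30_000L;
        long second = 13 * hour + 90_000L;

        // The events land in different 1m buckets but the same 1h bucket.
        System.out.println(bucketStart(first, minute) == bucketStart(second, minute));
        System.out.println(bucketStart(first, hour) == bucketStart(second, hour));
    }
}
```

<p>An event with a given advertiser and location would then be aggregated under the key (advertiser, location, bucket start) once for each configured time bucket.</p>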
<h4 id="conclusion">Conclusion</h4>
<p>In summary, the three tunable parameters discussed above (<strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and <strong>Time Buckets</strong>) determine how aggregations are computed. In the examples provided in the <strong>Aggregators</strong>, <strong>Dimension Combinations</strong>, and <strong>Time Buckets</strong> sections respectively, we have incrementally increased the complexity with which the aggregations are performed. The examples provided in the <strong>Aggregators</strong> and <strong>Dimension Combinations</strong> sections were for illustration purposes only; the example provided in the <strong>Time Buckets</strong> section provides an accurate view of how aggregations are computed within DataTorrent&#8217;s enterprise operators.</p>
<p>Download DataTorrent Sandbox <a href="" target="_blank">here</a></p>
<p>Download DataTorrent Enterprise Edition <a href="" target="_blank">here</a></p>
<title>Cisco ACI, Big Data, and DataTorrent</title>
<pubDate>Tue, 27 Oct 2015 22:30:07 +0000</pubDate>
<dc:creator><![CDATA[Charu Madan]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>By: Harry Petty, Data Center and Cloud Networking, Cisco  (This blog has been developed in association with Farid Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, Director Business Development at DataTorrent. It was originally published on Cisco Blogs) If you work for an enterprise that’s looking to hit its digital sweet [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Cisco ACI, Big Data, and DataTorrent</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p>By: Harry Petty, Data Center and Cloud Networking, Cisco</p>
<p class="c0 c11"><a name="h.gjdgxs"></a><span class="c1"> (</span><span class="c4 c13">This blog has been developed in association with Farid Jiandani, Product Manager with Cisco’s Insieme Networks Business Unit and Charu Madan, Director Business Development at DataTorrent. It was originally published on <a href="" target="_blank">Cisco Blogs</a>)</span></p>
<p class="c0"><span class="c1">If you work for an enterprise that’s looking to hit its digital sweet spot, then you’re scrutinizing your sales, marketing and operations to see where you should make digital investments to innovate and improve productivity. Super-fast data processing at scale is being used to obtain real-time insights for digital business and Internet of Things (IoT) initiatives.</span></p>
<p class="c0"><span class="c1">According to Gartner Group, one of the cool vendors providing super-fast big data analysis using in-memory streaming analytics is DataTorrent, a startup founded by long-time ex-Yahoo! veterans with vast experience managing big data for leading-edge applications and infrastructure at massive scale. Their goal is to empower today’s enterprises to experience the full potential and business impact of big data with a platform that processes and analyzes data in real-time.</span></p>
<p class="c0"><span class="c1 c2">DataTorrent RTS</span></p>
<p class="c0"><span class="c4 c6">DataTorrent RTS is an open core</span><span class="c2 c4 c6">, enterprise-grade product powered by Apache Apex. </span><span class="c4 c6">DataTorrent RTS provides a single, unified batch and stream processing platform that enables organizations to reduce time to market, development costs and operational expenditures for big data analytics applications. </span></p>
<p class="c0"><span class="c1 c2">DataTorrent RTS Integration with ACI</span></p>
<p class="c0"><span class="c4 c6">A member of the Cisco ACI ecosystem, DataTorrent announced on September 29th DataTorrent RTS integration with Cisco </span><span class="c4 c6"><a class="c7" href=";sa=D&amp;usg=AFQjCNFMZhMYdUmPuuqrUI5IZmrvEhlK5g">Application Centric Infrastructure (ACI)</a></span><span class="c4 c6"> through the Application Policy Infrastructure Controller (APIC) to help create more efficient IT operations, bringing together network operations management and big data application management and development: </span></p>
<p class="c0"><span class="c4"><a class="c7" href=";sa=D&amp;usg=AFQjCNG4S_2-OY5ox5nCf_0_Qj7s-x9pyw">DataTorrent Integrates with Cisco ACI to Help Secure Big Data Processing Through a Unified Data and Network Fabric</a></span><span class="c4">. </span><span class="c2 c4">The joint solution enables</span></p>
<ul class="c8 lst-kix_list_2-0 start">
<li class="c12 c0"><span class="c4">A unified fabric approach for managing </span><span class="c2 c4">Applications, Data </span><span class="c4">and </span><span class="c2 c4">Network</span></li>
<li class="c0 c12"><span class="c4">A highly secure and automated Big Data application platform which uses the power of Cisco ACI for automation and security policy management </span></li>
<li class="c12 c0"><span class="c4">The creation, repository, and enforcement point for Cisco ACI application policies for big data applications</span></li>
</ul>
<p class="c0"><span class="c4">With the ACI integration, secure connectivity to diverse data sets becomes part of a user-defined policy that is automated and does not compromise on security and access management. As an example, if a DataTorrent big data application needs access to, say, a Kafka source, then without ACI all nodes need to be opened up. This leaves the environment vulnerable and prone to attacks. With ACI, the access management policies and contracts define the connectivity, so only the right node and the right application get access. See Figures 1 and 2 for an illustration of this concept. </span></p>
<p class="c0"><span class="c1 c2">Figure 1:</span></p>
<p class="c0"><a href=""><img class="alignnone size-full wp-image-2349" src="" alt="image00" width="432" height="219" /></a></p>
<p class="c0"><span class="c1 c2">Figure 2</span></p>
<p class="c0"><a href=""><img class="alignnone size-full wp-image-2350" src="" alt="image01" width="904" height="493" /></a></p>
<p class="c0"><span class="c1 c2">ACI Support for Big Data Solutions</span></p>
<p class="c0"><span class="c1">The openness and flexibility of ACI allow big data customers to run a wide variety of applications within their fabric alongside Hadoop. Due to the elasticity of ACI, customers can run batch processing alongside stream processing and other database applications in a seamless fashion. In traditional Hadoop environments, the network is segmented based on individual server nodes (see Figure 1). This makes it difficult to elastically allow access to and from different applications. Within the ACI framework, logical demarcation points can be created based on application workloads rather than physical server groups (a set of Hadoop nodes should be treated as a single group rather than as a collection of individual server nodes).</span></p>
<p class="c0"><span class="c1 c2">A Robust and Active Ecosystem</span></p>
<p class="c0"><span class="c1">Many vendors claim they have a broad ecosystem of vendors, but sometimes that’s pure marketing, without any real integration efforts going on behind the slideware. But Cisco’s Application Centric Infrastructure (ACI) has a very active ecosystem of industry leaders who are putting significant resources into integration efforts, taking advantage of ACI’s open Northbound and Southbound APIs. DataTorrent is just one example of an innovative company that is using ACI integration to add value to their solutions and deliver real benefits to their channel partners and customers.</span></p>
<p class="c0"><span class="c1">Stay tuned for more success stories to come: we’ll continue to showcase industry leaders that are taking advantage of the open ACI APIs.</span></p>
<p class="c0"><span class="c1">Additional References</span></p>
<p class="c0"><span class="c3"><a class="c7" href=";sa=D&amp;usg=AFQjCNHPa1zEn6-1fEWQeCgZ-QmP9te5ig"></a></span></p>
<p class="c0"><span class="c3"><a class="c7" href=";sa=D&amp;usg=AFQjCNGmS3P3mOU0DQen5F43--fDi25uWw"></a></span></p>
<p class="c11 c0"><span class="c3"><a class="c7" href=";sa=D&amp;usg=AFQjCNHbzoCVBy0azkWTbjpqdyxPqkCo9g">http://www.datatorrent/</a></span></p>
<title>Write Your First Apache Apex Application in Scala</title>
<pubDate>Tue, 27 Oct 2015 01:58:25 +0000</pubDate>
<dc:creator><![CDATA[Tushar Gosavi]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>* Extend your Scala expertise to building Apache Apex applications * Scala is a modern, multi-paradigm programming language that elegantly integrates features of functional and object-oriented languages. Big data frameworks are already exploring Scala as a language of choice for implementations. Apache Apex is developed in Java, and the Apex APIs are such that writing [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Write Your First Apache Apex Application in Scala</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p><em>* Extend your Scala expertise to building Apache Apex applications *</em></p>
<p>Scala is a modern, multi-paradigm programming language that elegantly integrates features of functional and object-oriented languages. Big data frameworks are already exploring Scala as a language of choice for implementations. Apache Apex is developed in Java, and the Apex APIs make writing applications smooth sailing. Developers can use any programming language that runs on the JVM and can access Java classes. Because Scala has good interoperability with Java, running Apex applications written in Scala is a fuss-free experience. We will explain how to write an Apache Apex application in Scala.</p>
<p>Writing an <a href="" target="_blank">Apache Apex</a> application in Scala is simple.</p>
<h2 id="operators-within-the-application">Operators within the application</h2>
<p>We will develop a word count application in Scala. This application looks for new files in a directory. When new files become available, the application reads them, computes a count for each word, and prints the result on stdout. The application requires the following operators:</p>
<ul>
<li><strong>LineReader</strong> &#8211; This operator monitors directories for new files periodically. After a new file is detected, LineReader reads the file line-by-line, and makes lines available on the output port for the next operator.</li>
<li><strong>Parser</strong> &#8211; This operator receives lines read by LineReader on its input port. Parser breaks the line into words, and makes individual words available on the output port.</li>
<li><strong>UniqueCounter</strong> &#8211; This operator computes the count of each word received on its input port.</li>
<li><strong>ConsoleOutputOperator</strong> &#8211; This operator prints unique counts of words on standard output.</li>
</ul>
<h2 id="build-the-scala-word-count-application">Build the Scala word count application</h2>
<p>Now, we will generate a sample application using Maven’s archetype:generate goal.</p>
<h3 id="generate-a-sample-application">Generate a sample application.</h3>
<pre class="prettyprint"><code class="language-bash hljs ">mvn archetype:generate -DarchetypeRepository= -DarchetypeGroupId=com.datatorrent -DarchetypeArtifactId=apex-app-archetype -DarchetypeVersion=<span class="hljs-number">3.0</span>.<span class="hljs-number">0</span> -DgroupId=com.datatorrent -Dpackage=com.datatorrent.wordcount -DartifactId=wordcount -Dversion=<span class="hljs-number">1.0</span>-SNAPSHOT</code></pre>
<p>This creates a directory called <strong>wordcount</strong>, with a sample application and build script. Let us see how to modify this application into the Scala-based word count application that we are looking to develop.</p>
<h3 id="add-the-scala-build-plugin">Add the Scala build plugin.</h3>
<p>Apache Apex uses Maven for building the framework and operator library. Maven supports a plugin for compiling Scala files. To enable this plugin, add the following snippet to the <code>build -&gt; plugins</code> section of the <code>pom.xml</code> file that is located in the application directory.</p>
<pre class="prettyprint"><code class="language-xml hljs "><span class="hljs-tag">&lt;<span class="hljs-title">plugin</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">groupId</span>&gt;</span>net.alchim31.maven<span class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">artifactId</span>&gt;</span>scala-maven-plugin<span class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">version</span>&gt;</span>3.2.1<span class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">executions</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-title">execution</span>&gt;</span>
      <span class="hljs-tag">&lt;<span class="hljs-title">goals</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-title">goal</span>&gt;</span>compile<span class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
        <span class="hljs-tag">&lt;<span class="hljs-title">goal</span>&gt;</span>testCompile<span class="hljs-tag">&lt;/<span class="hljs-title">goal</span>&gt;</span>
      <span class="hljs-tag">&lt;/<span class="hljs-title">goals</span>&gt;</span>
    <span class="hljs-tag">&lt;/<span class="hljs-title">execution</span>&gt;</span>
  <span class="hljs-tag">&lt;/<span class="hljs-title">executions</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-title">plugin</span>&gt;</span></code></pre>
<p>Also, add the Scala library as a dependency in the <code>pom.xml</code> file.</p>
<pre class="prettyprint"><code class="language-xml hljs "><span class="hljs-tag">&lt;<span class="hljs-title">dependency</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">groupId</span>&gt;</span>org.scala-lang<span class="hljs-tag">&lt;/<span class="hljs-title">groupId</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">artifactId</span>&gt;</span>scala-library<span class="hljs-tag">&lt;/<span class="hljs-title">artifactId</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-title">version</span>&gt;</span>2.11.2<span class="hljs-tag">&lt;/<span class="hljs-title">version</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-title">dependency</span>&gt;</span></code></pre>
<p>We are now set to write a Scala application.</p>
<h2 id="write-your-scala-word-count-application">Write your Scala word count application</h2>
<h3 id="linereader">LineReader</h3>
<p>The <a href="" target="_blank">Apache Malhar library</a> contains an <a href="" target="_blank">AbstractFileInputOperator</a> operator that monitors and reads files from a directory. This operator has capabilities such as support for scaling, fault tolerance, and exactly-once processing. To complete the functionality, override a few methods:<br />
<em>readEntity</em> : Reads a line from a file.<br />
<em>emit</em> : Emits data read on the output port.<br />
We also override openFile to obtain the BufferedReader instance that is needed to read lines from the file, and closeFile to close that BufferedReader.</p>
<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LineReader</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">AbstractFileInputOperator</span>[<span class="hljs-title">String</span>] {</span>

  <span class="hljs-comment">// Output port on which each line is emitted</span>
  <span class="hljs-annotation">@transient</span>
  <span class="hljs-keyword">val</span> out : DefaultOutputPort[String] = <span class="hljs-keyword">new</span> DefaultOutputPort[String]();

  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> readEntity(): String = br.readLine()

  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> emit(line: String): Unit = out.emit(line)

  <span class="hljs-comment">// Wrap the opened stream in a BufferedReader for line-by-line reads</span>
  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> openFile(path: Path): InputStream = {
    <span class="hljs-keyword">val</span> in = <span class="hljs-keyword">super</span>.openFile(path)
    br = <span class="hljs-keyword">new</span> BufferedReader(<span class="hljs-keyword">new</span> InputStreamReader(in))
    <span class="hljs-keyword">return</span> in
  }

  <span class="hljs-comment">// Close the reader along with the underlying stream</span>
  <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> closeFile(is: InputStream): Unit = {
    br.close()
    <span class="hljs-keyword">super</span>.closeFile(is)
  }

  <span class="hljs-annotation">@transient</span>
  <span class="hljs-keyword">private</span> <span class="hljs-keyword">var</span> br : BufferedReader = <span class="hljs-keyword">null</span>
}</code></pre>
<p>Some Apex API classes are not serializable and must be declared transient. Scala provides the @transient annotation for such scenarios. Any object that is not part of the operator’s state must be annotated with @transient. For example, in this code, we have annotated the buffered reader and the output port as transient.</p>
<h3 id="parser">Parser</h3>
<p>Parser splits lines using Java’s built-in split method, and emits individual words to the output port.</p>
<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Parser</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">BaseOperator</span> {</span>
  <span class="hljs-comment">// Regular expression used to split a line into words</span>
  <span class="hljs-annotation">@BeanProperty</span>
  <span class="hljs-keyword">var</span> regex : String = <span class="hljs-string">" "</span>

  <span class="hljs-annotation">@transient</span>
  <span class="hljs-keyword">val</span> out = <span class="hljs-keyword">new</span> DefaultOutputPort[String]()

  <span class="hljs-annotation">@transient</span>
  <span class="hljs-keyword">val</span> in = <span class="hljs-keyword">new</span> DefaultInputPort[String]() {
    <span class="hljs-comment">// Emit every word of the incoming line on the output port</span>
    <span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> process(t: String): Unit = {
      <span class="hljs-keyword">for</span>(w &lt;- t.split(regex)) out.emit(w)
    }
  }
}</code></pre>
<p>Scala automatically generates setters and getters for properties annotated with @BeanProperty. Properties annotated with @BeanProperty can be modified at the time of launching an application by using configuration files. You can also modify such properties while an application is running. For example, you can specify the regular expression used for splitting within the configuration file.</p>
<h3 id="uniquecount-and-consoeloutputoperator">UniqueCounter and ConsoleOutputOperator</h3>
<p>For this application, let us use UniqueCounter and ConsoleOutputOperator as is.</p>
<h3 id="put-together-the-word-count-application">Put together the word count application</h3>
<p>Writing the main application class in Scala is similar to doing it in Java. First, override the populateDAG() method to get an instance of the DAG object. Then add operators to this instance using the addOperator() method. Finally, connect the operators with the addStream() method.</p>
<pre class="prettyprint"><code class="language-scala hljs "><span class="hljs-annotation">@ApplicationAnnotation</span>(name=<span class="hljs-string">"WordCount"</span>)
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Application</span> <span class="hljs-keyword">extends</span> <span class="hljs-title">StreamingApplication</span> {</span>
<span class="hljs-keyword">override</span> <span class="hljs-keyword">def</span> populateDAG(dag: DAG, configuration: Configuration): Unit = {
<span class="hljs-keyword">val</span> input = dag.addOperator(<span class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> LineReader)
<span class="hljs-keyword">val</span> parser = dag.addOperator(<span class="hljs-string">"parser"</span>, <span class="hljs-keyword">new</span> Parser)
<span class="hljs-keyword">val</span> counter = dag.addOperator(<span class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> UniqueCounter[String])
<span class="hljs-keyword">val</span> out = dag.addOperator(<span class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> ConsoleOutputOperator)
dag.addStream[String](<span class="hljs-string">"lines"</span>, input.out, parser.in)
dag.addStream[String](<span class="hljs-string">"words"</span>, parser.out, counter.data)
dag.addStream[java.util.HashMap[String,Integer]](<span class="hljs-string">"counts"</span>, counter.count, out.input)
}
}</code></pre>
<h2 id="running-application">Running application</h2>
<p>Before running the word count application, specify the input directory for the input operator. You can use the default configuration file for this. Open the <em>src/main/resources/META-INF/properties.xml</em> file, and add the following lines inside the &lt;configuration&gt; element. Do not forget to replace “username” with your Hadoop username.</p>
<pre class="prettyprint"><code class="language-xml hljs "><span class="hljs-tag">&lt;<span class="hljs-title">property</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-title">name</span>&gt;</span><span class="hljs-tag">&lt;/<span class="hljs-title">name</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-title">value</span>&gt;</span>/user/username/data<span class="hljs-tag">&lt;/<span class="hljs-title">value</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-title">property</span>&gt;</span></code></pre>
<p>Build the application from the application directory using this command:</p>
<pre class="prettyprint"><code class="language-bash hljs ">mvn clean install -DskipTests</code></pre>
<p>You should now have an application package in the target directory.</p>
<p>Now, launch this application package using dtcli.</p>
<pre class="prettyprint"><code class="language-bash hljs ">$ dtcli
DT CLI <span class="hljs-number">3.2</span>.<span class="hljs-number">0</span>-SNAPSHOT <span class="hljs-number">28.09</span>.<span class="hljs-number">2015</span> @ <span class="hljs-number">12</span>:<span class="hljs-number">45</span>:<span class="hljs-number">15</span> IST rev: <span class="hljs-number">8</span>e49cfb branch: devel-<span class="hljs-number">3</span>
dt&gt; launch target/wordcount-<span class="hljs-number">1.0</span>-SNAPSHOT.apa
{<span class="hljs-string">"appId"</span>: <span class="hljs-string">"application_1443354392775_0010"</span>}
dt (application_1443354392775_0010) &gt;</code></pre>
<p>Add some text files to the <em>/user/username/data</em> directory on your HDFS to see how the application works. You can see the words along with their counts in the container log of the console operator.</p>
<h2 id="summary">Summary</h2>
<p>Scala classes are JVM classes: they can inherit from Java classes, and Scala code can create and call Java objects transparently. That is why you can easily use your Scala skills to build Apex applications.<br />
To get started with creating your first application, see <a href=""></a>.</p>
<h2 id="see-also">See Also</h2>
<ul>
<li>Building Applications with Apache Apex and Malhar <a href=""></a></li>
<li>Scala tutorial for Java programmers <a href=""></a></li>
<li>Application Developer Guide <a href=""></a></li>
<li>Operator Developer Guide <a href=""></a></li>
<li>Malhar Operator Library <a href=""></a></li>
</ul>
<p>The post <a rel="nofollow" href="">Write Your First Apache Apex Application in Scala</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<title>Apache Apex Performance Benchmarks</title>
<pubDate>Tue, 20 Oct 2015 13:23:27 +0000</pubDate>
<dc:creator><![CDATA[Vlad Rozov]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>Why another benchmark blog? As engineers, we are skeptical of performance benchmarks developed and published by software vendors. Most of the time such benchmarks are biased towards the company’s own product in comparison to other vendors. Any reported advantage may be the result of selecting a specific use case better handled by the product or [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Apache Apex Performance Benchmarks</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p><b id="apex-performance-benchmarks" class="c2 c16"><span class="c0">Why another benchmark blog?</span></b></p>
<p class="c2">As engineers, we are skeptical of performance benchmarks developed and published by software vendors. Most of the time such benchmarks are biased towards the company’s own product in comparison to other vendors. Any reported advantage may be the result of selecting a specific use case better handled by the product, or of using an optimized configuration for one’s own product compared to the out-of-the-box configuration for other vendors’ products.</p>
<p class="c2">So, why another blog on the topic? The reason I decided to write this blog is to explain the rationale behind <a href="">DataTorrent’s</a> effort to develop and maintain a benchmark performance suite, to describe how the suite is used to certify various releases, and to seek community opinion on how the performance benchmark may be improved.</p>
<p class="c2 c4">Note: the performance numbers given here are only for reference and are by no means a comprehensive performance evaluation of <a href="">Apache APEX</a>; performance numbers can vary depending on different configurations or different use cases.</p>
<p class="c12 c2 subtitle"><strong>Benchmark application.</strong><img class=" aligncenter" title="" src="" alt="" /></p>
<p class="c2">To evaluate the performance of the <a href="">Apache APEX</a> platform, we use the Benchmark application, which has a simple <a href="">DAG</a> with only two operators. The first operator (<span class="c3">wordGenerator</span>) emits tuples and the second operator (<span class="c3">counter</span>) counts the tuples received. For such trivial operations, the operators add minimal overhead to CPU and memory consumption, which allows the throughput of the <a href="">Apache APEX</a> platform itself to be measured. As the operators don’t change from release to release, this application is suitable for comparing platform performance across releases.</p>
<p class="c2">Tuples are byte arrays of configurable length. This minimizes the complexity of tuple serialization and at the same time allows the platform’s performance to be examined against several different tuple sizes. The emitter (<span class="c3">wordGenerator</span>) operator may be configured to reuse the same byte array, avoiding operator-induced garbage collection, or to allocate a new byte array for every tuple emitted, which more closely simulates real application behavior.</p>
<p class="c2">The consumer (<span class="c3">counter</span>) operator collects the total number of tuples received and the wall-clock time in milliseconds passed between begin and end window. It writes the collected data to the log at the end of every 10th window.</p>
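<p class="c2">The bookkeeping described for the consumer can be sketched in plain Java. This is only an illustration, not the benchmark's actual source: the class name <code>WindowedCounter</code>, the explicit clock arguments, and the decision to reset the totals after each report are all assumptions made for the example.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the benchmark's counter operator; it does not use the Apex API.
class WindowedCounter {
    private static final int LOG_EVERY_N_WINDOWS = 10;

    private long tuples;         // tuples received since the last report
    private long elapsedMillis;  // wall-clock time spent inside windows since the last report
    private long windowStart;    // start time of the current window
    private int windowsSeen;
    final List<String> log = new ArrayList<>();

    void beginWindow(long nowMillis) {
        windowStart = nowMillis;
    }

    void process(byte[] tuple) {
        tuples++;
    }

    void endWindow(long nowMillis) {
        elapsedMillis += nowMillis - windowStart;
        // Report at the end of every 10th window (resetting here is an assumption).
        if (++windowsSeen % LOG_EVERY_N_WINDOWS == 0) {
            log.add("tuples=" + tuples + " millis=" + elapsedMillis);
            tuples = 0;
            elapsedMillis = 0;
        }
    }

    public static void main(String[] args) {
        WindowedCounter counter = new WindowedCounter();
        byte[] tuple = new byte[64];
        long clock = 0;
        for (int window = 0; window < 20; window++) {
            counter.beginWindow(clock);
            for (int i = 0; i < 1000; i++) {
                counter.process(tuple);
            }
            clock += 500; // pretend each window lasts 500 ms
            counter.endWindow(clock);
        }
        counter.log.forEach(System.out::println);
    }
}
```

<p class="c2">Passing the clock in explicitly keeps the sketch deterministic; a real operator would read the system time in its window callbacks instead.</p>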
<p class="c2">The data stream (<span class="c3">Generator2Counter</span>) connects the first operator output port to the second operator input port. The benchmark suite exercises all possible configurations for the stream locality:</p>
<ul class="c8 lst-kix_2ql03f9wui4c-0 start">
<li class="c2 c7">thread local (<span class="c3">THREAD_LOCAL</span>), when both operators are deployed into the same thread within a container, effectively serializing the operators’ computation;</li>
<li class="c2 c7">container local (<span class="c3">CONTAINER_LOCAL</span>), when both operators are deployed into the same container and execute in two different threads;</li>
<li class="c2 c7">node local (<span class="c3">NODE_LOCAL</span>)<sup><a href="#ftnt_ref1">[1]</a></sup>, when operators are deployed into two different containers running on the same YARN node;</li>
<li class="c2 c7">rack local (<span class="c3">RACK_LOCAL</span>)<sup><a href="#ftnt_ref2">[2]</a></sup>, when operators are deployed into two different containers running on YARN nodes residing on the same rack;</li>
<li class="c2 c7">no locality, when operators are deployed into two different containers running on any Hadoop cluster node.</li>
</ul>
<p class="c2 c12 subtitle"><span class="c0"><b><a href="">Apache APEX</a> release performance certification</b></span></p>
<p class="c2">The benchmark application is a part of <a href="">Apache APEX</a> release certification. It is executed on <a href="">DataTorrent’s</a> development Hadoop cluster by an automated script that launches the application with every supported <span class="c3">Generator2Counter</span> stream locality and with tuple byte array lengths of 64, 128, 256, 512, 1024, 2048, and 4096. For each running application, the script collects the number of tuples emitted, the number of tuples processed, and the <span class="c3">counter</span> operator latency; it shuts the application down after 5 minutes and moves on to the next configuration. Covering all configurations takes the script between 6 and 8 hours, depending on the development cluster load.</p>
<p class="c12 c2 subtitle"><span class="c0"><b>Benchmark results</b></span></p>
<p class="c2">As each supported stream locality has distinct performance characteristics (with the exception of rack local and no locality, because the development cluster is set up on a single rack), I use a separate chart for each stream locality.</p>
<p class="c2">Overall the results are self-explanatory, and I hope that anyone who uses, plans to use, or plans to contribute to the <a href="">Apache APEX</a> project finds them useful. A few notes seem worth mentioning:</p>
<ul class="c8 lst-kix_5u2revq5rd1r-0 start">
<li class="c2 c7">There is no performance regression in APEX release 3.0 compared to release 2.0.</li>
<li class="c2 c7">The benchmark was executed with default settings for buffer server spooling (turned on by default in release 3.0 and off in release 2.0). As a result, the benchmark application required just 2 GB of memory for the <span class="c3">wordGenerator</span> operator container in release 3.0, while it was necessary to allocate 8 GB to the same container in release 2.0.</li>
<li class="c2 c7">As tuple size increases, JVM garbage collection starts to play a larger role in benchmark performance than locality does.</li>
<li class="c2 c7">Thread local outperforms all other stream localities only for trivial operators such as the ones we specifically designed for the benchmark.</li>
<li class="c2 c7">The benchmark was performed on the development cluster while other developers were using it.<img title="" src="" alt="" /></li>
</ul>
<p class="c2"><img title="" src="" alt="" /></p>
<p class="c2 c17"><img title="" src="" alt="" /></p>
<hr class="c10" />
<p class="c2 c13"><a name="ftnt_ref1"></a>[1]<span class="c6"> NODE_LOCAL is currently excluded from the benchmark test due to a known limitation. Please see </span><span class="c6 c9"><a class="c5" href="">APEX-123</a></span>.</p>
<p class="c2 c13"><a name="ftnt_ref2"></a>[2]<span class="c6"> RACK_LOCAL is not yet fully implemented by APEX and is currently equivalent to no locality specified</span></p>
<p>The post <a rel="nofollow" href="">Apache Apex Performance Benchmarks</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<title>Introduction to dtGateway</title>
<pubDate>Tue, 06 Oct 2015 13:00:48 +0000</pubDate>
<dc:creator><![CDATA[David Yan]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>A platform, no matter how much it can do, and how technically superb it is, does not delight users without a proper UI or an API. That’s why there are products such as Cloudera Manager and Apache Ambari to improve the usability of the Hadoop platform. At DataTorrent, in addition to excellence in technology, we [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Introduction to dtGateway</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p>A platform, no matter how much it can do, and how technically superb it is, does not delight users without a proper UI or an API. That’s why there are products such as Cloudera Manager and Apache Ambari to improve the usability of the Hadoop platform. At DataTorrent, in addition to excellence in technology, we strive for user delight. One of the main components of DataTorrent RTS is dtGateway. dtGateway is the window to your DataTorrent RTS installation. It is a Java-based multithreaded web server that allows you to easily access information and perform various operations on DataTorrent RTS, and it is the server behind dtManage. It can run on any node in your Hadoop cluster or any other node that can access your Hadoop nodes, and is installed as a system service automatically by the RTS installer.</p>
<p>dtGateway talks to all running Apex App Masters, as well as the Node Managers and the Resource Manager in the Hadoop cluster, in order to gather all the information and to perform all the operations users may need.</p>
<p><img title="" src="" alt="dtGateway diagram" /></p>
<p>These features are exposed through a REST API. Here are some of the things you can do with the REST API:</p>
<ul>
<li>Get performance metrics (e.g. CPU, memory usage, tuples per second, latency) and other details of all Apex application instances</li>
<li>Get performance metrics and other details of physical and logical operators of each Apex application instance</li>
<li>Get performance metrics and other details of individual containers used by each Apex application instance</li>
<li>Retrieve container logs</li>
<li>Dynamically change operator properties, and add and remove operators from the DAG of a running Apex application</li>
<li>Record and retrieve tuples on the fly</li>
<li>Shut down a running container or an entire Apex application</li>
<li>Dynamically change the logging level of a container</li>
<li>Create, manage, and view custom system alerts</li>
<li>Create, manage, and interact with dtDashboard</li>
<li>Create, manage, and launch Apex App Packages</li>
<li>Run basic health checks of the cluster</li>
</ul>
<p>Here is an example of using the curl command to access dtGateway’s REST API to get the details of a physical operator with ID=40 of application instance with ID=application_1442448722264_14891, assuming dtGateway is listening at localhost:9090:</p>
<pre class="prettyprint"><code class="language-bash hljs ">$ curl http://localhost:<span class="hljs-number">9090</span>/ws/v2/applications/application_1442448722264_14891/physicalPlan/operators/<span class="hljs-number">40</span>
{
  <span class="hljs-string">"checkpointStartTime"</span>: <span class="hljs-string">"1442512091772"</span>,
  <span class="hljs-string">"checkpointTime"</span>: <span class="hljs-string">"175"</span>,
  <span class="hljs-string">"checkpointTimeMA"</span>: <span class="hljs-string">"164"</span>,
  <span class="hljs-string">"className"</span>: <span class="hljs-string">"com.datatorrent.contrib.kafka.KafkaSinglePortOutputOperator"</span>,
  <span class="hljs-string">"container"</span>: <span class="hljs-string">"container_e08_1442448722264_14891_01_000017"</span>,
  <span class="hljs-string">"counters"</span>: null,
  <span class="hljs-string">"cpuPercentageMA"</span>: <span class="hljs-string">"0.2039266316727741"</span>,
  <span class="hljs-string">"currentWindowId"</span>: <span class="hljs-string">"6195527785184762469"</span>,
  <span class="hljs-string">"failureCount"</span>: <span class="hljs-string">"0"</span>,
  <span class="hljs-string">"host"</span>: <span class="hljs-string">""</span>,
  <span class="hljs-string">"id"</span>: <span class="hljs-string">"40"</span>,
  <span class="hljs-string">"lastHeartbeat"</span>: <span class="hljs-string">"1442512100742"</span>,
  <span class="hljs-string">"latencyMA"</span>: <span class="hljs-string">"5"</span>,
  <span class="hljs-string">"logicalName"</span>: <span class="hljs-string">"QueryResult"</span>,
  <span class="hljs-string">"metrics"</span>: {},
  <span class="hljs-string">"name"</span>: <span class="hljs-string">"QueryResult"</span>,
  <span class="hljs-string">"ports"</span>: [
    {
      <span class="hljs-string">"bufferServerBytesPSMA"</span>: <span class="hljs-string">"0"</span>,
      <span class="hljs-string">"name"</span>: <span class="hljs-string">"inputPort"</span>,
      <span class="hljs-string">"queueSizeMA"</span>: <span class="hljs-string">"1"</span>,
      <span class="hljs-string">"recordingId"</span>: null,
      <span class="hljs-string">"totalTuples"</span>: <span class="hljs-string">"6976"</span>,
      <span class="hljs-string">"tuplesPSMA"</span>: <span class="hljs-string">"0"</span>,
      <span class="hljs-string">"type"</span>: <span class="hljs-string">"input"</span>
    }
  ],
  <span class="hljs-string">"recordingId"</span>: null,
  <span class="hljs-string">"recoveryWindowId"</span>: <span class="hljs-string">"6195527785184762451"</span>,
  <span class="hljs-string">"status"</span>: <span class="hljs-string">"ACTIVE"</span>,
  <span class="hljs-string">"totalTuplesEmitted"</span>: <span class="hljs-string">"0"</span>,
  <span class="hljs-string">"totalTuplesProcessed"</span>: <span class="hljs-string">"6976"</span>,
  <span class="hljs-string">"tuplesEmittedPSMA"</span>: <span class="hljs-string">"0"</span>,
  <span class="hljs-string">"tuplesProcessedPSMA"</span>: <span class="hljs-string">"20"</span>,
  <span class="hljs-string">"unifierClass"</span>: null
}</code></pre>
<p>For the complete spec of the REST API, please refer to our dtGateway REST API documentation <a href="" target="_blank">here</a>.</p>
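<p>The same request can be issued from Java. The sketch below is a minimal illustration, assuming Java 11 or later for <code>java.net.http</code>; the class name <code>GatewayClient</code> and its helper methods are hypothetical and not part of any DataTorrent library. Only the resource path is taken from the curl example above.</p>

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; only the resource path comes from the curl example.
class GatewayClient {
    // Builds the URI of a physical operator resource for a given application instance.
    static URI physicalOperatorUri(String gateway, String appId, int operatorId) {
        return URI.create("http://" + gateway + "/ws/v2/applications/" + appId
                + "/physicalPlan/operators/" + operatorId);
    }

    // Issues a GET and returns the raw JSON body; this needs a reachable dtGateway.
    static String fetch(URI uri) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        URI uri = physicalOperatorUri("localhost:9090", "application_1442448722264_14891", 40);
        System.out.println(uri);
        // Only contact the gateway when explicitly asked to, so the sketch runs offline.
        if (args.length > 0 && args[0].equals("--fetch")) {
            System.out.println(fetch(uri));
        }
    }
}
```

<p>Keeping the URI construction separate from the network call makes it easy to move to a different path prefix later without touching the request code.</p>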
<p>With great power comes great responsibility. Given all the information dtGateway has and everything it can do, the admin of DataTorrent RTS may want to restrict access to certain information and operations to only certain groups of users. This means dtGateway must support authentication and authorization.</p>
<p>For authentication, dtGateway can easily be integrated with an existing LDAP, Kerberos, or PAM framework. You can also choose to have dtGateway manage its own user database.</p>
<p>For authorization, dtGateway provides built-in role-based access control. The admin can decide which roles can view what information and perform what operations in dtGateway. The user-to-role mapping can be managed by dtGateway, or be integrated with LDAP roles.<br />
In addition, we provide access control with granularity to the application instance level as well as to the application package level. For example, you can control which users and which roles have read or write access to which application instances and to which application packages.</p>
<p>For more information, visit our dtGateway security documentation <a href="" target="_blank">here</a>.</p>
<p>An important part of user delight is backward compatibility. Imagine that, after a version upgrade, things start breaking because a “new feature” or a “bug fix” changed an API, so that components expecting the old API no longer work with the new version. That has to be a frustrating experience for the user!</p>
<p>When a user upgrades to a newer version of DataTorrent RTS, we guarantee that existing components or applications that work with previous minor versions of DataTorrent RTS still work. And that includes the REST API. Even when we release a major RTS version that has backward incompatible changes to the REST API spec, we will maintain backward compatibility by versioning the resource paths of the REST API (e.g. with a change in the prefix in the path /ws/v2 to /ws/v3) and maintaining the old spec until the end-of-life of the old version is reached.</p>
<p>We hope dtGateway is a delight to use for DataTorrent RTS users. We welcome any feedback.</p>
<p>The post <a rel="nofollow" href="">Introduction to dtGateway</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<title>Tracing DAGs from specification to execution</title>
<pubDate>Thu, 01 Oct 2015 04:09:00 +0000</pubDate>
<dc:creator><![CDATA[Thomas Weise]]></dc:creator>
<category><![CDATA[Apache Apex]]></category>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>How Apex orchestrates the DAG lifecycle Apache Apex (incubating) uses the concept of a DAG to represent an application’s processing logic. This blog will introduce the different perspectives within the architecture, starting from specification by the user to execution within the engine. Understanding DAGs DAG, or Directed Acyclic Graph, expresses processing logic as operators (vertices) [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Tracing DAGs from specification to execution</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<h2 id="how-apex-orchestrates-the-dag-lifecycle">How Apex orchestrates the DAG lifecycle</h2>
<p><a href="">Apache Apex (incubating)</a> uses the concept of a DAG to represent an application’s processing logic. This blog will introduce the different perspectives within the architecture, starting from specification by the user to execution within the engine.</p>
<h3 id="understanding-dags">Understanding DAGs</h3>
<p>DAG, or Directed Acyclic Graph, expresses processing logic as operators (vertices) and streams (edges) that together make an Apache® Apex (incubating) application. Just as the name suggests, the resulting graph must be acyclic, while specifying the logic that will be executed in sequence or in parallel. DAGs are used to exhibit dependencies, such as in event-based systems where previously occurred events lead to newer ones. The DAG concept is widely used, for example in revision control systems such as Git. Apex leverages the concept of a DAG to express how data is processed. Operators function as nodes within the graph, which are connected by a stream of events called tuples. </p>
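<p>To make the acyclic requirement concrete, here is a small, self-contained Java sketch of a graph of named operators that rejects cycles. It is not how Apex validates a DAG internally; the class <code>OperatorGraph</code> and its methods are hypothetical, and the check uses Kahn's topological sort purely as an illustration.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of operators (vertices) and streams (edges); not the Apex DAG class.
class OperatorGraph {
    // operator name -> names of the operators directly downstream of it
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    void addOperator(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
    }

    void addStream(String from, String to) {
        addOperator(from);
        addOperator(to);
        downstream.get(from).add(to);
    }

    // Kahn's algorithm: returns the operators in dependency order, or throws if a cycle exists.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String op : downstream.keySet()) {
            inDegree.put(op, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Operators without incoming streams are the roots (input operators).
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((op, degree) -> {
            if (degree == 0) {
                ready.add(op);
            }
        });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String op = ready.remove();
            order.add(op);
            for (String target : downstream.get(op)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        // Any operator left unvisited sits on a cycle.
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        OperatorGraph dag = new OperatorGraph();
        dag.addStream("input", "parser");
        dag.addStream("parser", "counter");
        dag.addStream("counter", "console");
        System.out.println(dag.topologicalOrder()); // prints [input, parser, counter, console]
    }
}
```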
<p>There are several frameworks in the wider Hadoop ecosystem that employ the DAG concept to model dependencies. Some of those trace back to MapReduce, where the processing logic is a two-operator sequence: map and reduce. This is simple but also rigid, as most processing pipelines have a more complex structure. Therefore, when using MapReduce directly, multiple map-reduce stages need to be chained together to achieve the overall goal. Coordination is not trivial, which led to the rise of higher-level frameworks that attempt to shield the user from this complexity, such as Pig, Hive, and Cascading. Early on, Pig and Hive translated directly into a MapReduce execution layer; later, Tez came into the picture as an alternative, common layer for optimization and execution. Other platforms such as Storm and Spark also represent the logic as a DAG, each with its own flavor of specification and a different execution layer architecture. </p>
<h3 id="dag-of-operators-represents-the-business-logic">DAG of operators represents the business logic</h3>
<p>Apex permits any operation to be applied to a stream of events and there is practically no limitation on the complexity of the ensuing DAG of operators. The full DAG blueprint is visible to the engine, which means that it can be translated into an end-to-end, fault-tolerant, scalable execution layer.</p>
<p>The operators represent the point where business logic is introduced. Operators receive events via input ports and emit events via output ports, which together represent the execution of user-defined functionality. Operators that don’t receive events on a port are called input operators. They receive events from external systems, thus acting as the roots of the DAG.</p>
<p>Operators can implement any functionality. It can be code that is very specific to a use case, or generic and broadly applicable functionality like the operators in the Apex Malhar operator library, which support reading from various sources, transformations, filtering, dimensional computation, and writing to a variety of destinations.</p>
<h3 id="specifying-a-dag">Specifying a DAG</h3>
<p>As discussed earlier, a DAG is composed of operators whose output ports are connected to the input ports of other operators. Operators can have multiple input and output ports, each of a different type. This simplifies operator programming because the port concept clearly highlights the source and type of each event. This information is visible to the Java compiler, thus enabling immediate feedback to the developer working in the IDE.</p>
<p>Similar to the DOM in a web browser, which can result from a static HTML source or a piece of JavaScript that created it on the fly, an Apex DAG can be created from different source representations, and dynamically modified after the application is running!</p>
<p><img src="" alt="Logical Plan" title=""></p>
<p>We refer to the DAG that was specified by the user as the “logical plan”. This is because upon launch it will be translated into a physical plan, and then mapped to an execution layer (more on this process below).</p>
<h3 id="a-simple-example">A simple example</h3>
<p>Let’s consider the example of the WordCount application, which is the de facto “hello world” application of Hadoop. Here is how this simple, sequential DAG looks: the input operator reads a file and emits lines. These lines form the “lines” stream, which in turn becomes the input for the parser operator. The parser operator splits each line into words for the counter operator. The counter operator emits (word, count) tuples to the console. </p>
<p><img src="" alt="WordCount DAG" title=""></p>
<p>The source for the logical plan can be in different formats. Using the Apex Java API, the WordCount example could look like this:</p>
<pre class="prettyprint"><code class="language-java hljs "><span class="hljs-annotation">@ApplicationAnnotation</span>(name=<span class="hljs-string">"MyFirstApplication"</span>)
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Application</span> <span class="hljs-keyword">implements</span> <span class="hljs-title">StreamingApplication</span>
{</span>
  <span class="hljs-annotation">@Override</span>
  <span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">populateDAG</span>(DAG dag, Configuration conf)
  {
    <span class="hljs-comment">// Add the four operators of the logical plan.</span>
    LineReader lineReader = dag.addOperator(<span class="hljs-string">"input"</span>, <span class="hljs-keyword">new</span> LineReader());
    Parser parser = dag.addOperator(<span class="hljs-string">"parser"</span>, <span class="hljs-keyword">new</span> Parser());
    UniqueCounter&lt;String&gt; counter = dag.addOperator(<span class="hljs-string">"counter"</span>, <span class="hljs-keyword">new</span> UniqueCounter&lt;String&gt;());
    ConsoleOutputOperator cons = dag.addOperator(<span class="hljs-string">"console"</span>, <span class="hljs-keyword">new</span> ConsoleOutputOperator());
    <span class="hljs-comment">// Connect output ports to input ports with named streams.</span>
    dag.addStream(<span class="hljs-string">"lines"</span>, lineReader.output, parser.input);
    dag.addStream(<span class="hljs-string">"words"</span>, parser.output, counter.data);
    dag.addStream(<span class="hljs-string">"counts"</span>, counter.count, cons.input);
  }
}</code></pre>
<p>The same WordCount application can be specified through JSON format (typically generated by a tool, such as the DataTorrent RTS application builder known as dtAssemble):</p>
<pre class="prettyprint"><code class="language-json hljs ">{
"<span class="hljs-attribute">displayName</span>": <span class="hljs-value"><span class="hljs-string">"WordCountJSON"</span></span>,
"<span class="hljs-attribute">operators</span>": <span class="hljs-value">[
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"input"</span></span>,
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"parse"</span></span>,
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"count"</span></span>,
"<span class="hljs-attribute">class</span>": <span class="hljs-value"><span class="hljs-string">"com.datatorrent.lib.algo.UniqueCounter"</span></span>,
"<span class="hljs-attribute">properties</span>": <span class="hljs-value">{
"<span class="hljs-attribute">com.datatorrent.lib.algo.UniqueCounter</span>": <span class="hljs-value">{
"<span class="hljs-attribute">cumulative</span>": <span class="hljs-value"><span class="hljs-literal">false</span>
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"console"</span></span>,
"<span class="hljs-attribute">streams</span>": <span class="hljs-value">[
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"lines"</span></span>,
"<span class="hljs-attribute">sinks</span>": <span class="hljs-value">[
"<span class="hljs-attribute">operatorName</span>": <span class="hljs-value"><span class="hljs-string">"parse"</span></span>,
"<span class="hljs-attribute">portName</span>": <span class="hljs-value"><span class="hljs-string">"input"</span>
"<span class="hljs-attribute">source</span>": <span class="hljs-value">{
"<span class="hljs-attribute">operatorName</span>": <span class="hljs-value"><span class="hljs-string">"input"</span></span>,
"<span class="hljs-attribute">portName</span>": <span class="hljs-value"><span class="hljs-string">"output"</span>
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"words"</span></span>,
"<span class="hljs-attribute">name</span>": <span class="hljs-value"><span class="hljs-string">"counts"</span></span>,
<p>As mentioned previously, the DAG can also be modified after an application has been launched. In the following example, we add another console operator to display the lines emitted by the input operator: </p>
<pre class="prettyprint"><code class="language-bash hljs ">Connected to application application_1442901180806_0001
dt (application_1442901180806_0001) &gt; begin-logical-plan-change
logical-plan-change (application_1442901180806_0001) &gt; create-operator linesConsole
logical-plan-change (application_1442901180806_0001) &gt; add-stream-sink lines linesConsole input
logical-plan-change (application_1442901180806_0001) &gt; submit</code></pre>
<p><img src="" alt="Altered WordCount DAG" title=""></p>
<h3 id="translation-of-logical-dag-into-physical-plan">Translation of logical DAG into physical plan</h3>
<p>Users specify the logical DAG. This logical representation is provided to the Apex client that bootstraps an application. When running on a <a href="" target="_blank">YARN</a> cluster, this client launches the StrAM (Streaming Application Master), hands it the logical plan, and exits. StrAM takes over and, as a first task, converts the logical DAG into a physical plan.</p>
<p>To do so, StrAM assigns the operators within the DAG to containers, which will later correspond to actual YARN containers in the execution layer. You can influence many aspects of this translation using (optional) attributes in the logical plan. The physical plan layout determines the performance and scalability of the application, which is why the configuration will typically specify more attributes as the application evolves.</p>
<p>Here are a few examples of attributes:</p>
<ul>
<li>The amount of memory that an operator requires</li>
<li>The operator partitioning</li>
<li>Affinity between operators (aka stream locality)</li>
<li>Windows (sliding, tumbling)</li>
<li>JVM options for a container process</li>
<li>Timeout and interval settings for monitoring</li>
<li>Queue sizes</li>
</ul>
<p><img src="" alt="Physical Plan" title=""></p>
<h3 id="the-physical-plan-works-as-the-blueprint-for-the-execution-layer">The physical plan works as the blueprint for the execution layer</h3>
<p>The physical plan lays the foundation for the execution layer, but because both are distinct, the same physical plan can be mapped to different execution layers. </p>
<p>Apex was designed to run on YARN natively and take full advantage of its features. When executing on YARN, resource scheduling and allocation are the responsibility of the underlying infrastructure. </p>
<p>There is only one other execution layer implementation, and it is meant for development purposes: local mode hosts an entire application within a single JVM. This allows you to do all your work, including functional testing and efficient debugging, within an IDE before packaging the application and taking it to the YARN cluster.</p>
<h3 id="executing-the-physical-plan-on-yarn">Executing the physical plan on YARN</h3>
<p>When running on YARN, each container in the physical plan is mapped to a separate process (also called a container). StrAM requests the containers based on the resource specification prescribed by the physical plan. Once the resource manager allocates a container, StrAM launches the process on the respective node manager. We call these processes Streaming Containers, reflecting their role in carrying the data flow. Once launched, a container initiates the heartbeat protocol, which passes status information about the execution on to StrAM. </p>
<p><img src="" alt="Execution Layer" title=""></p>
<p>Each container provisions a buffer server, the component that enables the pub-sub based inter-process data flow. Once all containers are up and StrAM knows the buffer server endpoints, deployment of operators commences. The deploy instructions (and other control commands) are passed as part of the heartbeat response from StrAM to the streaming containers. There is no further scheduling- or provisioning-related activity unless a process fails or an operator is undeployed due to dynamic changes in the physical plan. </p>
<h3 id="deployment-of-operators">Deployment of operators</h3>
<p>The container, upon receiving the operator deployment request from StrAM, will bring the operator to life from its frozen state (the initial checkpoint). It will create a separate thread for each operator, in which all user code will run (except, of course, where operators share a thread because the user defined a stream as <code>THREAD_LOCAL</code>). User code comprises all the callbacks defined in the <code>Operator</code> interface. Because each operator runs on its own thread, user code does not need to deal with thread synchronization. That makes operators easier to develop and typically more efficient to run, since the heavy lifting is left to the engine and synchronization overhead is avoided. </p>
<p>The very first thing after operator instantiation is the (one-time) call to its <code>setup</code> method, which gives the operator the opportunity to initialize state that is required for its processing prior to connecting the streams. There is also an optional interface, <code>ActivationListener</code>, whose <code>activate</code> method will be called after the operator is wired and just before window processing starts.</p>
<p>Now the operator is ready to process the data, framed in streaming windows. The engine will call <code>beginWindow</code>, then <code>process</code> on the respective input port(s) for every data tuple, and finally <code>endWindow</code>. This will repeat until either something catastrophic happens or StrAM requests an operator undeploy due to dynamic plan changes. It is clear at this point that this lifecycle minimizes the scheduling and bootstrapping expense: it is a one-time cost.</p>
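<p>The callback order described above can be sketched in plain Java. This is a toy stand-in, not the real Apex interfaces: <code>ToyOperator</code>, <code>RecordingOperator</code> and <code>ToyLifecycle</code> are hypothetical names, tuples are plain strings, and the driver runs everything on the calling thread instead of a dedicated operator thread.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy stand-in for the callbacks described above; not the real Apex interfaces. */
interface ToyOperator {
    void setup();
    void activate();
    void beginWindow(long windowId);
    void process(String tuple);
    void endWindow();
}

/** Records every callback so the invocation order can be inspected. */
class RecordingOperator implements ToyOperator {
    final List<String> calls = new ArrayList<>();

    public void setup() { calls.add("setup"); }
    public void activate() { calls.add("activate"); }
    public void beginWindow(long windowId) { calls.add("beginWindow " + windowId); }
    public void process(String tuple) { calls.add("process " + tuple); }
    public void endWindow() { calls.add("endWindow"); }
}

class ToyLifecycle {
    /** Drives one operator; each inner list holds the tuples of one streaming window. */
    static void run(ToyOperator op, List<List<String>> windows) {
        op.setup();    // one-time initialization, before streams are connected
        op.activate(); // after wiring, just before the first window
        long windowId = 0;
        for (List<String> window : windows) {
            op.beginWindow(windowId++);
            for (String tuple : window) {
                op.process(tuple); // called once per data tuple
            }
            op.endWindow();
        }
    }

    public static void main(String[] args) {
        RecordingOperator op = new RecordingOperator();
        run(op, Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c")));
        // Print the callbacks in the order they were invoked.
        op.calls.forEach(System.out::println);
    }
}
```

<p>Note that <code>setup</code> and <code>activate</code> appear exactly once in the recorded sequence no matter how many windows follow, which is the one-time cost mentioned above.</p>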
<p>There are a few other things that happen between invocations of the user code, demarcated by windows. For example, checkpoints are periodically taken (every 30s by default, tunable by the user). There are also optional callbacks defined by <code>CheckpointListener</code> that can be used to implement synchronization with external systems (think database transactions or copy of finalized files, for example).</p>
<h3 id="monitoring-the-execution">Monitoring the execution</h3>
<p>Once the containers are fully provisioned, StrAM records the periodic heartbeat updates and watches operator processing as data flows through the pipeline. StrAM does not contribute to the data flow itself; processing is decentralized and asynchronous. StrAM collects the stats from the heartbeats and uses them to provide the central view of the execution. For example, it calculates latency based on the window timestamps that are reported, which is vital in identifying processing bottlenecks. It also uses the window information to monitor the progress of operators and identify operators that are stuck (and, when necessary, restarts them, with an interval controllable by the user). StrAM also hosts a REST API that clients such as the CLI can use to collect data. Here is an example of the information that can be obtained through this API:</p>
<pre class="prettyprint"><code class="language-json hljs ">{
  "id": "3",
  "name": "counter",
  "className": "com.datatorrent.lib.algo.UniqueCounter",
  "container": "container_1443668714920_0001_01_000003",
  "host": "localhost:8052",
  "totalTuplesProcessed": "198",
  "totalTuplesEmitted": "1",
  "tuplesProcessedPSMA": "0",
  "tuplesEmittedPSMA": "0",
  "cpuPercentageMA": "1.5208279931258353",
  "latencyMA": "10",
  "status": "ACTIVE",
  "lastHeartbeat": "1443670671506",
  "failureCount": "0",
  "recoveryWindowId": "6200516265145009027",
  "currentWindowId": "6200516265145009085",
  "ports": [
    {
      "name": "data",
      "type": "input",
      "totalTuples": "198",
      "tuplesPSMA": "0",
      "bufferServerBytesPSMA": "16",
      "queueSizeMA": "1",
      "recordingId": null
    },
    {
      "name": "count",
      "type": "output",
      "totalTuples": "1",
      "tuplesPSMA": "0",
      "bufferServerBytesPSMA": "12",
      "queueSizeMA": "0",
      "recordingId": null
    }
  ],
  "unifierClass": null,
  "logicalName": "counter",
  "recordingId": null,
  "counters": null,
  "metrics": {},
  "checkpointStartTime": "1443670642472",
  "checkpointTime": "42",
  "checkpointTimeMA": "129"
}</code></pre>
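<p>As a rough illustration of how window information from heartbeats can be used to spot a stuck operator, here is a toy monitor in plain Java. It is a sketch under stated assumptions, not StrAM's actual logic: <code>ToyProgressMonitor</code> and its methods are hypothetical, and the rule "no window progress within a timeout means stuck" is a simplification.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Toy progress monitor; illustrative only, not how StrAM is implemented. */
class ToyProgressMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastWindowId = new HashMap<>();
    private final Map<String, Long> lastProgressAt = new HashMap<>();

    ToyProgressMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records one heartbeat: the operator's current window ID and when it arrived. */
    void heartbeat(String operatorId, long currentWindowId, long nowMillis) {
        Long previous = lastWindowId.get(operatorId);
        // Only a higher window ID counts as progress; repeats leave the timer running.
        if (previous == null || currentWindowId > previous) {
            lastWindowId.put(operatorId, currentWindowId);
            lastProgressAt.put(operatorId, nowMillis);
        }
    }

    /** An operator counts as stuck when its window ID has not advanced within the timeout. */
    boolean isStuck(String operatorId, long nowMillis) {
        Long progressedAt = lastProgressAt.get(operatorId);
        return progressedAt != null && nowMillis - progressedAt > timeoutMillis;
    }

    public static void main(String[] args) {
        ToyProgressMonitor monitor = new ToyProgressMonitor(30_000L);
        // Values taken from the currentWindowId and lastHeartbeat fields shown above.
        monitor.heartbeat("3", 6200516265145009085L, 1443670671506L);
        System.out.println(monitor.isStuck("3", 1443670671506L + 60_000L));
    }
}
```

<p>The 30-second timeout in <code>main</code> is an arbitrary value picked for the example.</p>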
<p>This blog covered the lifecycle of a DAG. Future posts will cover the inside view of the Apex engine, including checkpointing, processing semantics, partitioning and more. Watch this space! </p>
<p>The post <a rel="nofollow" href="">Tracing DAGs from specification to execution</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<title>Meet &amp; Name the Apache Apex Logo</title>
<pubDate>Fri, 25 Sep 2015 15:02:32 +0000</pubDate>
<dc:creator><![CDATA[John Fanelli]]></dc:creator>
<category><![CDATA[Big Data in Everyday Life]]></category>
<category><![CDATA[Apache Apex]]></category>
<category><![CDATA[Big Data]]></category>
<category><![CDATA[Fast Batch]]></category>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>Apache Apex, the open source, enterprise-grade unified stream and fast batch processing engine, is gathering momentum staggeringly fast. The timeline has been aggressive: Project Apex was announced on June 5, code was dropped on GitHub July 30 and an Apache Incubation proposal was posted on August 12 and accepted on August 17. Apache Apex has [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Meet &#038; Name the Apache Apex Logo</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p>Apache Apex, the open source, enterprise-grade unified stream and fast batch processing engine, is gathering momentum staggeringly fast.</p>
<p>The timeline has been aggressive: Project Apex <a href="">was announced</a> on June 5, code was dropped on <a href="">GitHub</a> July 30 and an <a href="">Apache Incubation proposal</a> was posted on August 12 and <a href="">accepted</a> on August 17.</p>
<p>Apache Apex has hit some great milestones already. We are just past the one-month anniversary, and Apache Apex has already <a href="">been named one of the best open source big data tools</a> of 2015 by InfoWorld and hosted its <a href="">first meetup</a>, while <a href="">@ApacheApex</a> is quickly gaining Twitter followers.</p>
<p><strong>What’s in a logo?<br />
</strong>Today, we are pleased to introduce the Apache Apex logo. Meet, uhm, actually, he or she doesn&#8217;t have a name yet, and we need your help here! Well, let me describe what (he or she) represents first.</p>
<p>Apache Apex has a lofty goal to be at the top of its game or at the “Apex” of stream and batch processing pipeline engines. As such, you will see a mountain peak that reflects our aspiration to always be the best, and at the peak of our industry.</p>
<p>As a YARN native application, Apache Apex not only runs on, but is also optimized for Hadoop deployments. In a nod to that design center, the logo acknowledges the Hadoop foundation we have built on with feet similar to Hadoop’s logo.</p>
<p>Finally, open <a href="">source in general</a> and Hadoop <a href="">ecosystems</a> in particular are inundated with adorable animal logos. Many people like <a href="">monkees</a>, but if you look closely, you will see that Apache Apex is also Ape-x. So you will also see a cute ape in the logo.</p>
<p>You put all of this together and you get … Ta da….</p>
<p style="text-align: center;"><a href=""><img class="alignnone size-medium wp-image-2114" src="" alt="Apex_550pix(PNG24)" width="300" height="300" /></a></p>
<p><strong>We need your help!<br />
</strong>I discovered four paragraphs ago that our logo doesn’t have a name and that’s where the community comes in. Be a hero to our newest mascot and submit your ideas for his or her name <a href="">here.</a></p>
<p>Need Inspiration? Some of the ideas so far include Himalaya (aka Himmy, the world’s tallest mountain range), Ape-X (a mysterious, yet redundant Ape as it would be Ape-X, the Apache Apex Ape – that’s a lot of Ape to type!) and Ralph (why? … why not?!). As you can see, <a href="">we need your help!</a></p>
<p><strong>Meet us at Strata + Hadoop World to learn more and get an Apex sticker!<br />
</strong>Don’t take our word for it: stop by the DataTorrent Booth (Booth 935) to share your ideas, learn more about Apache Apex and even score a sticker or mini LEGO figure!</p>
<p>The post <a rel="nofollow" href="">Meet &#038; Name the Apache Apex Logo</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<title>Building Applications with Apache Apex and Malhar</title>
<pubDate>Wed, 16 Sep 2015 16:34:03 +0000</pubDate>
<dc:creator><![CDATA[Munagala Ramanath]]></dc:creator>
<guid isPermaLink="false"></guid>
<description><![CDATA[<p>Now that the Apache apex-core and apex-malhar projects are incubating, I thought this would be a good time to provide a simple introduction to developing applications using these Open Source projects. This document introduces the reader to the streaming platform Apache Apex and a large associated library of operators (Malhar) and associated tools. All of [&#8230;]</p>
<p>The post <a rel="nofollow" href="">Building Applications with Apache Apex and Malhar</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>
<content:encoded><![CDATA[<p>Now that the Apache <strong>apex-core</strong> and <strong>apex-malhar</strong> projects are incubating, I thought this would be a good time to provide a simple introduction to developing applications using these Open Source projects.</p>
<p>This document introduces the reader to the streaming platform Apache Apex and a large associated library of operators (<strong>Malhar</strong>) and associated tools. All of these are open source and freely downloadable from <strong><a href="" target="_blank">GitHub</a></strong>. It is targeted at Java developers using the <strong>Linux</strong> operating system, so the reader is expected to have basic command line skills, have some familiarity with the Java programming language, and have at least a basic understanding of the benefits and pitfalls of distributed computing. Acquaintance with a Java IDE (Integrated Development Environment) such as <em>Eclipse</em>, <em>NetBeans</em> or <em>IntelliJ</em>, the <em>Maven</em> build tool, and the <em>Git</em> revision control system will be useful though not required. Internet connectivity, however, is required since we will need to download Apex and Malhar code.</p>
<p>There are, of course, other ways of approaching the problem such as using the binary releases or the virtual machine sandbox available on our website. However, I felt that diving directly into the code base would be more appealing to a (significant) segment of developers.</p>
<p>The following sections will walk the reader through the various steps to create, compile and run a couple of simple applications. Detailed documents on various aspects of the platform (such as the architecture, basic streaming concepts, etc.) and operator library are provided in the <em>Additional Resources</em> section at the end.</p>
<p>To begin with, make sure that you have the Java JDK (version 1.7.0_79 or later), Maven (version 3.0.5 or later) and Git (version 1.7.1 or later) installed by running the following commands:</p>
<table>
<tbody>
<tr>
<td width="99"><strong>Command</strong></td>
<td width="400"><strong>Expected output</strong></td>
</tr>
<tr>
<td width="99">
<pre><strong>java -version</strong></pre>
</td>
<td width="400">
<pre>java version "1.7.0_79"</pre>
<pre>Java(TM) SE Runtime Environment (build 1.7.0_79-b15)</pre>
<pre>Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)</pre>
</td>
</tr>
<tr>
<td width="99">
<pre><strong>mvn --version</strong></pre>
</td>
<td width="400">
<pre>Apache Maven 3.0.5</pre>
<pre>Maven home: /usr/share/maven</pre>
<pre>Java version: 1.7.0_79, vendor: Oracle Corporation</pre>
<pre>Java home: /home/&lt;user&gt;/Software/java/jdk1.7.0_79/jre</pre>
<pre>Default locale: en_US, platform encoding: UTF-8</pre>
<pre>OS name: "linux", version: "3.16.0-44-generic", arch: "amd64", family: "unix"</pre>
</td>
</tr>
<tr>
<td width="99">
<pre><strong>git --version</strong></pre>
</td>
<td width="400">
<pre>git version 1.7.1</pre>
</td>
</tr>
</tbody>
</table>
<p>Next, create a new directory where you want the code to reside, for example:</p>
<pre><strong>    cd ~/src; mkdir dt; cd dt</strong></pre>
<p>Now, download the code for Apex and Malhar using git (all the commands below assume that you are running them from this newly created directory):</p>
<pre><strong>    git clone</strong></pre>
<pre><strong>    git clone</strong></pre>
<p>You should now see two directories named <code><strong>incubator-apex-core</strong></code> and <code><strong>incubator-apex-malhar</strong></code>. Step into each, switch to the <strong>release-3.1</strong> branch and build it (make sure you build <code><strong>incubator-apex-core</strong></code> first):</p>
<pre><strong>    pushd incubator-apex-core; git checkout release-3.1; mvn clean install -DskipTests; popd</strong></pre>
<pre><strong>    pushd incubator-apex-malhar; git checkout release-3.1; mvn clean install -DskipTests; popd</strong></pre>
<p>If you prefer, you can omit the semicolons and type each command on a line by itself. The “<code><strong>install</strong></code>” argument to the <code><strong>mvn</strong></code> command will cause resources from each project to be installed to your local maven repository (typically <code><strong>~/.m2/repository</strong></code>); nothing is installed to your system directories so you do not need root (superuser) privileges to run this command. The “<code><strong>-DskipTests</strong></code>” argument skips running unit tests since they can take a long time.</p>
<p>At the time of writing, 3.1 was the latest release so we use the <strong>release-3.1</strong> branch; if a later release is available, use a suitably modified branch name.</p>
<p>If this is your first build of these components, it can take several minutes to complete since Maven will download a variety of plugins needed for the build; later builds should go much faster.</p>
<h1>Running pre-built applications</h1>
<p>There are a number of pre-built applications that are bundled with Malhar, each in its own subdirectory under <code><strong>incubator-apex-malhar</strong>/demos</code>; some of these are standalone and some require external resources (such as a Twitter feed for the twitter demo). The build creates a file with the <strong>.apa</strong> extension (called an “<em>application package</em>”) for each in the corresponding target subdirectory; such packages may contain multiple applications.</p>
<p>We now discuss the command-line tool called <strong>dtcli</strong> which is bundled in core. Define the following alias for convenience: <code>alias dtcli=`pwd`/incubator-apex-core/engine/src/main/scripts/dtcli</code></p>
<p>You can now start it with:</p>
<pre><strong>    dtcli</strong></pre>
<p>It may produce a couple of warnings about log4j not being initialized; these are harmless and may be ignored. You should see the “<code><strong>dt&gt;</strong></code>” prompt. Type <strong>help</strong> to see a list of available commands and a brief description for each. You can exit this application by typing <code><strong>CTRL-D</strong></code> or <code>exit</code>. For any command you enter, the left/right arrow keys move the cursor within the command to edit it, and the up/down arrow keys cycle through the command history, just like a shell.</p>
<p>For brevity, we use these variables in the commands that follow (they can be entered verbatim at the “<code>dt&gt;</code>” prompt):</p>
<p>Now, run a couple of the standalone apps as follows; in each case you can hit <code>CTRL-C</code> to interrupt the program and return to the prompt:</p>
<table style="height: 249px" width="968">
<tbody>
<tr>
<td width="104"><strong>Command</strong></td>
<td width="395"><strong>Description</strong></td>
</tr>
<tr>
<td width="104"><code>launch -local $wc</code></td>
<td width="395">Prints a list of unique words encountered in each window of the input stream along with a frequency count for each. The input for this application comes from the file <code>demos/wordcount/src/main/resources/samplefile.txt</code></td>
</tr>
<tr>
<td width="104"><code>launch -local $uc</code></td>
<td width="395">Shows you a list of two applications present in this jar; choose “<strong>1</strong>”, which corresponds to the application named <code><strong>UniqueKeyValueCountDemo</strong></code>. It should then print a list of unique key-value pairs and a frequency count for each. The input for this application is randomly generated.</td>
</tr>
</tbody>
</table>
<p>The “<code><strong>-local</strong></code>” option requests the app to be run in <em>local mode</em> – meaning without a cluster. Running it in a Hadoop cluster is a more advanced topic and will be covered in later blogs.</p>
<p>There are a number of additional pre-built applications under the demos directory:</p>
<table>
<tbody>
<tr>
<td width="104"><strong>Application</strong></td>
<td width="395"><strong>Description</strong></td>
</tr>
<tr>
<td width="104"><code>echoserver</code></td>
<td width="395">Reads messages from a network connection and echoes them back out</td>
</tr>
<tr>
<td width="104"><code>frauddetect</code></td>
<td width="395">Analyzes a stream of credit card merchant transactions</td>
</tr>
<tr>
<td width="104"><code>machinedata</code></td>
<td width="395">Analyzes a synthetic stream of events to determine health of a machine</td>
</tr>
<tr>
<td width="104"><code>mobile</code></td>
<td width="395">Demonstrates scalability of the DataTorrent platform by analyzing a synthetic stream of location data for a large number of cell phone connections.</td>
</tr>
<tr>
<td width="104"><code>mroperator</code></td>
<td width="395">Contains several <em>map-reduce</em> applications.</td>
</tr>
<tr>
<td width="104"><code>pi</code></td>
<td width="395">Simple application that computes the value of pi (π) using Monte Carlo methods.</td>
</tr>
<tr>
<td width="104"><code>r</code></td>
<td width="395">Analyzes a synthetic stream of eruption event data for the Old Faithful geyser (</td>
</tr>
<tr>
<td width="104"><code>twitter</code></td>
<td width="395">Analyzes live tweet data from a Twitter stream; needs valid Twitter credentials.</td>
</tr>
<tr>
<td width="104"><code>yahoofinance</code></td>
<td width="395">Analyzes live stock transaction data</td>
</tr>
</tbody>
</table>
<p>You can now experiment with modifying the source code for these applications or proceed to the next section to build some applications from scratch.</p>
<h1>Your first application</h1>
<p>We will now build a new application from scratch. Put the following lines into a text file named, say, newdt and run the file with <code>bash newdt</code> to generate a new maven project (the command can be typed on multiple lines as shown here with a backslash at the end of each line except the last for better readability, or all on a single line without the backslashes):</p>
<pre># script to create a new project</pre>
<pre>mvn archetype:generate \</pre>
<pre> -DarchetypeRepository= \</pre>
<pre>-DarchetypeGroupId=com.datatorrent \</pre>
<pre>-DarchetypeArtifactId=apex-app-archetype \</pre>
<pre>-DarchetypeVersion=3.1.1 \</pre>
<pre>-DgroupId=com.example \</pre>
<pre>-Dpackage=com.example.myapexapp \</pre>
<pre>-DartifactId=myapexapp \</pre>
<p>The command details are based on the contents of <code>incubator-apex-core/apex-app-archetype/</code> and may need to be adjusted as new versions of core and malhar are released. It will display a summary of these properties and prompt for confirmation with “ <code>Y: :</code>” at which point you can just hit <code>ENTER</code>. It should then create a new project directory named <code>myapexapp</code> containing a basic application that generates an unending sequence of random numbers and prints them out.</p>
<p>Step into the new project directory and run the usual command to build it:</p>
<pre>cd myapexapp; mvn clean package -DskipTests</pre>
<p>If the build is successful, it should have created the application package file:</p>
<p>This package can be run the same way we ran the pre-built apps above; as before, hit <code>CTRL-C</code> to stop the app. It will print a series of lines that look something like this: <code>hello world: 0.30266721560549</code></p>
<p>The app consists of two source files: <code></code> and <code></code>. The former has the main application code that instantiates operators and wires them together in a <em>DAG</em> (Directed Acyclic Graph); the latter is an operator that generates a sequence of random numbers. The DAG uses a second operator, <code>ConsoleOutputOperator</code>, which simply prints input tuples to the console; its source file is under <code>incubator-apex-malhar</code>.</p>
<p>Creating applications is discussed in more detail in the <a href="">Application Developer Guide</a>, and the package format is discussed in <a href="">Application Packages</a> (see Additional Resources section below).</p>
<h1>Operators in Malhar</h1>
<p>In addition to the operators bundled with each demo application, the Malhar library contains a large number of operators to meet a variety of needs. An overview is at <a href="">Malhar Operator Library</a>. You can, of course, write your own operators as well. The following observations may be useful in solidifying your understanding of operators as you go through the code and documentation.</p>
<ul>
<li>Input data units are arbitrary Java objects and are called &#8216;<strong>tuples</strong>&#8217; or &#8216;<strong>events</strong>&#8217; (synonymous).</li>
<li>Operators can, generally speaking, have multiple input and output ports; they receive tuples on input ports and emit tuples on output ports.</li>
<li>An operator may have no ports at all!</li>
<li>Input and output ports are fields defined in an operator; they may optionally be annotated with <code>@InputPortFieldAnnotation</code> or <code>@OutputPortFieldAnnotation</code>.</li>
<li>The specific type of objects entering an input port is always the same; likewise the type of objects leaving an output port. However, the types of objects associated with two different ports can be, and often are, very different.</li>
<li>All input operators implement the <code>InputOperator</code> interface, receive data from an external source (not from another operator) and invoke <code>emitTuples()</code> to write data to one or more output ports. Since they cannot receive data from another operator they have no input ports.</li>
<li>Output operators (there is no <code>OutputOperator</code> interface or base class) receive tuples on input ports, process them and write computed data (typically via the <code>processTuple()</code> method) to external sinks.</li>
<li>The vertices of a DAG are operators; the arcs between them are (parts of) streams.</li>
<li>A single port is connected to a single stream.</li>
<li>A stream is connected to <strong>one</strong> <em>output port of an operator</em> and multiple <em>input ports of other operators</em>; tuples enter the stream through the <em>unique</em> output port and are replicated to all the input ports. So a stream embodies the <em>collection</em> of outbound arcs at a single output port of a vertex (operator).</li>
<li>The output port that is the common initial end-point of all these arcs is called the <strong><em>source</em></strong> of the stream. Similarly, all the input ports at the final end-points of these arcs are called <strong><em>sinks</em></strong>. So a stream has a single source but can have multiple sinks.</li>
<li>A stream is represented by the <code>StreamMeta</code> interface and is created via <code>DAG.addStream()</code>.</li>
<li><code>DAG.StreamMeta</code> has methods <code>setSource()</code> for the output port and <code>addSink()</code> for adding input ports.</li>
<li>Ports are often created by extending <code>DefaultOutputPort</code> and <code>DefaultInputPort</code>.</li>
</ul>
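<p>The single-source, multiple-sink shape of a stream is easy to see in a small plain-Java model. The sketch below is not the Apex <code>StreamMeta</code> API: <code>ToyStream</code> is a hypothetical class that only borrows the method names <code>setSource</code> and <code>addSink</code>, the source is identified by a plain string, and input ports are modeled as <code>Consumer</code> callbacks.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of a stream: exactly one source, any number of sinks. Not the Apex API. */
class ToyStream<T> {
    private String source; // name of the single output port feeding this stream
    private final List<Consumer<T>> sinks = new ArrayList<>();

    /** A stream has a single source, so a second call is rejected. */
    ToyStream<T> setSource(String outputPortName) {
        if (source != null) {
            throw new IllegalStateException("stream already has source " + source);
        }
        source = outputPortName;
        return this;
    }

    /** Adds one more input port (modeled as a callback) to the stream. */
    ToyStream<T> addSink(Consumer<T> inputPort) {
        sinks.add(inputPort);
        return this;
    }

    /** Called on behalf of the source: every tuple is replicated to all sinks. */
    void emit(T tuple) {
        if (source == null) {
            throw new IllegalStateException("stream has no source");
        }
        for (Consumer<T> sink : sinks) {
            sink.accept(tuple);
        }
    }

    public static void main(String[] args) {
        List<Double> console = new ArrayList<>();
        List<Double> archive = new ArrayList<>();
        ToyStream<Double> randomData = new ToyStream<Double>()
                .setSource("randomGenerator.out")
                .addSink(console::add)
                .addSink(archive::add);
        randomData.emit(Math.random());
        // Both sinks hold the same single tuple.
        System.out.println(console.equals(archive));
    }
}
```

<p>The port and sink names in <code>main</code> are invented for the example; in a real application the ports would be fields of operators added to the DAG.</p>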
<h1>Additional Resources</h1>
<ul>
<li><strong>Application Developer Guide</strong> <a href=""></a></li>
<li><strong>Application Packages</strong> <a href=""></a></li>
<li><strong>Operator Developer Guide</strong> <a href=""></a></li>
<li><strong>Malhar Operator Library</strong> <a href=""></a></li>
</ul>
<p>The post <a rel="nofollow" href="">Building Applications with Apache Apex and Malhar</a> appeared first on <a rel="nofollow" href="">DataTorrent</a>.</p>