<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Flink Blog Feed</title>
<description>Flink Blog</description>
<link>http://flink.apache.org/blog</link>
<atom:link href="http://flink.apache.org/blog/feed.xml" rel="self" type="application/rss+xml" />
<item>
<title>Announcing Apache Flink 0.9.0</title>
<description>&lt;p&gt;The Apache Flink community is pleased to announce the availability of the 0.9.0 release. The release is the result of many months of hard work within the Flink community. It contains many new features and improvements which were previewed in the 0.9.0-milestone1 release and have been polished since then. This is the largest Flink release so far.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://flink.apache.org/downloads.html&quot;&gt;Download the release&lt;/a&gt; and check out &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/&quot;&gt;the documentation&lt;/a&gt;. Feedback through the Flink&lt;a href=&quot;http://flink.apache.org/community.html#mailing-lists&quot;&gt; mailing lists&lt;/a&gt; is, as always, very welcome!&lt;/p&gt;
&lt;h2 id=&quot;new-features&quot;&gt;New Features&lt;/h2&gt;
&lt;h3 id=&quot;exactly-once-fault-tolerance-for-streaming-programs&quot;&gt;Exactly-once Fault Tolerance for streaming programs&lt;/h3&gt;
&lt;p&gt;This release introduces a new fault tolerance mechanism for streaming dataflows. The new checkpointing algorithm takes data sources and also user-defined state into account and recovers failures such that all records are reflected exactly once in the operator states.&lt;/p&gt;
&lt;p&gt;The checkpointing algorithm is lightweight and driven by barriers that are periodically injected into the data streams at the sources. As such, it has an extremely low coordination overhead and is able to sustain very high throughput rates. User-defined state can be automatically backed up to configurable storage by the fault tolerance mechanism.&lt;/p&gt;
&lt;p&gt;Please refer to &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/streaming_guide.html#stateful-computation&quot;&gt;the documentation on stateful computation&lt;/a&gt; for details on how to use fault-tolerant data streams with Flink.&lt;/p&gt;
&lt;p&gt;The fault tolerance mechanism requires data sources that can replay recent parts of the stream, such as &lt;a href=&quot;http://kafka.apache.org&quot;&gt;Apache Kafka&lt;/a&gt;. Read more &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/streaming_guide.html#apache-kafka&quot;&gt;about how to use the persistent Kafka source&lt;/a&gt;.&lt;/p&gt;
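&lt;p&gt;As a rough illustration (a sketch, not code from the release itself), the following minimal Java program switches checkpointing on for a streaming job with the 0.9 DataStream API. It uses a socket source on &lt;code&gt;localhost:9999&lt;/code&gt; as a stand-in for a replayable source such as the Kafka source; host, port, checkpoint interval, and job name are placeholders:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // inject checkpoint barriers into the sources every 5 seconds
    env.enableCheckpointing(5000);

    env.socketTextStream(&quot;localhost&quot;, 9999)   // placeholder for a replayable source
       .map(new MapFunction&amp;lt;String, String&amp;gt;() {
         @Override
         public String map(String line) {
           return line.toLowerCase();
         }
       })
       .print();

    env.execute(&quot;Checkpointed streaming job&quot;);
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;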
&lt;h3 id=&quot;table-api&quot;&gt;Table API&lt;/h3&gt;
&lt;p&gt;Flink’s new Table API offers a higher-level abstraction for interacting with structured data sources. The Table API allows users to execute logical, SQL-like queries on distributed data sets while allowing them to freely mix declarative queries with regular Flink operators. Here is an example that groups and joins two tables:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clickCounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clicks&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activeUsers&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clickCounts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;username&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Tables consist of logical attributes that can be selected by name rather than physical Java and Scala data types. This alleviates a lot of boilerplate code for common ETL tasks and raises the abstraction for Flink programs. Tables are available for both static and streaming data sources (DataSet and DataStream APIs).&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/libs/table.html&quot;&gt;Check out the Table guide for Java and Scala&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;gelly-graph-processing-api&quot;&gt;Gelly Graph Processing API&lt;/h3&gt;
&lt;p&gt;Gelly is a Java Graph API for Flink. It contains a set of utilities for graph analysis, support for iterative graph processing and a library of graph algorithms. Gelly exposes a Graph data structure that wraps DataSets for vertices and edges, as well as methods for creating graphs from DataSets, graph transformations and utilities (e.g., in- and out- degrees of vertices), neighborhood aggregations, iterative vertex-centric graph processing, as well as a library of common graph algorithms, including PageRank, SSSP, label propagation, and community detection.&lt;/p&gt;
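&lt;p&gt;To give a flavor of the API, here is a small, hedged Java sketch (the vertex and edge data are made up for illustration, and the method names follow the 0.9 Gelly guide) that builds a toy graph from DataSets and computes vertex out-degrees:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.graph.Vertex;

public class GellyToyGraph {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // toy data: two vertices connected by a single weighted edge
    DataSet&amp;lt;Vertex&amp;lt;Long, String&amp;gt;&amp;gt; vertices = env.fromElements(
        new Vertex&amp;lt;Long, String&amp;gt;(1L, &quot;alice&quot;),
        new Vertex&amp;lt;Long, String&amp;gt;(2L, &quot;bob&quot;));
    DataSet&amp;lt;Edge&amp;lt;Long, Double&amp;gt;&amp;gt; edges = env.fromElements(
        new Edge&amp;lt;Long, Double&amp;gt;(1L, 2L, 1.0));

    Graph&amp;lt;Long, String, Double&amp;gt; graph = Graph.fromDataSet(vertices, edges, env);

    // one of the neighborhood utilities: the out-degree of every vertex
    DataSet&amp;lt;Tuple2&amp;lt;Long, Long&amp;gt;&amp;gt; outDegrees = graph.outDegrees();
    outDegrees.print();
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;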
&lt;p&gt;Gelly internally builds on top of Flink’s &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/iterations.html&quot;&gt;delta iterations&lt;/a&gt;. Iterative graph algorithms are executed leveraging mutable state, achieving performance similar to that of specialized graph processing systems.&lt;/p&gt;
&lt;p&gt;Gelly will eventually subsume Spargel, Flink’s Pregel-like API.&lt;/p&gt;
&lt;p&gt;Note: The Gelly library is still in beta status and subject to improvements and heavy performance tuning.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/libs/gelly_guide.html&quot;&gt;Check out the Gelly guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;flink-machine-learning-library&quot;&gt;Flink Machine Learning Library&lt;/h3&gt;
&lt;p&gt;This release includes the first version of Flink’s Machine Learning library. The library’s pipeline approach, which has been strongly inspired by scikit-learn’s abstraction of transformers and predictors, makes it easy to quickly set up a data processing pipeline and to get your job done.&lt;/p&gt;
&lt;p&gt;Flink distinguishes between transformers and predictors. Transformers are components which transform your input data into a new format, allowing you to extract features, cleanse your data, or sample from it. Predictors, on the other hand, are the components which take your input data and train a model on it. The model you obtain from the learner can then be evaluated and used to make predictions on unseen data.&lt;/p&gt;
&lt;p&gt;Currently, the machine learning library contains transformers and predictors for multiple tasks. The library supports multiple linear regression using stochastic gradient descent to scale to large data sizes. Furthermore, it includes an alternating least squares (ALS) implementation to factorize large matrices. The matrix factorization can be used to do collaborative filtering. An implementation of the communication-efficient distributed dual coordinate ascent (CoCoA) algorithm is the latest addition to the library. The CoCoA algorithm can be used to train distributed soft-margin SVMs.&lt;/p&gt;
&lt;p&gt;Note: The ML library is still in beta status and subject to improvements and heavy performance tuning.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/libs/ml/&quot;&gt;Check out FlinkML&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;flink-on-yarn-leveraging-apache-tez&quot;&gt;Flink on YARN leveraging Apache Tez&lt;/h3&gt;
&lt;p&gt;We are introducing a new execution mode for Flink to be able to run restricted Flink programs on top of&lt;a href=&quot;http://tez.apache.org&quot;&gt; Apache Tez&lt;/a&gt;. This mode retains Flink’s APIs, optimizer, as well as Flink’s runtime operators, but instead of wrapping those in Flink tasks that are executed by Flink TaskManagers, it wraps them in Tez runtime tasks and builds a Tez DAG that represents the program.&lt;/p&gt;
&lt;p&gt;By using Flink on Tez, users have an additional choice for an execution platform for Flink programs. While Flink’s distributed runtime favors low latency, streaming shuffles, and iterative algorithms, Tez focuses on scalability and elastic resource usage in shared YARN clusters.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/flink_on_tez.html&quot;&gt;Get started with Flink on Tez&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;reworked-distributed-runtime-on-akka&quot;&gt;Reworked Distributed Runtime on Akka&lt;/h3&gt;
&lt;p&gt;Flink’s RPC system has been replaced by the widely adopted &lt;a href=&quot;http://akka.io&quot;&gt;Akka&lt;/a&gt; framework. Akka’s concurrency model offers the right abstraction to develop a fast as well as robust distributed system. By using Akka’s own failure detection mechanism, the stability of Flink’s runtime is significantly improved, because the system can now react properly to node outages. Furthermore, Akka improves Flink’s scalability by introducing asynchronous messages to the system. These asynchronous messages allow Flink to be run on many more nodes than before.&lt;/p&gt;
&lt;h3 id=&quot;improved-yarn-support&quot;&gt;Improved YARN support&lt;/h3&gt;
&lt;p&gt;Flink’s YARN client contains several improvements, such as a detached mode for starting a YARN session in the background and the ability to submit a single Flink job to a YARN cluster without starting a session, including a “fire and forget” mode. Flink is now also able to reallocate failed YARN containers to maintain the size of the requested cluster. This feature makes it possible to implement fault-tolerant setups on top of YARN. There is also an internal Java API to deploy and control a running YARN cluster. This is being used by system integrators to easily control Flink on YARN within their Hadoop 2 clusters.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/setup/yarn_setup.html&quot;&gt;See the YARN docs&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;static-code-analysis-for-the-flink-optimizer-opening-the-udf-blackboxes&quot;&gt;Static Code Analysis for the Flink Optimizer: Opening the UDF blackboxes&lt;/h3&gt;
&lt;p&gt;This release introduces a first version of a static code analyzer that pre-interprets functions written by the user to get information about the function’s internal dataflow. The code analyzer can provide useful information about &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.9/apis/programming_guide.html#semantic-annotations&quot;&gt;forwarded fields&lt;/a&gt; to Flink’s optimizer and thus speed up job executions. It also reports obvious mistakes in the code. For stability reasons, the code analyzer is disabled by default. It can be activated through&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ExecutionEnvironment.getExecutionConfig().setCodeAnalysisMode(…)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;either as an assistant that gives hints during the implementation or by directly applying the optimizations that have been found.&lt;/p&gt;
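&lt;p&gt;For example (a minimal sketch, assuming the &lt;code&gt;CodeAnalysisMode&lt;/code&gt; enum with &lt;code&gt;HINT&lt;/code&gt; and &lt;code&gt;OPTIMIZE&lt;/code&gt; values and the &lt;code&gt;ExecutionConfig&lt;/code&gt; accessor described in the 0.9 docs), the hint mode can be enabled like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.api.common.CodeAnalysisMode;
import org.apache.flink.api.java.ExecutionEnvironment;

public class CodeAnalysisExample {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // HINT only reports what the analyzer finds; OPTIMIZE applies the derived annotations
    env.getConfig().setCodeAnalysisMode(CodeAnalysisMode.HINT);

    // ... define and execute the program as usual ...
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;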
&lt;h2 id=&quot;more-improvements-and-fixes&quot;&gt;More Improvements and Fixes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1605&quot;&gt;FLINK-1605&lt;/a&gt;: Flink no longer exposes its Guava and ASM dependencies to Maven projects depending on Flink. We use the maven-shade-plugin to relocate these dependencies into our own namespace. This allows users to use any Guava or ASM version.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1417&quot;&gt;FLINK-1417&lt;/a&gt;: Automatic recognition and registration of Java types with Kryo and the internal serializers: Flink has its own type handling and serialization framework, falling back to Kryo for types that it cannot handle. To get the best performance, Flink automatically registers with Kryo all types a user is using in their program. Flink also registers serializers for Protocol Buffers, Thrift, Avro and Joda-Time automatically. Users can also manually register serializers with Kryo (see &lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1399&quot;&gt;FLINK-1399&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1296&quot;&gt;FLINK-1296&lt;/a&gt;: Add support for sorting very large records&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1679&quot;&gt;FLINK-1679&lt;/a&gt;: “degreeOfParallelism” methods renamed to “parallelism”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1501&quot;&gt;FLINK-1501&lt;/a&gt;: Add metrics library for monitoring TaskManagers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1760&quot;&gt;FLINK-1760&lt;/a&gt;: Add support for building Flink with Scala 2.11&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1648&quot;&gt;FLINK-1648&lt;/a&gt;: Add a mode where the system automatically sets the parallelism to the available task slots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1622&quot;&gt;FLINK-1622&lt;/a&gt;: Add groupCombine operator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1589&quot;&gt;FLINK-1589&lt;/a&gt;: Add option to pass Configuration to LocalExecutor&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1504&quot;&gt;FLINK-1504&lt;/a&gt;: Add support for accessing secured HDFS clusters in standalone mode&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1478&quot;&gt;FLINK-1478&lt;/a&gt;: Add strictly local input split assignment&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1512&quot;&gt;FLINK-1512&lt;/a&gt;: Add CsvReader for reading into POJOs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1461&quot;&gt;FLINK-1461&lt;/a&gt;: Add sortPartition operator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1450&quot;&gt;FLINK-1450&lt;/a&gt;: Add Fold operator to the Streaming api&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1389&quot;&gt;FLINK-1389&lt;/a&gt;: Allow setting custom file extensions for files created by the FileOutputFormat&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1236&quot;&gt;FLINK-1236&lt;/a&gt;: Add support for localization of Hadoop Input Splits&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1179&quot;&gt;FLINK-1179&lt;/a&gt;: Add button to JobManager web interface to request stack trace of a TaskManager&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1105&quot;&gt;FLINK-1105&lt;/a&gt;: Add support for locally sorted output&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1688&quot;&gt;FLINK-1688&lt;/a&gt;: Add socket sink&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1436&quot;&gt;FLINK-1436&lt;/a&gt;: Improve usability of command line interface&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2174&quot;&gt;FLINK-2174&lt;/a&gt;: Allow comments in ‘slaves’ file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1698&quot;&gt;FLINK-1698&lt;/a&gt;: Add polynomial base feature mapper to ML library&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1697&quot;&gt;FLINK-1697&lt;/a&gt;: Add alternating least squares algorithm for matrix factorization to ML library&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1792&quot;&gt;FLINK-1792&lt;/a&gt;: FLINK-456 Improve TM Monitoring: CPU utilization, hide graphs by default and show summary only&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1672&quot;&gt;FLINK-1672&lt;/a&gt;: Refactor task registration/unregistration&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2001&quot;&gt;FLINK-2001&lt;/a&gt;: DistanceMetric cannot be serialized&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1676&quot;&gt;FLINK-1676&lt;/a&gt;: enableForceKryo() is not working as expected&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1959&quot;&gt;FLINK-1959&lt;/a&gt;: Accumulators BROKEN after Partitioning&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1696&quot;&gt;FLINK-1696&lt;/a&gt;: Add multiple linear regression to ML library&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1820&quot;&gt;FLINK-1820&lt;/a&gt;: Bug in DoubleParser and FloatParser - empty String is not casted to 0&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1985&quot;&gt;FLINK-1985&lt;/a&gt;: Streaming does not correctly forward ExecutionConfig to runtime&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1828&quot;&gt;FLINK-1828&lt;/a&gt;: Impossible to output data to an HBase table&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1952&quot;&gt;FLINK-1952&lt;/a&gt;: Cannot run ConnectedComponents example: Could not allocate a slot on instance&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1848&quot;&gt;FLINK-1848&lt;/a&gt;: Paths containing a Windows drive letter cannot be used in FileOutputFormats&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1954&quot;&gt;FLINK-1954&lt;/a&gt;: Task Failures and Error Handling&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2004&quot;&gt;FLINK-2004&lt;/a&gt;: Memory leak in presence of failed checkpoints in KafkaSource&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2132&quot;&gt;FLINK-2132&lt;/a&gt;: Java version parsing is not working for OpenJDK&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2098&quot;&gt;FLINK-2098&lt;/a&gt;: Checkpoint barrier initiation at source is not aligned with snapshotting&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2069&quot;&gt;FLINK-2069&lt;/a&gt;: writeAsCSV function in DataStream Scala API creates no file&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2092&quot;&gt;FLINK-2092&lt;/a&gt;: Document (new) behavior of print() and execute()&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2177&quot;&gt;FLINK-2177&lt;/a&gt;: NullPointer in task resource release&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2054&quot;&gt;FLINK-2054&lt;/a&gt;: StreamOperator rework removed copy calls when passing output to a chained operator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2196&quot;&gt;FLINK-2196&lt;/a&gt;: Missplaced Class in flink-java SortPartitionOperator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2191&quot;&gt;FLINK-2191&lt;/a&gt;: Inconsistent use of Closure Cleaner in Streaming API&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2206&quot;&gt;FLINK-2206&lt;/a&gt;: JobManager webinterface shows 5 finished jobs at most&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-2188&quot;&gt;FLINK-2188&lt;/a&gt;: Reading from big HBase Tables&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1781&quot;&gt;FLINK-1781&lt;/a&gt;: Quickstarts broken due to Scala Version Variables&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;notice&quot;&gt;Notice&lt;/h2&gt;
&lt;p&gt;The 0.9 series of Flink is the last version to support Java 6. If you are still using Java 6, please consider upgrading to Java 8 (Java 7 ended its free support in April 2015).&lt;/p&gt;
&lt;p&gt;Flink will require at least Java 7 in major releases after 0.9.0.&lt;/p&gt;
</description>
<pubDate>Wed, 24 Jun 2015 16:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2015/06/24/announcing-apache-flink-0.9.0-release.html</link>
<guid isPermaLink="true">/news/2015/06/24/announcing-apache-flink-0.9.0-release.html</guid>
</item>
<item>
<title>April 2015 in the Flink community</title>
<description>&lt;p&gt;April was an packed month for Apache Flink.&lt;/p&gt;
&lt;h2 id=&quot;flink-090-milestone1-release&quot;&gt;Flink 0.9.0-milestone1 release&lt;/h2&gt;
&lt;p&gt;The highlight of April was of course the availability of &lt;a href=&quot;/news/2015/04/13/release-0.9.0-milestone1.html&quot;&gt;Flink 0.9-milestone1&lt;/a&gt;. This was a release packed with new features, including a Python DataSet API, the new SQL-like Table API, FlinkML (a machine learning library on Flink), Gelly (Flink’s Graph API), as well as a mode to run Flink on YARN leveraging Tez. In case you missed it, check out the &lt;a href=&quot;/news/2015/04/13/release-0.9.0-milestone1.html&quot;&gt;release announcement blog post&lt;/a&gt; for details.&lt;/p&gt;
&lt;h2 id=&quot;conferences-and-meetups&quot;&gt;Conferences and meetups&lt;/h2&gt;
&lt;p&gt;April kicked off the conference season. Apache Flink was presented at ApacheCon in Texas (&lt;a href=&quot;http://www.slideshare.net/fhueske/apache-flink&quot;&gt;slides&lt;/a&gt;), the Hadoop Summit in Brussels featured two talks on Flink (see slides &lt;a href=&quot;http://www.slideshare.net/AljoschaKrettek/data-analysis-with-apache-flink-hadoop-summit-2015&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.slideshare.net/GyulaFra/flink-streaming-hadoopsummit&quot;&gt;here&lt;/a&gt;), and Flink was also presented at the Hadoop User Groups of the Netherlands (&lt;a href=&quot;http://www.slideshare.net/stephanewen1/apache-flink-overview-and-use-cases-at-prehadoop-summit-meetups&quot;&gt;slides&lt;/a&gt;) and Stockholm. The brand new &lt;a href=&quot;http://www.meetup.com/Apache-Flink-Stockholm/&quot;&gt;Apache Flink meetup Stockholm&lt;/a&gt; was also established.&lt;/p&gt;
&lt;h2 id=&quot;google-summer-of-code&quot;&gt;Google Summer of Code&lt;/h2&gt;
&lt;p&gt;Three students will work on Flink during Google’s &lt;a href=&quot;https://www.google-melange.com/gsoc/homepage/google/gsoc2015&quot;&gt;Summer of Code program&lt;/a&gt; on distributed pattern matching, exact and approximate statistics for data streams and windows, as well as asynchronous iterations and updates.&lt;/p&gt;
&lt;h2 id=&quot;flink-on-the-web&quot;&gt;Flink on the web&lt;/h2&gt;
&lt;p&gt;Fabian Hueske gave an &lt;a href=&quot;http://www.infoq.com/news/2015/04/hueske-apache-flink?utm_campaign=infoq_content&amp;amp;utm_source=infoq&amp;amp;utm_medium=feed&amp;amp;utm_term=global&quot;&gt;interview at InfoQ&lt;/a&gt; on Apache Flink.&lt;/p&gt;
&lt;h2 id=&quot;upcoming-events&quot;&gt;Upcoming events&lt;/h2&gt;
&lt;p&gt;Stay tuned for a wealth of upcoming events! Two Flink talks will be presented at &lt;a href=&quot;http://berlinbuzzwords.de/15/sessions&quot;&gt;Berlin Buzzwords&lt;/a&gt;, and Flink will be presented at the &lt;a href=&quot;http://2015.hadoopsummit.org/san-jose/&quot;&gt;Hadoop Summit in San Jose&lt;/a&gt;. A &lt;a href=&quot;http://www.meetup.com/Apache-Flink-Meetup/events/220557545/&quot;&gt;training workshop on Apache Flink&lt;/a&gt; is being organized in Berlin. Finally, &lt;a href=&quot;http://flink-forward.org&quot;&gt;Flink Forward&lt;/a&gt;, the first conference to bring together the whole Flink community, is taking place in Berlin in October 2015.&lt;/p&gt;
</description>
<pubDate>Thu, 14 May 2015 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2015/05/14/Community-update-April.html</link>
<guid isPermaLink="true">/news/2015/05/14/Community-update-April.html</guid>
</item>
<item>
<title>Juggling with Bits and Bytes</title>
<description>&lt;h2 id=&quot;how-apache-flink-operates-on-binary-data&quot;&gt;How Apache Flink operates on binary data&lt;/h2&gt;
&lt;p&gt;Nowadays, a lot of open-source systems for analyzing large data sets are implemented in Java or other JVM-based programming languages. The most well-known example is Apache Hadoop, but newer frameworks such as Apache Spark, Apache Drill, and Apache Flink also run on JVMs. A common challenge that JVM-based data analysis engines face is to store large amounts of data in memory - both for caching and for efficient processing such as sorting and joining of data. Managing the JVM memory well makes the difference between a system that is hard to configure and has unpredictable reliability and performance and a system that behaves robustly with few configuration knobs.&lt;/p&gt;
&lt;p&gt;In this blog post we discuss how Apache Flink manages memory, talk about its custom data de/serialization stack, and show how it operates on binary data.&lt;/p&gt;
&lt;h2 id=&quot;data-objects-lets-put-them-on-the-heap&quot;&gt;Data Objects? Let’s put them on the heap!&lt;/h2&gt;
&lt;p&gt;The most straightforward approach to process lots of data in a JVM is to put it as objects on the heap and operate on these objects. Caching a data set as objects would be as simple as maintaining a list containing an object for each record. An in-memory sort would simply sort the list of objects.
However, this approach has a few notable drawbacks. First of all, it is not trivial to watch and control heap memory usage when a lot of objects are created and invalidated constantly. Memory overallocation instantly kills the JVM with an &lt;code&gt;OutOfMemoryError&lt;/code&gt;. Another aspect is garbage collection on multi-GB JVMs which are flooded with new objects. The overhead of garbage collection in such environments can easily reach 50% and more. Finally, Java objects come with a certain space overhead depending on the JVM and platform. For data sets with many small objects this can significantly reduce the effectively usable amount of memory. Given proficient system design and careful, use-case-specific system parameter tuning, heap memory usage can be more or less controlled and &lt;code&gt;OutOfMemoryErrors&lt;/code&gt; avoided. However, such setups are rather fragile, especially if data characteristics or the execution environment change.&lt;/p&gt;
&lt;h2 id=&quot;what-is-flink-doing-about-that&quot;&gt;What is Flink doing about that?&lt;/h2&gt;
&lt;p&gt;Apache Flink has its roots in a research project which aimed to combine the best technologies of MapReduce-based systems and parallel database systems. Coming from this background, Flink has always had its own way of processing data in-memory. Instead of putting lots of objects on the heap, Flink serializes objects into a fixed number of pre-allocated memory segments. Its DBMS-style sort and join algorithms operate as much as possible on this binary data to keep the de/serialization overhead at a minimum. If more data needs to be processed than can be kept in memory, Flink’s operators partially spill data to disk. In fact, a lot of Flink’s internal implementations look more like C/C++ than like common Java. The following figure gives a high-level overview of how Flink stores data serialized in memory segments and spills to disk if necessary.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/memory-mgmt.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;Flink’s style of active memory management and operating on binary data has several benefits:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Memory-safe execution &amp;amp; efficient out-of-core algorithms.&lt;/strong&gt; Due to the fixed amount of allocated memory segments, it is trivial to monitor remaining memory resources. In case of memory shortage, processing operators can efficiently write larger batches of memory segments to disk and later read them back. Consequently, &lt;code&gt;OutOfMemoryErrors&lt;/code&gt; are effectively prevented.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reduced garbage collection pressure.&lt;/strong&gt; Because all long-lived data is in binary representation in Flink’s managed memory, all data objects are either short-lived or, in the case of mutable objects, reused. Short-lived objects can be more efficiently garbage-collected, which significantly reduces garbage collection pressure. Right now, the pre-allocated memory segments are long-lived objects on the JVM heap, but the Flink community is actively working on allocating off-heap memory for this purpose. This effort will result in much smaller JVM heaps and facilitate even faster garbage collection cycles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Space efficient data representation.&lt;/strong&gt; Java objects have a storage overhead which can be avoided if the data is stored in a binary representation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Efficient binary operations &amp;amp; cache sensitivity.&lt;/strong&gt; Binary data can be efficiently compared and operated on given a suitable binary representation. Furthermore, the binary representations can put related values, as well as hash codes, keys, and pointers, adjacently into memory. This usually gives data structures more cache-efficient access patterns.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These properties of active memory management are very desirable in data processing systems for large-scale data analytics, but have a significant price tag attached. Active memory management and operating on binary data is not trivial to implement, i.e., using &lt;code&gt;java.util.HashMap&lt;/code&gt; is much easier than implementing a spillable hash-table backed by byte arrays and a custom serialization stack. Of course Apache Flink is not the only JVM-based data processing system that operates on serialized binary data. Projects such as &lt;a href=&quot;http://drill.apache.org/&quot;&gt;Apache Drill&lt;/a&gt;, &lt;a href=&quot;http://ignite.incubator.apache.org/&quot;&gt;Apache Ignite (incubating)&lt;/a&gt; or &lt;a href=&quot;http://projectgeode.org/&quot;&gt;Apache Geode (incubating)&lt;/a&gt; apply similar techniques, and it was recently announced that &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; will also evolve in this direction with &lt;a href=&quot;https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html&quot;&gt;Project Tungsten&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the following we discuss in detail how Flink allocates memory, de/serializes objects, and operates on binary data. We will also show some performance numbers comparing processing objects on the heap and operating on binary data.&lt;/p&gt;
&lt;h2 id=&quot;how-does-flink-allocate-memory&quot;&gt;How does Flink allocate memory?&lt;/h2&gt;
&lt;p&gt;A Flink worker, called TaskManager, is composed of several internal components such as an actor system for coordination with the Flink master, an IOManager that takes care of spilling data to disk and reading it back, and a MemoryManager that coordinates memory usage. In the context of this blog post, the MemoryManager is of most interest.&lt;/p&gt;
&lt;p&gt;The MemoryManager takes care of allocating, accounting, and distributing MemorySegments to data processing operators such as sort and join operators. A &lt;a href=&quot;https://github.com/apache/flink/blob/release-0.9.0-milestone-1/flink-core/src/main/java/org/apache/flink/core/memory/MemorySegment.java&quot;&gt;MemorySegment&lt;/a&gt; is Flink’s distribution unit of memory and is backed by a regular Java byte array (32 KB in size by default). A MemorySegment provides very efficient write and read access to its backing byte array using Java’s unsafe methods. You can think of a MemorySegment as a custom-tailored version of Java’s NIO ByteBuffer. In order to operate on multiple MemorySegments as if they were a larger chunk of consecutive memory, Flink uses logical views that implement Java’s &lt;code&gt;java.io.DataOutput&lt;/code&gt; and &lt;code&gt;java.io.DataInput&lt;/code&gt; interfaces.&lt;/p&gt;
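&lt;p&gt;As a rough sketch of the abstraction (following the linked 0.9 sources; in a real job, segments are handed out by the MemoryManager rather than created by hand), a MemorySegment can be pictured like this:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.core.memory.MemorySegment;

public class MemorySegmentSketch {
  public static void main(String[] args) {
    // a segment wraps a plain byte[]; 32 KB is the default segment size
    MemorySegment segment = new MemorySegment(new byte[32 * 1024]);

    segment.putInt(0, 42);           // absolute, offset-based writes ...
    segment.putLong(4, 123456789L);
    int value = segment.getInt(0);   // ... and reads, backed by efficient unsafe operations

    System.out.println(value);       // prints 42
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;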
&lt;p&gt;MemorySegments are allocated once at TaskManager start-up time and are destroyed when the TaskManager is shut down. Hence, they are reused and not garbage-collected over the whole lifetime of a TaskManager. After all internal data structures of a TaskManager have been initialized and all core services have been started, the MemoryManager starts creating MemorySegments. By default 70% of the JVM heap that is available after service initialization is allocated by the MemoryManager. It is also possible to configure an absolute amount of managed memory. The remaining JVM heap is used for objects that are instantiated during task processing, including objects created by user-defined functions. The following figure shows the memory distribution in the TaskManager JVM after startup.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/memory-alloc.png&quot; style=&quot;width:60%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;h2 id=&quot;how-does-flink-serialize-objects&quot;&gt;How does Flink serialize objects?&lt;/h2&gt;
&lt;p&gt;The Java ecosystem offers several libraries to convert objects into a binary representation and back. Common alternatives are standard Java serialization, &lt;a href=&quot;https://github.com/EsotericSoftware/kryo&quot;&gt;Kryo&lt;/a&gt;, &lt;a href=&quot;http://avro.apache.org/&quot;&gt;Apache Avro&lt;/a&gt;, &lt;a href=&quot;http://thrift.apache.org/&quot;&gt;Apache Thrift&lt;/a&gt;, or Google’s &lt;a href=&quot;https://github.com/google/protobuf&quot;&gt;Protobuf&lt;/a&gt;. Flink includes its own custom serialization framework in order to control the binary representation of data. This is important because operating on binary data, such as comparing or even manipulating binary data, requires exact knowledge of the serialization layout. Further, configuring the serialization layout with respect to the operations that are performed on binary data can yield a significant performance boost. Flink’s serialization stack also leverages the fact that the types of the objects which go through de/serialization are exactly known before a program is executed.&lt;/p&gt;
&lt;p&gt;Flink programs can process data represented as arbitrary Java or Scala objects. Before a program is optimized, the data types at each processing step of the program’s data flow need to be identified. For Java programs, Flink features a reflection-based type extraction component to analyze the return types of user-defined functions. Scala programs are analyzed with help of the Scala compiler. Flink represents each data type with a &lt;a href=&quot;https://github.com/apache/flink/blob/release-0.9.0-milestone-1/flink-core/src/main/java/org/apache/flink/api/common/typeinfo/TypeInformation.java&quot;&gt;TypeInformation&lt;/a&gt;. Flink has TypeInformations for several kinds of data types, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;BasicTypeInfo: Any (boxed) Java primitive type or java.lang.String.&lt;/li&gt;
&lt;li&gt;BasicArrayTypeInfo: Any array of a (boxed) Java primitive type or java.lang.String.&lt;/li&gt;
&lt;li&gt;WritableTypeInfo: Any implementation of Hadoop’s Writable interface.&lt;/li&gt;
&lt;li&gt;TupleTypeInfo: Any Flink tuple (Tuple1 to Tuple25). Flink tuples are Java representations for fixed-length tuples with typed fields.&lt;/li&gt;
&lt;li&gt;CaseClassTypeInfo: Any Scala CaseClass (including Scala tuples).&lt;/li&gt;
&lt;li&gt;PojoTypeInfo: Any POJO (Java or Scala), i.e., an object with all fields either being public or accessible through getters and setters that follow the common naming conventions.&lt;/li&gt;
&lt;li&gt;GenericTypeInfo: Any data type that cannot be identified as another type.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each TypeInformation provides a serializer for the data type it represents. For example, a BasicTypeInfo returns a serializer that writes the respective primitive type, the serializer of a WritableTypeInfo delegates de/serialization to the write() and readFields() methods of the object implementing Hadoop’s Writable interface, and a GenericTypeInfo returns a serializer that delegates serialization to Kryo. Object serialization to a DataOutput which is backed by Flink MemorySegments goes automatically through Java’s efficient unsafe operations. For data types that can be used as keys, i.e., compared and hashed, the TypeInformation provides TypeComparators. TypeComparators compare and hash objects and can - depending on the concrete data type - also efficiently compare binary representations and extract fixed-length binary key prefixes.&lt;/p&gt;
&lt;p&gt;Tuple, Pojo, and CaseClass types are composite types, i.e., containers for one or more possibly nested data types. As such, their serializers and comparators are also composite and delegate the serialization and comparison of their member data types to the respective serializers and comparators. The following figure illustrates the serialization of a (nested) &lt;code&gt;Tuple3&amp;lt;Integer, Double, Person&amp;gt;&lt;/code&gt; object where &lt;code&gt;Person&lt;/code&gt; is a POJO and defined as follows:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Person&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/data-serialization.png&quot; style=&quot;width:80%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;Flink’s type system can be easily extended by providing custom TypeInformations, Serializers, and Comparators to improve the performance of serializing and comparing custom data types.&lt;/p&gt;
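&lt;p&gt;Besides implementing a full custom TypeInformation, types that would otherwise fall back to Kryo can also simply be pre-registered with the environment (see FLINK-1417/FLINK-1399 mentioned elsewhere on this blog). A minimal, hedged sketch, assuming the &lt;code&gt;registerType&lt;/code&gt; method of the 0.9 ExecutionEnvironment; the &lt;code&gt;MyEvent&lt;/code&gt; class is a made-up example type:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.api.java.ExecutionEnvironment;

public class TypeRegistration {

  // a made-up user type that appears in the program
  public static class MyEvent {
    public long timestamp;
    public String payload;
  }

  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // pre-register the class so the serializers can write a compact tag
    // instead of the full class name for every record
    env.registerType(MyEvent.class);

    // a custom Kryo serializer could be attached in a similar way via the
    // ExecutionConfig (see FLINK-1399); omitted here rather than guessing the exact signature
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;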
&lt;h2 id=&quot;how-does-flink-operate-on-binary-data&quot;&gt;How does Flink operate on binary data?&lt;/h2&gt;
&lt;p&gt;Similar to many other data processing APIs (including SQL), Flink’s APIs provide transformations to group, sort, and join data sets. These transformations operate on potentially very large data sets. Relational database systems have featured very efficient algorithms for these purposes for decades, including external merge-sort, merge-join, and hybrid hash-join. Flink builds on this technology, but generalizes it to handle arbitrary objects using its custom serialization and comparison stack. In the following, we show how Flink operates on binary data using its in-memory sort algorithm as an example.&lt;/p&gt;
&lt;p&gt;Flink assigns a memory budget to its data processing operators. Upon initialization, a sort algorithm requests its memory budget from the MemoryManager and receives a corresponding set of MemorySegments. The set of MemorySegments becomes the memory pool of a so-called sort buffer which collects the data that is to be sorted. The following figure illustrates how data objects are serialized into the sort buffer.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/sorting-binary-data-1.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;The sort buffer is internally organized into two memory regions. The first region holds the full binary data of all objects. The second region contains pointers to the full binary object data and - depending on the key data type - fixed-length sort keys. When an object is added to the sort buffer, its binary data is appended to the first region, and a pointer (and possibly a key) is appended to the second region. The separation of actual data and pointers plus fixed-length keys is done for two purposes. It enables efficient swapping of fixed-length entries (key+pointer) and also reduces the data that needs to be moved when sorting. If the sort key is a variable-length data type such as a String, the fixed-length sort key must be a prefix key such as the first n characters of a String. Note that not all data types provide a fixed-length (prefix) sort key. When serializing objects into the sort buffer, both memory regions are extended with MemorySegments from the memory pool. Once the memory pool is empty and no more objects can be added, the sort buffer is completely filled and can be sorted. Flink’s sort buffer provides methods to compare and swap elements. This makes the actual sort algorithm pluggable. By default, Flink uses a Quicksort implementation which can fall back to HeapSort.
The following figure shows how two objects are compared.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/sorting-binary-data-2.png&quot; style=&quot;width:80%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;The sort buffer compares two elements by comparing their binary fixed-length sort keys. The comparison is conclusive if it is either done on a full key (not a prefix key) or if the binary prefix keys are not equal. If the prefix keys are equal (or the sort key data type does not provide a binary prefix key), the sort buffer follows the pointers to the actual object data, deserializes both objects and compares them. Depending on the result of the comparison, the sort algorithm decides whether to swap the compared elements or not. The sort buffer swaps two elements by moving their fixed-length keys and pointers. The actual data is not moved. Once the sort algorithm finishes, the pointers in the sort buffer are correctly ordered. The following figure shows how the sorted data is returned from the sort buffer.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/sorting-binary-data-3.png&quot; style=&quot;width:80%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;The sorted data is returned by sequentially reading the pointer region of the sort buffer, skipping the sort keys and following the sorted pointers to the actual data. This data is either deserialized and returned as objects or the binary representation is copied and written to disk in case of an external merge-sort (see this &lt;a href=&quot;http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html&quot;&gt;blog post on joins in Flink&lt;/a&gt;).&lt;/p&gt;
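&lt;p&gt;Before looking at numbers, here is a small, self-contained illustration (deliberately simplified, and not Flink’s actual implementation) of the comparison rule described above: compare the fixed-length prefix keys first and deserialize the records only on a tie.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.nio.ByteBuffer;

/** Illustration only: each pointer-region entry is an 8-byte prefix key plus an 8-byte pointer. */
public final class PrefixKeyCompare {

  interface RecordComparator {
    /** Follows the pointers into the data region, deserializes both records, and compares them. */
    int compareDeserialized(long pointerA, long pointerB);
  }

  static int compareEntries(ByteBuffer pointerRegion, int offsetA, int offsetB,
                            RecordComparator fullComparator) {
    long prefixA = pointerRegion.getLong(offsetA);
    long prefixB = pointerRegion.getLong(offsetB);

    if (prefixA != prefixB) {
      // the prefixes differ: the order is decided on binary data alone
      return Long.compareUnsigned(prefixA, prefixB);   // assumes an order-preserving (normalized) key
    }
    // prefix tie: fall back to deserializing the records behind the pointers
    long pointerA = pointerRegion.getLong(offsetA + 8);
    long pointerB = pointerRegion.getLong(offsetB + 8);
    return fullComparator.compareDeserialized(pointerA, pointerB);
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;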
&lt;h2 id=&quot;show-me-numbers&quot;&gt;Show me numbers!&lt;/h2&gt;
&lt;p&gt;So, what does operating on binary data mean for performance? We’ll run a benchmark that sorts 10 million &lt;code&gt;Tuple2&amp;lt;Integer, String&amp;gt;&lt;/code&gt; objects to find out. The values of the Integer field are sampled from a uniform distribution. The String field values have a length of 12 characters and are sampled from a long-tail distribution. The input data is provided by an iterator that returns a mutable object, i.e., the same tuple object instance is returned with different field values. Flink uses this technique when reading data from memory, network, or disk to avoid unnecessary object instantiations. The benchmarks are run in a JVM with 900 MB heap size which is approximately the required amount of memory to store and sort 10 million tuple objects on the heap without dying of an &lt;code&gt;OutOfMemoryError&lt;/code&gt;. We sort the tuples on the Integer field and on the String field using three sorting methods:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Object-on-heap.&lt;/strong&gt; The tuples are stored in a regular &lt;code&gt;java.util.ArrayList&lt;/code&gt; with initial capacity set to 10 million entries and sorted using Java’s regular collection sort.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flink-serialized.&lt;/strong&gt; The tuple fields are serialized into a sort buffer of 600 MB size using Flink’s custom serializers, sorted as described above, and finally deserialized again. When sorting on the Integer field, the full Integer is used as the sort key such that the sort happens entirely on binary data (no deserialization of objects required). For sorting on the String field, an 8-byte prefix key is used and tuple objects are deserialized if the prefix keys are equal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kryo-serialized.&lt;/strong&gt; The tuple fields are serialized into a sort buffer of 600 MB size using Kryo serialization and sorted without binary sort keys. This means that each pair-wise comparison requires two objects to be deserialized.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;All sort methods are implemented using a single thread. The reported times are averaged over ten runs. After each run, we call &lt;code&gt;System.gc()&lt;/code&gt; to request a garbage collection run which does not go into measured execution time. The following figure shows the time to store the input data in memory, sort it, and read it back as objects.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/sort-benchmark.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;We see that Flink’s sort on binary data using its own serializers significantly outperforms the other two methods. Compared to the object-on-heap method, loading the data into memory is much faster. Since we actually collect the objects, there is no opportunity to reuse the object instances; every tuple has to be re-created. This is less efficient than Flink’s serializers (or Kryo serialization). On the other hand, reading objects from the heap comes for free compared to deserialization. In our benchmark, object cloning was more expensive than serialization and deserialization combined. Looking at the sorting time, we see that sorting on the binary representation is also faster than Java’s collection sort. Sorting data that was serialized using Kryo without binary sort keys is much slower than both other methods. This is due to the heavy deserialization overhead. Sorting the tuples on their String field is faster than sorting on the Integer field due to the long-tailed value distribution which significantly reduces the number of pair-wise comparisons. To get a better feeling of what is happening during sorting, we monitored the executing JVM using VisualVM. The following screenshots show heap memory usage, garbage collection activity and CPU usage over the execution of 10 runs.&lt;/p&gt;
&lt;table width=&quot;100%&quot;&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;center&gt;&lt;b&gt;Garbage Collection&lt;/b&gt;&lt;/center&gt;&lt;/th&gt;
&lt;th&gt;&lt;center&gt;&lt;b&gt;Memory Usage&lt;/b&gt;&lt;/center&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Object-on-Heap (int)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/objHeap-int-gc.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/objHeap-int-mem.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Flink-Serialized (int)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/flinkSer-int-gc.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/flinkSer-int-mem.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Kryo-Serialized (int)&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/kryoSer-int-gc.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src=&quot;/img/blog/kryoSer-int-mem.png&quot; style=&quot;width:80%&quot; /&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The experiments run single-threaded on an 8-core machine, so full utilization of one core only corresponds to a 12.5% overall utilization. The screenshots show that operating on binary data significantly reduces garbage collection activity. For the object-on-heap approach, the garbage collector runs in very short intervals while filling the sort buffer and causes a lot of CPU usage even for a single processing thread (sorting itself does not trigger the garbage collector). The JVM garbage collects with multiple parallel threads, explaining the high overall CPU utilization. On the other hand, the methods that operate on serialized data rarely trigger the garbage collector and have a much lower CPU utilization. In fact the garbage collector does not run at all if the tuples are sorted on the Integer field using the flink-serialized method because no objects need to be deserialized for pair-wise comparisons. The kryo-serialized method requires slightly more garbage collection since it does not use binary sort keys and deserializes two objects for each comparison.&lt;/p&gt;
&lt;p&gt;The memory usage charts show that the flink-serialized and kryo-serialized methods constantly occupy a high amount of memory (plus some objects for operation). This is due to the pre-allocation of MemorySegments. The actual memory usage is much lower, because the sort buffers are not completely filled. The following table shows the memory consumption of each method. 10 million records result in about 280 MB of binary data (object data plus pointers and sort keys), depending on the used serializer and the presence and size of a binary sort key. Comparing this to the memory requirements of the object-on-heap approach, we see that operating on binary data can significantly improve memory efficiency. In our benchmark, more than twice as much data can be sorted in-memory if it is serialized into a sort buffer instead of held as objects on the heap.&lt;/p&gt;
&lt;table width=&quot;100%&quot;&gt;
&lt;tr&gt;
&lt;th&gt;Occupied Memory&lt;/th&gt;
&lt;th&gt;Object-on-Heap&lt;/th&gt;
&lt;th&gt;Flink-Serialized&lt;/th&gt;
&lt;th&gt;Kryo-Serialized&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Sort on Integer&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;approx. 700 MB (heap)&lt;/td&gt;
&lt;td&gt;277 MB (sort buffer)&lt;/td&gt;
&lt;td&gt;266 MB (sort buffer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;Sort on String&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;approx. 700 MB (heap)&lt;/td&gt;
&lt;td&gt;315 MB (sort buffer)&lt;/td&gt;
&lt;td&gt;266 MB (sort buffer)&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;To summarize, the experiments verify the previously stated benefits of operating on binary data.&lt;/p&gt;
&lt;h2 id=&quot;were-not-done-yet&quot;&gt;We’re not done yet!&lt;/h2&gt;
&lt;p&gt;Apache Flink features quite a few advanced techniques to safely and efficiently process huge amounts of data with limited memory resources. However, there are a few points that could make Flink even more efficient. The Flink community is working on moving the managed memory to off-heap memory. This will allow for smaller JVMs, lower garbage collection overhead, and also easier system configuration. With Flink’s Table API, the semantics of all operations such as aggregations and projections are known (in contrast to black-box user-defined functions). Hence, we can generate code for Table API operations that directly operates on binary data. Further improvements include serialization layouts which are tailored towards the operations that are applied on the binary data, as well as code generation for serializers and comparators.&lt;/p&gt;
&lt;p&gt;The groundwork (and a lot more) for operating on binary data is done, but there is still some room for making Flink even better and faster. If you are crazy about performance and like to juggle with lots of bits and bytes, join the Flink community!&lt;/p&gt;
&lt;h2 id=&quot;tldr-give-me-three-things-to-remember&quot;&gt;TL;DR; Give me three things to remember!&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Flink’s active memory management avoids nasty &lt;code&gt;OutOfMemoryErrors&lt;/code&gt; that kill your JVMs and reduces garbage collection overhead.&lt;/li&gt;
&lt;li&gt;Flink features a highly efficient data de/serialization stack that facilitates operations on binary data and makes more data fit into memory.&lt;/li&gt;
&lt;li&gt;Flink’s DBMS-style operators operate natively on binary data yielding high performance in-memory and destage gracefully to disk if necessary.&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Mon, 11 May 2015 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html</link>
<guid isPermaLink="true">/news/2015/05/11/Juggling-with-Bits-and-Bytes.html</guid>
</item>
<item>
<title>Announcing Flink 0.9.0-milestone1 preview release</title>
<description>&lt;p&gt;The Apache Flink community is pleased to announce the availability of
the 0.9.0-milestone-1 release. The release is a preview of the
upcoming 0.9.0 release. It contains many new features which will be
available in the upcoming 0.9 release. Interested users are encouraged
to try it out and give feedback. As the version number indicates, this
release is a preview release that contains known issues.&lt;/p&gt;
&lt;p&gt;You can download the release
&lt;a href=&quot;http://flink.apache.org/downloads.html#preview&quot;&gt;here&lt;/a&gt; and check out the
latest documentation
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/&quot;&gt;here&lt;/a&gt;. Feedback
through the Flink &lt;a href=&quot;http://flink.apache.org/community.html#mailing-lists&quot;&gt;mailing
lists&lt;/a&gt; is, as
always, very welcome!&lt;/p&gt;
&lt;h2 id=&quot;new-features&quot;&gt;New Features&lt;/h2&gt;
&lt;h3 id=&quot;table-api&quot;&gt;Table API&lt;/h3&gt;
&lt;p&gt;Flink’s new Table API offers a higher-level abstraction for
interacting with structured data sources. The Table API allows users
to execute logical, SQL-like queries on distributed data sets while
allowing them to freely mix declarative queries with regular Flink
operators. Here is an example that groups and joins two tables:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clickCounts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clicks&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;user&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;activeUsers&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clickCounts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;userId&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;username&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Tables consist of logical attributes that can be selected by name
rather than physical Java and Scala data types. This alleviates a lot
of boilerplate code for common ETL tasks and raises the abstraction
for Flink programs. Tables are available for both static and streaming
data sources (DataSet and DataStream APIs).&lt;/p&gt;
&lt;p&gt;Check out the Table guide for Java and Scala
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;gelly-graph-processing-api&quot;&gt;Gelly Graph Processing API&lt;/h3&gt;
&lt;p&gt;Gelly is a Java Graph API for Flink. It contains a set of utilities
for graph analysis, support for iterative graph processing, and a
library of graph algorithms. Gelly exposes a Graph data structure that
wraps DataSets for vertices and edges. It provides methods for creating
graphs from DataSets, graph transformations and utilities (e.g., in-
and out-degrees of vertices), neighborhood aggregations, iterative
vertex-centric graph processing, and a library of common graph
algorithms, including PageRank, SSSP, label propagation, and community
detection.&lt;/p&gt;
&lt;p&gt;Gelly internally builds on top of Flink’s &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html&quot;&gt;delta
iterations&lt;/a&gt;. Iterative
graph algorithms are executed leveraging mutable state, achieving
performance comparable to that of specialized graph processing systems.&lt;/p&gt;
&lt;p&gt;Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check
out the Gelly guide
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;flink-machine-learning-library&quot;&gt;Flink Machine Learning Library&lt;/h3&gt;
&lt;p&gt;This release includes the first version of Flink’s Machine Learning
library. The library’s pipeline approach, which has been strongly
inspired by scikit-learn’s abstraction of transformers and estimators,
makes it easy to quickly set up a data processing pipeline and to get
your job done.&lt;/p&gt;
&lt;p&gt;Flink distinguishes between transformers and learners. Transformers
are components which transform your input data into a new format,
allowing you to extract features, cleanse your data, or sample from
it. Learners, on the other hand, are the components which take
your input data and train a model on it. The model you obtain from the
learner can then be evaluated and used to make predictions on unseen
data.&lt;/p&gt;
&lt;p&gt;Currently, the machine learning library contains transformers and
learners for multiple tasks. The library supports multiple linear
regression using a stochastic gradient descent implementation to scale to
large data sizes. Furthermore, it includes an alternating least
squares (ALS) implementation to factorize large matrices. The matrix
factorization can be used for collaborative filtering. An
implementation of the communication-efficient distributed dual
coordinate ascent (CoCoA) algorithm is the latest addition to the
library. The CoCoA algorithm can be used to train distributed
soft-margin SVMs.&lt;/p&gt;
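&lt;p&gt;To make the fit-and-predict workflow above more tangible, here is a
small sketch of training a multiple linear regression model with the
library’s Scala API. The class, method, and parameter names are
assumptions based on the library at the time and may differ in detail;
please consult the library’s documentation for the exact API.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// label/feature pairs, e.g. parsed from a CSV file read via env.readCsvFile
val training: DataSet[LabeledVector] = ???
// unlabeled feature vectors to predict on
val unseen: DataSet[Vector] = ???

// configure the learner (parameter names are assumptions)
val mlr = MultipleLinearRegression()
  .setIterations(10)
  .setStepsize(0.5)

// the learner trains a model, which is then used to predict unseen data
val model = mlr.fit(training)
val predictions = model.predict(unseen)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;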
&lt;h3 id=&quot;flink-on-yarn-leveraging-apache-tez&quot;&gt;Flink on YARN leveraging Apache Tez&lt;/h3&gt;
&lt;p&gt;We are introducing a new execution mode for Flink to be able to run
restricted Flink programs on top of &lt;a href=&quot;http://tez.apache.org&quot;&gt;Apache
Tez&lt;/a&gt;. This mode retains Flink’s APIs,
optimizer, as well as Flink’s runtime operators, but instead of
wrapping those in Flink tasks that are executed by Flink TaskManagers,
it wraps them in Tez runtime tasks and builds a Tez DAG that
represents the program.&lt;/p&gt;
&lt;p&gt;By using Flink on Tez, users have an additional choice for an
execution platform for Flink programs. While Flink’s distributed
runtime favors low latency, streaming shuffles, and iterative
algorithms, Tez focuses on scalability and elastic resource usage in
shared YARN clusters.&lt;/p&gt;
&lt;p&gt;Get started with Flink on Tez
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/setup/flink_on_tez.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;reworked-distributed-runtime-on-akka&quot;&gt;Reworked Distributed Runtime on Akka&lt;/h3&gt;
&lt;p&gt;Flink’s RPC system has been replaced by the widely adopted
&lt;a href=&quot;http://akka.io&quot;&gt;Akka&lt;/a&gt; framework. Akka’s concurrency model offers the
right abstraction to develop a fast and robust distributed
system. By using Akka’s own failure detection mechanism, the stability
of Flink’s runtime is significantly improved because the system can
now react appropriately to node outages. Furthermore, Akka improves
Flink’s scalability by introducing asynchronous messages to the
system. These asynchronous messages allow Flink to run on many more
nodes than before.&lt;/p&gt;
&lt;h3 id=&quot;exactly-once-processing-on-kafka-streaming-sources&quot;&gt;Exactly-once processing on Kafka Streaming Sources&lt;/h3&gt;
&lt;p&gt;This release introduces stream processing with exactly-once delivery
guarantees for Flink streaming programs that analyze streaming sources
that are persisted by &lt;a href=&quot;http://kafka.apache.org&quot;&gt;Apache Kafka&lt;/a&gt;. The
system internally tracks the Kafka offsets to ensure that Flink
can pick up data from Kafka where it left off in case of a failure.&lt;/p&gt;
&lt;p&gt;Read
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#apache-kafka&quot;&gt;here&lt;/a&gt;
on how to use the persistent Kafka source.&lt;/p&gt;
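&lt;p&gt;For orientation, the following sketch shows roughly what consuming
a Kafka topic with the persistent source looks like from the Scala
streaming API. The class name, package, and constructor arguments
(ZooKeeper address, topic, deserialization schema) are assumptions
based on the guide linked above; please check the documentation for
the exact signature.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.api.persistent.PersistentKafkaSource
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

// hypothetical connection settings; the consumed offsets are tracked by Flink,
// so the job can resume where it left off after a failure
val stream = env.addSource(
  new PersistentKafkaSource[String](&amp;quot;localhost:2181&amp;quot;, &amp;quot;my-topic&amp;quot;, new SimpleStringSchema()))

stream.print()
env.execute(&amp;quot;Kafka example&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;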
&lt;h3 id=&quot;improved-yarn-support&quot;&gt;Improved YARN support&lt;/h3&gt;
&lt;p&gt;Flink’s YARN client contains several improvements, such as a detached
mode for starting a YARN session in the background and the ability to
submit a single Flink job to a YARN cluster without starting a
session, including a “fire and forget” mode. Flink is now also able to
reallocate failed YARN containers to maintain the size of the
requested cluster. This feature makes it possible to implement fault-tolerant
setups on top of YARN. There is also an internal Java API to deploy
and control a running YARN cluster. It is being used by system
integrators to easily control Flink on YARN within their Hadoop 2
clusters.&lt;/p&gt;
&lt;p&gt;See the YARN docs
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;more-improvements-and-fixes&quot;&gt;More Improvements and Fixes&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1605&quot;&gt;FLINK-1605&lt;/a&gt;:
Flink is not exposing its Guava and ASM dependencies to Maven
projects depending on Flink. We use the maven-shade-plugin to
relocate these dependencies into our own namespace. This allows
users to use any Guava or ASM version.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1417&quot;&gt;FLINK-1417&lt;/a&gt;:
Automatic recognition and registration of Java types with Kryo and the
internal serializers: Flink has its own type handling and
serialization framework, falling back to Kryo for types that it cannot
handle. To get the best performance, Flink automatically registers
with Kryo all types that a user uses in their program. Flink also
registers serializers for Protocol Buffers, Thrift, Avro and Joda-Time
automatically. Users can also manually register serializers with Kryo
(&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1399&quot;&gt;FLINK-1399&lt;/a&gt;).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1296&quot;&gt;FLINK-1296&lt;/a&gt;: Add
support for sorting very large records&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1679&quot;&gt;FLINK-1679&lt;/a&gt;:
“degreeOfParallelism” methods renamed to “parallelism”&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1501&quot;&gt;FLINK-1501&lt;/a&gt;: Add
metrics library for monitoring TaskManagers&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1760&quot;&gt;FLINK-1760&lt;/a&gt;: Add
support for building Flink with Scala 2.11&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1648&quot;&gt;FLINK-1648&lt;/a&gt;: Add
a mode where the system automatically sets the parallelism to the
available task slots&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1622&quot;&gt;FLINK-1622&lt;/a&gt;: Add
groupCombine operator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1589&quot;&gt;FLINK-1589&lt;/a&gt;: Add
option to pass Configuration to LocalExecutor&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1504&quot;&gt;FLINK-1504&lt;/a&gt;: Add
support for accessing secured HDFS clusters in standalone mode&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1478&quot;&gt;FLINK-1478&lt;/a&gt;: Add
strictly local input split assignment&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1512&quot;&gt;FLINK-1512&lt;/a&gt;: Add
CsvReader for reading into POJOs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1461&quot;&gt;FLINK-1461&lt;/a&gt;: Add
sortPartition operator&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1450&quot;&gt;FLINK-1450&lt;/a&gt;: Add
Fold operator to the Streaming api&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1389&quot;&gt;FLINK-1389&lt;/a&gt;:
Allow setting custom file extensions for files created by the
FileOutputFormat&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1236&quot;&gt;FLINK-1236&lt;/a&gt;: Add
support for localization of Hadoop Input Splits&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1179&quot;&gt;FLINK-1179&lt;/a&gt;: Add
button to JobManager web interface to request stack trace of a
TaskManager&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1105&quot;&gt;FLINK-1105&lt;/a&gt;: Add
support for locally sorted output&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1688&quot;&gt;FLINK-1688&lt;/a&gt;: Add
socket sink&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/browse/FLINK-1436&quot;&gt;FLINK-1436&lt;/a&gt;:
Improve usability of command line interface&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Mon, 13 Apr 2015 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2015/04/13/release-0.9.0-milestone1.html</link>
<guid isPermaLink="true">/news/2015/04/13/release-0.9.0-milestone1.html</guid>
</item>
<item>
<title>March 2015 in the Flink community</title>
<description>&lt;p&gt;March has been a busy month in the Flink community.&lt;/p&gt;
&lt;h3 id=&quot;flink-runner-for-google-cloud-dataflow&quot;&gt;Flink runner for Google Cloud Dataflow&lt;/h3&gt;
&lt;p&gt;A Flink runner for Google Cloud Dataflow was announced. See the blog
posts by &lt;a href=&quot;http://data-artisans.com/dataflow.html&quot;&gt;data Artisans&lt;/a&gt; and
the &lt;a href=&quot;http://googlecloudplatform.blogspot.de/2015/03/announcing-Google-Cloud-Dataflow-runner-for-Apache-Flink.html&quot;&gt;Google Cloud Platform Blog&lt;/a&gt;.
Google Cloud Dataflow programs can be written using an open-source
SDK and run in multiple backends, either as a managed service inside
Google’s infrastructure, or leveraging open source runners,
including Apache Flink.&lt;/p&gt;
&lt;h3 id=&quot;learn-about-the-internals-of-flink&quot;&gt;Learn about the internals of Flink&lt;/h3&gt;
&lt;p&gt;The community has started an effort to better document the internals
of Flink. Check out the first articles on the Flink wiki on &lt;a href=&quot;https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525&quot;&gt;how Flink
manages
memory&lt;/a&gt;,
&lt;a href=&quot;https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks&quot;&gt;how tasks in Flink exchange
data&lt;/a&gt;,
&lt;a href=&quot;https://cwiki.apache.org/confluence/display/FLINK/Type+System%2C+Type+Extraction%2C+Serialization&quot;&gt;type extraction and serialization in
Flink&lt;/a&gt;,
as well as &lt;a href=&quot;https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors&quot;&gt;how Flink builds on Akka for distributed
coordination&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Check out also the &lt;a href=&quot;http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html&quot;&gt;new blog
post&lt;/a&gt;
on how Flink executes joins with several insights into Flink’s runtime.&lt;/p&gt;
&lt;h3 id=&quot;meetups-and-talks&quot;&gt;Meetups and talks&lt;/h3&gt;
&lt;p&gt;Flink’s machine learning efforts were presented at the &lt;a href=&quot;http://www.meetup.com/Machine-Learning-Stockholm/events/221144997/&quot;&gt;Machine
Learning Stockholm meetup
group&lt;/a&gt;. The
regular Berlin Flink meetup featured a talk on the past, present, and
future of Flink. The talk is available on
&lt;a href=&quot;https://www.youtube.com/watch?v=fw2DBE6ZiEQ&amp;amp;feature=youtu.be&quot;&gt;youtube&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;in-the-flink-master&quot;&gt;In the Flink master&lt;/h2&gt;
&lt;h3 id=&quot;table-api-in-scala-and-java&quot;&gt;Table API in Scala and Java&lt;/h3&gt;
&lt;p&gt;The new &lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-table&quot;&gt;Table
API&lt;/a&gt;
in Flink is now available in both Java and Scala. Check out the
examples &lt;a href=&quot;https://github.com/apache/flink/blob/master/flink-staging/flink-table/src/main/java/org/apache/flink/examples/java/JavaTableExample.java&quot;&gt;here (Java)&lt;/a&gt; and &lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-table/src/main/scala/org/apache/flink/examples/scala&quot;&gt;here (Scala)&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;additions-to-the-machine-learning-library&quot;&gt;Additions to the Machine Learning library&lt;/h3&gt;
&lt;p&gt;Flink’s &lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-ml&quot;&gt;Machine Learning
library&lt;/a&gt;
is seeing quite a bit of traction. Recent additions include the &lt;a href=&quot;http://arxiv.org/abs/1409.1458&quot;&gt;CoCoA
algorithm&lt;/a&gt; for distributed
optimization.&lt;/p&gt;
&lt;h3 id=&quot;exactly-once-delivery-guarantees-for-streaming-jobs&quot;&gt;Exactly-once delivery guarantees for streaming jobs&lt;/h3&gt;
&lt;p&gt;Flink streaming jobs now provide exactly once processing guarantees
when coupled with persistent sources (notably &lt;a href=&quot;http://kafka.apache.org&quot;&gt;Apache
Kafka&lt;/a&gt;). Flink periodically checkpoints and
persists the offsets of the sources and restarts from those
checkpoints at failure recovery. This functionality is currently
limited in that it does not yet handle large state and iterative
programs.&lt;/p&gt;
&lt;h3 id=&quot;flink-on-tez&quot;&gt;Flink on Tez&lt;/h3&gt;
&lt;p&gt;A new execution environment enables non-iterative Flink jobs to use
Tez as an execution backend instead of Flink’s own network stack. Learn more
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/setup/flink_on_tez.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Tue, 07 Apr 2015 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2015/04/07/march-in-flink.html</link>
<guid isPermaLink="true">/news/2015/04/07/march-in-flink.html</guid>
</item>
<item>
<title>Peeking into Apache Flink&#39;s Engine Room</title>
<description>&lt;h3 id=&quot;join-processing-in-apache-flink&quot;&gt;Join Processing in Apache Flink&lt;/h3&gt;
&lt;p&gt;Joins are prevalent operations in many data processing applications. Most data processing systems feature APIs that make joining data sets very easy. However, the internal algorithms for join processing are much more involved – especially if large data sets need to be efficiently handled. Therefore, join processing serves as a good example to discuss the salient design points and implementation details of a data processing system.&lt;/p&gt;
&lt;p&gt;In this blog post, we cut through Apache Flink’s layered architecture and take a look at its internals with a focus on how it handles joins. Specifically, I will&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;show how easy it is to join data sets using Flink’s fluent APIs,&lt;/li&gt;
&lt;li&gt;discuss basic distributed join strategies, Flink’s join implementations, and its memory management,&lt;/li&gt;
&lt;li&gt;talk about Flink’s optimizer that automatically chooses join strategies,&lt;/li&gt;
&lt;li&gt;show some performance numbers for joining data sets of different sizes, and finally&lt;/li&gt;
&lt;li&gt;briefly discuss joining of co-located and pre-sorted data sets.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;em&gt;Disclaimer&lt;/em&gt;: This blog post is exclusively about equi-joins. Whenever I say “join” in the following, I actually mean “equi-join”.&lt;/p&gt;
&lt;h3 id=&quot;how-do-i-join-with-flink&quot;&gt;How do I join with Flink?&lt;/h3&gt;
&lt;p&gt;Flink provides fluent APIs in Java and Scala to write data flow programs. Flink’s APIs are centered around parallel data collections which are called data sets. Data sets are processed by applying transformations that compute new data sets. Flink’s transformations include Map and Reduce as known from MapReduce &lt;a href=&quot;http://research.google.com/archive/mapreduce.html&quot;&gt;[1]&lt;/a&gt; but also operators for joining, co-grouping, and iterative processing. The documentation gives an overview of all available transformations &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html&quot;&gt;[2]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Joining two Scala case class data sets is very easy as the following example shows:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// define your data types&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PageVisit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ip&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;userId&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;User&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;email&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;country&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// get your data from somewhere&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;visits&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;PageVisit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;User&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// filter the users data set&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;germanUsers&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;users&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;country&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;equals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;de&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// join data sets&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;germanVisits&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;PageVisit&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;User&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// equi-join condition (PageVisit.userId = User.id)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;visits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;germanUsers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;userId&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;equalTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;id&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Flink’s APIs also allow you to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;apply a user-defined join function to each pair of joined elements instead of returning a &lt;code&gt;($Left, $Right)&lt;/code&gt; tuple (sketched below),&lt;/li&gt;
&lt;li&gt;select fields of pairs of joined Tuple elements (projection), and&lt;/li&gt;
&lt;li&gt;define composite join keys such as &lt;code&gt;.where(&amp;quot;orderDate&amp;quot;, &amp;quot;zipCode&amp;quot;).equalTo(&amp;quot;date&amp;quot;, &amp;quot;zip&amp;quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
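&lt;p&gt;As a small sketch of the first and third points (reusing the case classes and imports from the example above; one of several equivalent ways to write this):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// apply a join function instead of building (PageVisit, User) tuples
val visitedUrlsByName: DataSet[(String, String)] =
  visits.join(germanUsers).where(&amp;quot;userId&amp;quot;).equalTo(&amp;quot;id&amp;quot;) {
    (visit, user) =&amp;gt; (user.name, visit.url)
  }

// composite join keys (hypothetical data sets with the fields named in the list above):
// orders.join(deliveries).where(&amp;quot;orderDate&amp;quot;, &amp;quot;zipCode&amp;quot;).equalTo(&amp;quot;date&amp;quot;, &amp;quot;zip&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;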
&lt;p&gt;See the documentation for more details on Flink’s join features &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html#join&quot;&gt;[3]&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;how-does-flink-join-my-data&quot;&gt;How does Flink join my data?&lt;/h3&gt;
&lt;p&gt;Flink uses techniques which are well known from parallel database systems to efficiently execute parallel joins. A join operator must establish all pairs of elements from its input data sets for which the join condition evaluates to true. In a standalone system, the most straightforward implementation of a join is the so-called nested-loop join, which builds the full Cartesian product and evaluates the join condition for each pair of elements. This strategy has quadratic complexity and obviously does not scale to large inputs.&lt;/p&gt;
&lt;p&gt;In a distributed system, joins are commonly processed in two steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The data of both inputs is distributed across all parallel instances that participate in the join and&lt;/li&gt;
&lt;li&gt;each parallel instance performs a standard stand-alone join algorithm on its local partition of the overall data.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The distribution of data across parallel instances must ensure that each valid join pair can be locally built by exactly one instance. For both steps, there are multiple valid strategies that can be independently picked and which are favorable in different situations. In Flink terminology, the first phase is called Ship Strategy and the second phase Local Strategy. In the following I will describe Flink’s ship and local strategies to join two data sets &lt;em&gt;R&lt;/em&gt; and &lt;em&gt;S&lt;/em&gt;.&lt;/p&gt;
&lt;h4 id=&quot;ship-strategies&quot;&gt;Ship Strategies&lt;/h4&gt;
&lt;p&gt;Flink features two ship strategies to establish a valid data partitioning for a join:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;em&gt;Repartition-Repartition&lt;/em&gt; strategy (RR) and&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;Broadcast-Forward&lt;/em&gt; strategy (BF).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Repartition-Repartition strategy partitions both inputs, R and S, on their join key attributes using the same partitioning function. Each partition is assigned to exactly one parallel join instance and all data of that partition is sent to its associated instance. This ensures that all elements that share the same join key are shipped to the same parallel instance and can be locally joined. The cost of the RR strategy is a full shuffle of both data sets over the network.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-repartition.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
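&lt;p&gt;The core idea of the RR strategy can be illustrated with a deliberately simplified sketch (plain Scala, not Flink’s actual code): both inputs route every record through the same hash function on the join key, so records with equal keys always end up at the same parallel instance.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// toy routing function used by BOTH inputs R and S:
// equal join keys always map to the same parallel join instance
def targetInstance(joinKey: Any, parallelism: Int): Int = {
  val h = joinKey.hashCode % parallelism
  if (h &amp;lt; 0) h + parallelism else h
}

// records with key 42 from R and from S are shipped to the same instance
val instanceForR = targetInstance(42, 8)
val instanceForS = targetInstance(42, 8)
assert(instanceForR == instanceForS)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;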
&lt;p&gt;The Broadcast-Forward strategy sends one complete data set (R) to each parallel instance that holds a partition of the other data set (S), i.e., each parallel instance receives the full data set R. Data set S remains local and is not shipped at all. The cost of the BF strategy depends on the size of R and the number of parallel instances it is shipped to. The size of S does not matter because S is not moved. The figure below illustrates how both ship strategies work.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-broadcast.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;The Repartition-Repartition and Broadcast-Forward ship strategies establish suitable data distributions to execute a distributed join. Depending on the operations that are applied before the join, one or even both inputs of a join are already distributed in a suitable way across parallel instances. In this case, Flink will reuse such distributions and only ship one or no input at all.&lt;/p&gt;
&lt;h4 id=&quot;flinks-memory-management&quot;&gt;Flink’s Memory Management&lt;/h4&gt;
&lt;p&gt;Before delving into the details of Flink’s local join algorithms, I will briefly discuss Flink’s internal memory management. Data processing algorithms such as joining, grouping, and sorting need to hold portions of their input data in memory. While such algorithms perform best if there is enough memory available to hold all data, it is crucial to gracefully handle situations where the data size exceeds memory. Such situations are especially tricky in JVM-based systems such as Flink because the system needs to reliably recognize that it is short on memory. Failure to detect such situations can result in an &lt;code&gt;OutOfMemoryException&lt;/code&gt; and kill the JVM.&lt;/p&gt;
&lt;p&gt;Flink handles this challenge by actively managing its memory. When a worker node (TaskManager) is started, it allocates a fixed portion (70% by default) of the JVM’s heap memory that is available after initialization as 32KB byte arrays. These byte arrays are distributed as working memory to all algorithms that need to hold significant portions of data in memory. The algorithms receive their input data as Java data objects and serialize them into their working memory.&lt;/p&gt;
&lt;p&gt;This design has several nice properties. First, the number of data objects on the JVM heap is much lower resulting in less garbage collection pressure. Second, objects on the heap have a certain space overhead and the binary representation is more compact. Especially data sets of many small elements benefit from that. Third, an algorithm knows exactly when the input data exceeds its working memory and can react by writing some of its filled byte arrays to the worker’s local filesystem. After the content of a byte array is written to disk, it can be reused to process more data. Reading data back into memory is as simple as reading the binary data from the local filesystem. The following figure illustrates Flink’s memory management.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-memmgmt.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;This active memory management makes Flink extremely robust for processing very large data sets on limited memory resources while preserving all benefits of in-memory processing if data is small enough to fit in-memory. De/serializing data into and from memory has a certain cost overhead compared to simply holding all data elements on the JVM’s heap. However, Flink features efficient custom de/serializers which also make it possible to perform certain operations such as comparisons directly on serialized data without deserializing data objects from memory.&lt;/p&gt;
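&lt;p&gt;The page-based working memory can be illustrated with a strongly simplified sketch (plain Scala and Java serialization, not Flink’s internal classes or its custom serializers): records are serialized into a fixed-size byte array, and when the array is full it is spilled to a local file and reused instead of failing with an &lt;code&gt;OutOfMemoryError&lt;/code&gt;.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;import java.io.{ByteArrayOutputStream, FileOutputStream, ObjectOutputStream}

// toy memory page: a fixed-size byte array that is spilled to disk when full
// (records larger than one page are ignored for brevity)
class ToyPage(sizeBytes: Int, spillPath: String) {
  private val page = new Array[Byte](sizeBytes)
  private var used = 0

  def add(record: java.io.Serializable): Unit = {
    val bytes = serialize(record)
    if (used + bytes.length &amp;gt; page.length) spill() // page full: write it out, then reuse it
    System.arraycopy(bytes, 0, page, used, bytes.length)
    used += bytes.length
  }

  private def serialize(record: java.io.Serializable): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(record)
    oos.close()
    bos.toByteArray
  }

  private def spill(): Unit = {
    val out = new FileOutputStream(spillPath, true) // append the spilled page to a local file
    out.write(page, 0, used)
    out.close()
    used = 0 // the page can now be reused for more records
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;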
&lt;h4 id=&quot;local-strategies&quot;&gt;Local Strategies&lt;/h4&gt;
&lt;p&gt;After the data has been distributed across all parallel join instances using either a Repartition-Repartition or Broadcast-Forward ship strategy, each instance runs a local join algorithm to join the elements of its local partition. Flink’s runtime features two common join strategies to perform these local joins:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the &lt;em&gt;Sort-Merge-Join&lt;/em&gt; strategy (SM) and&lt;/li&gt;
&lt;li&gt;the &lt;em&gt;Hybrid-Hash-Join&lt;/em&gt; strategy (HH).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Sort-Merge-Join works by first sorting both input data sets on their join key attributes (Sort Phase) and merging the sorted data sets as a second step (Merge Phase). The sort is done in-memory if the local partition of a data set is small enough. Otherwise, an external merge-sort is done by collecting data until the working memory is filled, sorting it, writing the sorted data to the local filesystem, and starting over by filling the working memory again with more incoming data. After all input data has been received, sorted, and written as sorted runs to the local file system, a fully sorted stream can be obtained. This is done by reading the partially sorted runs from the local filesystem and sort-merging the records on the fly. Once the sorted streams of both inputs are available, both streams are sequentially read and merge-joined in a zig-zag fashion by comparing the sorted join key attributes, building join element pairs for matching keys, and advancing the sorted stream with the lower join key. The figure below shows how the Sort-Merge-Join strategy works.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-smj.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
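&lt;p&gt;The merge phase itself is simple once both inputs are sorted. A compact, simplified sketch (plain Scala on in-memory sequences, not Flink’s actual implementation):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// zig-zag merge of two inputs that are already sorted by their integer join key:
// advance the side with the smaller key and emit pairs whenever the keys match
def mergeJoin[A, B](r: IndexedSeq[(Int, A)], s: IndexedSeq[(Int, B)]): Seq[(A, B)] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(A, B)]
  var i = 0
  var j = 0
  while (i &amp;lt; r.length &amp;amp;&amp;amp; j &amp;lt; s.length) {
    val kr = r(i)._1
    val ks = s(j)._1
    if (kr &amp;lt; ks) i += 1
    else if (kr &amp;gt; ks) j += 1
    else {
      // collect the run of equal keys on both sides and emit all combinations
      val ri = i
      val sj = j
      while (i &amp;lt; r.length &amp;amp;&amp;amp; r(i)._1 == kr) i += 1
      while (j &amp;lt; s.length &amp;amp;&amp;amp; s(j)._1 == kr) j += 1
      for (x &amp;lt;- ri until i; y &amp;lt;- sj until j) out += ((r(x)._2, s(y)._2))
    }
  }
  out.toSeq
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;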
&lt;p&gt;The Hybrid-Hash-Join distinguishes its inputs as build-side and probe-side input and works in two phases, a build phase followed by a probe phase. In the build phase, the algorithm reads the build-side input and inserts all data elements into an in-memory hash table indexed by their join key attributes. If the hash table outgrows the algorithm’s working memory, parts of the hash table (ranges of hash indexes) are written to the local filesystem. The build phase ends after the build-side input has been fully consumed. In the probe phase, the algorithm reads the probe-side input and probes the hash table for each element using its join key attribute. If the element falls into a hash index range that was spilled to disk, the element is also written to disk. Otherwise, the element is immediately joined with all matching elements from the hash table. If the hash table completely fits into the working memory, the join is finished after the probe-side input has been fully consumed. Otherwise, the current hash table is dropped and a new hash table is built using spilled parts of the build-side input. This hash table is probed by the corresponding parts of the spilled probe-side input. Eventually, all data is joined. Hybrid-Hash-Joins perform best if the hash table completely fits into the working memory because an arbitrarily large probe-side input can be processed on-the-fly without materializing it. However, even if the build-side input does not fit into memory, the Hybrid-Hash-Join has very nice properties. In this case, in-memory processing is partially preserved and only a fraction of the build-side and probe-side data needs to be written to and read from the local filesystem. The next figure illustrates how the Hybrid-Hash-Join works.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-hhj.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;
&lt;/center&gt;
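&lt;p&gt;Stripped of spilling, the core of a hash join is equally compact. A simplified in-memory sketch (plain Scala; Flink’s Hybrid-Hash-Join additionally partitions the hash table and spills partitions when the working memory is exhausted):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// build an in-memory hash table from the (smaller) build side,
// then stream the probe side over it and emit matching pairs
def hashJoin[A, B](build: Seq[(Int, A)], probe: Seq[(Int, B)]): Seq[(A, B)] = {
  val table: Map[Int, Seq[A]] = build.groupBy(_._1).map { case (k, vs) =&amp;gt; (k, vs.map(_._2)) }
  for {
    (key, probeElem) &amp;lt;- probe
    buildElem &amp;lt;- table.getOrElse(key, Seq.empty)
  } yield (buildElem, probeElem)
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;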
&lt;h3 id=&quot;how-does-flink-choose-join-strategies&quot;&gt;How does Flink choose join strategies?&lt;/h3&gt;
&lt;p&gt;Ship and local strategies do not depend on each other and can be independently chosen. Therefore, Flink can execute a join of two data sets R and S in nine different ways by combining any of the three ship strategies (RR, BF with R being broadcasted, BF with S being broadcasted) with any of the three local strategies (SM, HH with R being build-side, HH with S being build-side). Each of these strategy combinations results in different execution performance depending on the data sizes and the available amount of working memory. In case of a small data set R and a much larger data set S, broadcasting R and using it as build-side input of a Hybrid-Hash-Join is usually a good choice because the much larger data set S is not shipped and not materialized (given that the hash table completely fits into memory). If both data sets are rather large or the join is performed on many parallel instances, repartitioning both inputs is a robust choice.&lt;/p&gt;
&lt;p&gt;Flink features a cost-based optimizer which automatically chooses the execution strategies for all operators including joins. Without going into the details of cost-based optimization, this is done by computing cost estimates for execution plans with different strategies and picking the plan with the least estimated costs. Thereby, the optimizer estimates the amount of data which is shipped over the network and written to disk. If no reliable size estimates for the input data can be obtained, the optimizer falls back to robust default choices. A key feature of the optimizer is to reason about existing data properties. For example, if the data of one input is already partitioned in a suitable way, the generated candidate plans will not repartition this input. Hence, the choice of an RR ship strategy becomes more likely. The same applies for previously sorted data and the Sort-Merge-Join strategy. Flink programs can help the optimizer to reason about existing data properties by providing semantic information about user-defined functions &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#semantic-annotations&quot;&gt;[4]&lt;/a&gt;. While the optimizer is a killer feature of Flink, it can happen that a user knows better than the optimizer how to execute a specific join. Similar to relational database systems, Flink offers optimizer hints to tell the optimizer which join strategies to pick &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#join-algorithm-hints&quot;&gt;[5]&lt;/a&gt;.&lt;/p&gt;
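&lt;p&gt;Such hints are given directly in the fluent API. For instance, the DataSet API offers size hints such as &lt;code&gt;joinWithTiny&lt;/code&gt; and &lt;code&gt;joinWithHuge&lt;/code&gt;; a short sketch reusing the data sets from the example above (see [5] for the full set of join algorithm hints):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// hint that the second input is much smaller than the first
val joinedWithTinyUsers = visits.joinWithTiny(germanUsers).where(&amp;quot;userId&amp;quot;).equalTo(&amp;quot;id&amp;quot;)

// hint that the second input is much larger than the first
val joinedWithHugeUsers = visits.joinWithHuge(germanUsers).where(&amp;quot;userId&amp;quot;).equalTo(&amp;quot;id&amp;quot;)&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;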
&lt;h3 id=&quot;how-is-flinks-join-performance&quot;&gt;How is Flink’s join performance?&lt;/h3&gt;
&lt;p&gt;Alright, that sounds good, but how fast are joins in Flink? Let’s have a look. We start with a benchmark of the single-core performance of Flink’s Hybrid-Hash-Join implementation and run a Flink program that executes a Hybrid-Hash-Join with parallelism 1. We run the program on a n1-standard-2 Google Compute Engine instance (2 vCPUs, 7.5GB memory) with two locally attached SSDs. We give 4GB as working memory to the join. The join program generates 1KB records for both inputs on-the-fly, i.e., the data is not read from disk. We run 1:N (Primary-Key/Foreign-Key) joins and generate the smaller input with unique Integer join keys and the larger input with randomly chosen Integer join keys that fall into the key range of the smaller input. Hence, each tuple of the larger side joins with exactly one tuple of the smaller side. The result of the join is immediately discarded. We vary the size of the build-side input from 1 million to 12 million elements (1GB to 12GB). The probe-side input is kept constant at 64 million elements (64GB). The following chart shows the average execution time of three runs for each setup.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-single-perf.png&quot; style=&quot;width:85%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;The joins with 1 to 3 GB build side (blue bars) are pure in-memory joins. The other joins partially spill data to disk (4 to 12GB, orange bars). The results show that the performance of Flink’s Hybrid-Hash-Join remains stable as long as the hash table completely fits into memory. As soon as the hash table becomes larger than the working memory, parts of the hash table and corresponding parts of the probe side are spilled to disk. The chart shows that the performance of the Hybrid-Hash-Join gracefully decreases in this situation, i.e., there is no sharp increase in runtime when the join starts spilling. In combination with Flink’s robust memory management, this execution behavior gives smooth performance without the need for fine-grained, data-dependent memory tuning.&lt;/p&gt;
&lt;p&gt;So, Flink’s Hybrid-Hash-Join implementation performs well on a single thread even for limited memory resources, but how good is Flink’s performance when joining larger data sets in a distributed setting? For the next experiment we compare the performance of the most common join strategy combinations, namely:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Broadcast-Forward, Hybrid-Hash-Join (broadcasting and building with the smaller side),&lt;/li&gt;
&lt;li&gt;Repartition, Hybrid-Hash-Join (building with the smaller side), and&lt;/li&gt;
&lt;li&gt;Repartition, Sort-Merge-Join&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;for different input size ratios:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1GB : 1000GB&lt;/li&gt;
&lt;li&gt;10GB : 1000GB&lt;/li&gt;
&lt;li&gt;100GB : 1000GB&lt;/li&gt;
&lt;li&gt;1000GB : 1000GB&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Broadcast-Forward strategy is only executed for up to 10GB. Building a hash table from 100GB broadcasted data in 5GB working memory would result in spilling approximately 95GB (build input) + 950GB (probe input) in each parallel thread and require more than 8TB local disk storage on each machine.&lt;/p&gt;
&lt;p&gt;As in the single-core benchmark, we run 1:N joins, generate the data on-the-fly, and immediately discard the result after the join. We run the benchmark on 10 n1-highmem-8 Google Compute Engine instances. Each instance is equipped with 8 cores, 52GB RAM, 40GB of which are configured as working memory (5GB per core), and one local SSD for spilling to disk. All benchmarks are performed using the same configuration, i.e., no fine tuning for the respective data sizes is done. The programs are executed with a parallelism of 80.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/joins-dist-perf.png&quot; style=&quot;width:70%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;As expected, the Broadcast-Forward strategy performs best for very small inputs because the large probe side is not shipped over the network and is locally joined. However, when the size of the broadcasted side grows, two problems arise: the amount of data which is shipped increases, and each parallel instance also has to process the full broadcasted data set. The performance of both repartitioning strategies behaves similarly for growing input sizes, which indicates that these strategies are mainly limited by the cost of the data transfer (at most 2TB are shipped over the network and joined). Although the Sort-Merge-Join strategy shows the worst performance in all shown cases, it has a right to exist because it can nicely exploit sorted input data.&lt;/p&gt;
&lt;h3 id=&quot;ive-got-sooo-much-data-to-join-do-i-really-need-to-ship-it&quot;&gt;I’ve got sooo much data to join, do I really need to ship it?&lt;/h3&gt;
&lt;p&gt;We have seen that off-the-shelf distributed joins work really well in Flink. But what if your data is so huge that you do not want to shuffle it across your cluster? We recently added some features to Flink for specifying semantic properties (partitioning and sorting) on input splits and co-located reading of local input files. With these tools at hand, it is possible to join pre-partitioned data sets from your local filesystem without sending a single byte over your cluster’s network. If the input data is even pre-sorted, the join can be done as a Sort-Merge-Join without sorting, i.e., the join is essentially done on-the-fly. Exploiting co-location requires a very special setup though. Data needs to be stored on the local filesystem because HDFS does not feature data co-location and might move file blocks across data nodes. That means you need to take care of many things yourself which HDFS would have done for you, including replication to avoid data loss. On the other hand, performance gains of joining co-located and pre-sorted data can be quite substantial.&lt;/p&gt;
&lt;h3 id=&quot;tldr-what-should-i-remember-from-all-of-this&quot;&gt;tl;dr: What should I remember from all of this?&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Flink’s fluent Scala and Java APIs make joins and other data transformations a piece of cake.&lt;/li&gt;
&lt;li&gt;The optimizer does the hard choices for you, but gives you control in case you know better.&lt;/li&gt;
&lt;li&gt;Flink’s join implementations perform very well in-memory and degrade gracefully when going to disk.&lt;/li&gt;
&lt;li&gt;Due to Flink’s robust memory management, there is no need for job- or data-specific memory tuning to avoid a nasty &lt;code&gt;OutOfMemoryException&lt;/code&gt;. It just runs out-of-the-box.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4 id=&quot;references&quot;&gt;References&lt;/h4&gt;
&lt;p&gt;[1] &lt;a href=&quot;http://research.google.com/archive/mapreduce.html&quot;&gt;“MapReduce: Simplified data processing on large clusters”&lt;/a&gt;, Dean, Ghemawat, 2004 &lt;br /&gt;
[2] &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html&quot;&gt;Flink 0.8.1 documentation: Data Transformations&lt;/a&gt; &lt;br /&gt;
[3] &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/dataset_transformations.html#join&quot;&gt;Flink 0.8.1 documentation: Joins&lt;/a&gt; &lt;br /&gt;
[4] &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/programming_guide.html#semantic-annotations&quot;&gt;Flink 0.9-SNAPSHOT documentation: Semantic annotations&lt;/a&gt; &lt;br /&gt;
[5] &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/dataset_transformations.html#join-algorithm-hints&quot;&gt;Flink 0.9-SNAPSHOT documentation: Optimizer join hints&lt;/a&gt; &lt;br /&gt;&lt;/p&gt;
</description>
<pubDate>Fri, 13 Mar 2015 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html</link>
<guid isPermaLink="true">/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html</guid>
</item>
<item>
<title>February 2015 in the Flink community</title>
<description>&lt;p&gt;February might be the shortest month of the year, but this does not
mean that the Flink community has not been busy adding features to the
system and fixing bugs. Here’s a rundown of the activity in the Flink
community last month.&lt;/p&gt;
&lt;h3 id=&quot;release&quot;&gt;0.8.1 release&lt;/h3&gt;
&lt;p&gt;Flink 0.8.1 was released. This bugfixing release resolves a total of 22 issues.&lt;/p&gt;
&lt;h3 id=&quot;new-committer&quot;&gt;New committer&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/mxm&quot;&gt;Max Michels&lt;/a&gt; has been voted a committer by the Flink PMC.&lt;/p&gt;
&lt;h3 id=&quot;flink-adapter-for-apache-samoa&quot;&gt;Flink adapter for Apache SAMOA&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;http://samoa.incubator.apache.org&quot;&gt;Apache SAMOA (incubating)&lt;/a&gt; is a
distributed streaming machine learning (ML) framework with a
programming abstraction for distributed streaming ML algorithms. SAMOA
runs on a variety of backend engines, currently Apache Storm and
Apache S4. A &lt;a href=&quot;https://github.com/apache/incubator-samoa/pull/11&quot;&gt;pull
request&lt;/a&gt; is
available at the SAMOA repository that adds a Flink adapter for SAMOA.&lt;/p&gt;
&lt;h3 id=&quot;easy-flink-deployment-on-google-compute-cloud&quot;&gt;Easy Flink deployment on Google Compute Cloud&lt;/h3&gt;
&lt;p&gt;Flink is now integrated in bdutil, Google’s open source tool for
creating and configuring (Hadoop) clusters in Google Compute
Engine. Deployment of Flink clusters is now supported starting with
&lt;a href=&quot;https://groups.google.com/forum/#!topic/gcp-hadoop-announce/uVJ_6y9cGKM&quot;&gt;bdutil
1.2.0&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;flink-on-the-web&quot;&gt;Flink on the Web&lt;/h3&gt;
&lt;p&gt;A new blog post on &lt;a href=&quot;http://flink.apache.org/news/2015/02/09/streaming-example.html&quot;&gt;Flink
Streaming&lt;/a&gt;
was published at the blog. Flink was mentioned in several articles on
the web. Here are some examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://dataconomy.com/how-flink-became-an-apache-top-level-project/&quot;&gt;How Flink became an Apache Top-Level Project&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://www.linkedin.com/pulse/stale-synchronous-parallelism-new-frontier-apache-flink-nam-luc-tran?utm_content=buffer461af&amp;amp;utm_medium=social&amp;amp;utm_source=linkedin.com&amp;amp;utm_campaign=buffer&quot;&gt;Stale Synchronous Parallelism: The new frontier for Apache Flink?&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.hadoopsphere.com/2015/02/distributed-data-processing-with-apache.html&quot;&gt;Distributed data processing with Apache Flink&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;http://www.hadoopsphere.com/2015/02/ciao-latency-hallo-speed.html&quot;&gt;Ciao latency, hello speed&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;in-the-flink-master&quot;&gt;In the Flink master&lt;/h2&gt;
&lt;p&gt;The following features have been now merged in Flink’s master repository.&lt;/p&gt;
&lt;h3 id=&quot;gelly&quot;&gt;Gelly&lt;/h3&gt;
&lt;p&gt;Gelly, Flink’s Graph API, allows users to manipulate graph-shaped data
directly. Here is, for example, a calculation of shortest paths in a
graph:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;n&quot;&gt;Graph&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Graph&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;fromDataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vertices&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;edges&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Vertex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;singleSourceShortestPaths&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SingleSourceShortestPaths&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Long&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;srcVertexId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;maxIterations&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getVertices&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;See more Gelly examples
&lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-gelly/src/main/java/org/apache/flink/graph/example&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;flink-expressions&quot;&gt;Flink Expressions&lt;/h3&gt;
&lt;p&gt;The newly merged
&lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-table&quot;&gt;flink-table&lt;/a&gt;
module is the first step in Flink’s roadmap towards logical queries
and SQL support. Here’s a preview of how you can read two CSV files,
assign a logical schema to them, and apply transformations like filters and
joins using logical attributes rather than physical data types.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getCustomerDataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;mktSegment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;mktSegment&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;AUTOMOBILE&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getOrdersDataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dateFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;orderDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;before&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;date&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;as&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;orderId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;custId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;orderDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;shipPrio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;items&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;orders&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;customers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;custId&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;===&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;id&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;select&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;orderId&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;orderDate&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;-Symbol&quot;&gt;&amp;#39;shipPrio&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id=&quot;access-to-hcatalog-tables&quot;&gt;Access to HCatalog tables&lt;/h3&gt;
&lt;p&gt;With the &lt;a href=&quot;https://github.com/apache/flink/tree/master/flink-staging/flink-hcatalog&quot;&gt;flink-hcatalog
module&lt;/a&gt;,
you can now conveniently access HCatalog/Hive tables. The module
supports projection (selection and order of fields) and partition
filters.&lt;/p&gt;
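&lt;p&gt;As a rough sketch, reading an HCatalog table into a Flink DataSet could
look like the snippet below. The class and method names
(&lt;code&gt;HCatInputFormat&lt;/code&gt;, &lt;code&gt;getFields&lt;/code&gt;, &lt;code&gt;withFilter&lt;/code&gt;) as well as the database and
table names are assumptions for illustration only; please check the
flink-hcatalog module for the exact API.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;// Sketch only: read an HCatalog/Hive table with projection and a
// partition filter; class and method names are assumptions.
import org.apache.flink.api.scala._
import org.apache.flink.hcatalog.scala.HCatInputFormat
import org.apache.hive.hcatalog.data.HCatRecord

val env = ExecutionEnvironment.getExecutionEnvironment

// Project two fields and push a partition filter down to HCatalog.
val customers = env.createInput(
  new HCatInputFormat[HCatRecord](&quot;default&quot;, &quot;customers&quot;)
    .getFields(&quot;id&quot;, &quot;mktsegment&quot;)
    .withFilter(&quot;region = &#39;EU&#39;&quot;))

customers.first(10).print()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;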
&lt;h3 id=&quot;access-to-secured-yarn-clustershdfs&quot;&gt;Access to secured YARN clusters/HDFS&lt;/h3&gt;
&lt;p&gt;With this change, users can access Kerberos-secured YARN (and HDFS)
Hadoop clusters. Basic support for accessing secured HDFS from a
standalone Flink setup is also now available.&lt;/p&gt;
</description>
<pubDate>Mon, 02 Mar 2015 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/03/02/february-2015-in-flink.html</link>
<guid isPermaLink="true">/news/2015/03/02/february-2015-in-flink.html</guid>
</item>
<item>
<title>Introducing Flink Streaming</title>
<description>&lt;p&gt;This post is the first of a series of blog posts on Flink Streaming,
the recent addition to Apache Flink that makes it possible to analyze
continuous data sources in addition to static files. Flink Streaming
uses the pipelined Flink engine to process data streams in real time
and offers a new API, including the definition of flexible windows.&lt;/p&gt;
&lt;p&gt;In this post, we go through an example that uses the Flink Streaming
API to compute statistics on continuously arriving stock market data and to
combine that market data with Twitter streams.
See the &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html&quot;&gt;Streaming Programming
Guide&lt;/a&gt; for a
detailed presentation of the Streaming API.&lt;/p&gt;
&lt;p&gt;First, we read a bunch of stock price streams and combine them into
one stream of market data. We apply several transformations on this
market data stream, like rolling aggregations per stock. Then we emit
price warning alerts when the prices are rapidly changing. Moving
towards more advanced features, we compute rolling correlations
between the market data streams and a Twitter stream with stock mentions.&lt;/p&gt;
&lt;p&gt;To run the example implementation, please use the &lt;em&gt;0.9-SNAPSHOT&lt;/em&gt;
version of Flink as a dependency. The full example code base can be
found &lt;a href=&quot;https://github.com/mbalassi/flink/blob/stockprices/flink-staging/flink-streaming/flink-streaming-examples/src/main/scala/org/apache/flink/streaming/scala/examples/windowing/StockPrices.scala&quot;&gt;here&lt;/a&gt; in Scala and &lt;a href=&quot;https://github.com/mbalassi/flink/blob/stockprices/flink-staging/flink-streaming/flink-streaming-examples/src/main/java/org/apache/flink/streaming/examples/windowing/StockPrices.java&quot;&gt;here&lt;/a&gt; in Java 7.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;reading-from-multiple-inputs&quot;&gt;Reading from multiple inputs&lt;/h2&gt;
&lt;p&gt;First, let us create the stream of stock prices:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Read a socket stream of stock prices.&lt;/li&gt;
&lt;li&gt;Parse the text in the stream to create a stream of &lt;code&gt;StockPrice&lt;/code&gt; objects.&lt;/li&gt;
&lt;li&gt;Add four other sources tagged with the stock symbol.&lt;/li&gt;
&lt;li&gt;Finally, merge the streams to create a unified stream.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;img alt=&quot;Reading from multiple inputs&quot; src=&quot;/img/blog/blog_multi_input.png&quot; width=&quot;70%&quot; class=&quot;img-responsive center-block&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;getExecutionEnvironment&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Read from a socket stream and map it to StockPrice objects&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socketStockStream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;localhost&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toDouble&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Generate other stock streams&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SPX_Stream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generateStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SPX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;FTSE_Stream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generateStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;FTSE&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;DJI_Stream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generateStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;DJI&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BUX_Stream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generateStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;BUX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Merge all stock streams together&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socketStockStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;SPX_Stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;FTSE_Stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;nc&quot;&gt;DJI_Stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BUX_Stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Stock stream&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StreamExecutionEnvironment&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;StreamExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Read from a socket stream and map it to StockPrice objects&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socketStockStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;socketTextStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;localhost&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9999&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;,&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;parseDouble&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]));&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Generate other stock streams&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SPX_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SPX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FTSE_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;FTSE&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DJI_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;DJI&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BUX_stream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;BUX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Merge all stock streams together&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;socketStockStream&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;merge&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SPX_stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FTSE_stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DJI_stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BUX_stream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Stock stream&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;See
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#sources&quot;&gt;here&lt;/a&gt;
for how you can create streaming sources for Flink Streaming
programs. Flink, of course, supports reading streams from
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#stream-connectors&quot;&gt;external
sources&lt;/a&gt;
such as Apache Kafka, Apache Flume, RabbitMQ, and others. For the sake
of this example, the data streams are simply generated using the
&lt;code&gt;generateStock&lt;/code&gt; method:&lt;/p&gt;
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbols&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;List&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SPX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;FTSE&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DJI&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DJT&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;BUX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DAX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;GOOG&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;generateStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;1000.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextGaussian&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;nc&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArrayList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SYMBOLS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ArrayList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Arrays&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;asList&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;SPX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;FTSE&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DJI&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DJT&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;BUX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;DAX&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;GOOG&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Serializable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;StockPrice{&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol=&amp;#39;&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&amp;#39;\&amp;#39;&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&amp;quot;, count=&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;sc&quot;&gt;&amp;#39;}&amp;#39;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StockSource&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SourceFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sigma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;invoke&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEFAULT_PRICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Random&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;nextGaussian&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sigma&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;nextInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;200&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;To read from the text socket stream, please make sure that you have a
socket running. For the sake of the example, executing the following
command in a terminal does the job. You can get
&lt;a href=&quot;http://netcat.sourceforge.net/&quot;&gt;netcat&lt;/a&gt; here if it is not available
on your machine.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code&gt;nc -lk 9999
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If we execute the program from our IDE, we see the system log messages
and the stock prices being generated:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code&gt;INFO Job execution switched to status RUNNING.
INFO Socket Stream(1/1) switched to SCHEDULED
INFO Socket Stream(1/1) switched to DEPLOYING
INFO Custom Source(1/1) switched to SCHEDULED
INFO Custom Source(1/1) switched to DEPLOYING
1&amp;gt; StockPrice{symbol=&#39;SPX&#39;, count=1011.3405732645239}
2&amp;gt; StockPrice{symbol=&#39;SPX&#39;, count=1018.3381290039248}
1&amp;gt; StockPrice{symbol=&#39;DJI&#39;, count=1036.7454894073978}
3&amp;gt; StockPrice{symbol=&#39;DJI&#39;, count=1135.1170217478427}
3&amp;gt; StockPrice{symbol=&#39;BUX&#39;, count=1053.667523187687}
4&amp;gt; StockPrice{symbol=&#39;BUX&#39;, count=1036.552601487263}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;window-aggregations&quot;&gt;Window aggregations&lt;/h2&gt;
&lt;p&gt;We first compute aggregations on time-based windows of the
data. Flink provides &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#window-operators&quot;&gt;flexible windowing semantics&lt;/a&gt;, where windows can
also be defined based on a count of records or any custom user-defined
logic, as the sketch below illustrates.&lt;/p&gt;
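&lt;p&gt;Here is a minimal count-based sketch, assuming a &lt;code&gt;Count&lt;/code&gt; windowing
helper analogous to the &lt;code&gt;Time&lt;/code&gt; helper used in the example that follows:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot;&gt;//Sketch: statistics over the last 100 records, recomputed every 10 records.
//Count is assumed to sit next to the Time helper among the windowing helpers.
val countWindowedStream = stockStream
  .window(Count.of(100)).every(Count.of(10))
//The usual aggregations apply, e.g. the minimum price within the window:
val lowestOfLast100 = countWindowedStream.minBy(&quot;price&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;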
&lt;p&gt;We partition our stream into windows of 10 seconds and slide the
window every 5 seconds. We compute three statistics every 5 seconds.
The first is the minimum price of all stocks, the second is the
maximum price per stock, and the third is the mean stock price
(using a map window function). Aggregations and groupings can be
performed on named fields of POJOs, making the code more readable.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Basic windowing aggregations&quot; src=&quot;/img/blog/blog_basic_window.png&quot; width=&quot;70%&quot; class=&quot;img-responsive center-block&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Define the desired time window&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;every&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute some simple statistics on a rolling window&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lowest&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;minBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;price&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxByStock&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;maxBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;price&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollingMean&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute the mean of a window&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nonEmpty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;foldLeft&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Define the desired time window&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;WindowedDataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;every&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute some simple statistics on a rolling window&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lowest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;minBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;price&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;maxByStock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;maxBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;price&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollingMean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;windowedStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;WindowMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute the mean of a window&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;WindowMean&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;WindowMapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;hasNext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;price&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;Note that to print a windowed stream, one has to flatten it first,
thereby removing the windowing logic. For example, execute
&lt;code&gt;maxByStock.flatten().print()&lt;/code&gt; to print the stream of maximum prices of
the time windows by stock. In Scala, &lt;code&gt;flatten()&lt;/code&gt; is called implicitly
when needed.&lt;/p&gt;
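&lt;p&gt;As a minimal sketch, the same call from the Java API looks like this (assuming &lt;code&gt;maxByStock&lt;/code&gt; is the windowed stream of maximum prices computed above):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;//Flatten the windowed stream explicitly before printing it
maxByStock.flatten().print();&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;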
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;data-driven-windows&quot;&gt;Data-driven windows&lt;/h2&gt;
&lt;p&gt;The most interesting events in the stream occur when the price of a stock
changes rapidly. We can send a warning whenever a stock price changes by
more than 5% since the last warning. To do that, we use a delta-based window, providing a
threshold that determines when the computation is triggered, a function to
compute the difference, and a default value against which the first record
is compared. We also create a &lt;code&gt;Count&lt;/code&gt; data type to count the warnings
every 30 seconds.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Data-driven windowing semantics&quot; src=&quot;/img/blog/blog_data_driven.png&quot; width=&quot;100%&quot; class=&quot;img-responsive center-block&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultPrice&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Use delta policy to create price change warnings&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceWarnings&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Delta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceChange&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sendWarning&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Count the number of warnings every half a minute&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warningsPerStock&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceWarnings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;count&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceChange&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sendWarning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nonEmpty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ts&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEFAULT_PRICE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1000&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEFAULT_STOCK_PRICE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEFAULT_PRICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Use delta policy to create price change warnings&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceWarnings&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stockStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Delta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.05&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DeltaFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;double&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getDelta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;oldDataPoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newDataPoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;oldDataPoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;newDataPoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;price&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEFAULT_STOCK_PRICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;SendWarning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Count the number of warnings every half a minute&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warningsPerStock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;priceWarnings&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;count&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Count&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Serializable&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;this&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Count{&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol=&amp;#39;&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sc&quot;&gt;&amp;#39;\&amp;#39;&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&amp;quot;, count=&amp;quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
&lt;span class=&quot;sc&quot;&gt;&amp;#39;}&amp;#39;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SendWarning&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MapWindowFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StockPrice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;hasNext&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;iterator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;symbol&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;combining-with-a-twitter-stream&quot;&gt;Combining with a Twitter stream&lt;/h2&gt;
&lt;p&gt;Next, we will read a Twitter stream and correlate it with our stock
price stream. Flink has support for connecting to &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#twitter-streaming-api&quot;&gt;Twitter’s
API&lt;/a&gt;,
but for the sake of this example we generate dummy tweet data.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Social media analytics&quot; src=&quot;/img/blog/blog_social_media.png&quot; width=&quot;100%&quot; class=&quot;img-responsive center-block&quot; /&gt;&lt;/p&gt;
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Read a stream of tweets&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetStream&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;generateTweets&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Extract the stock symbols&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mentionedSymbols&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tweet&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toUpperCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbols&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Count the extracted symbols&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsPerStock&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mentionedSymbols&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;count&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;generateTweets&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;&amp;lt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;yield&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbols&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbols&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mkString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;nc&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nextInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Read a stream of tweets&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetStream&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;TweetSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Extract the stock symbols&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mentionedSymbols&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FlatMapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toUpperCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FilterFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;boolean&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;filter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SYMBOLS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;contains&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Count the extracted symbols&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsPerStock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mentionedSymbols&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;MapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;count&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;TweetSource&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SourceFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Random&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;StringBuilder&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;invoke&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;StringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setLength&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SYMBOLS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;nextInt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SYMBOLS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;())));&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;stringBuilder&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Thread&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;500&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;streaming-joins&quot;&gt;Streaming joins&lt;/h2&gt;
&lt;p&gt;Finally, we join real-time tweets and stock prices and compute a
rolling correlation between the number of price warnings and the
number of mentions of a given stock in the Twitter stream. As both of
these data streams are potentially infinite, we apply the join on a
30-second window.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;Streaming joins&quot; src=&quot;/img/blog/blog_stream_join.png&quot; width=&quot;60%&quot; class=&quot;img-responsive center-block&quot; /&gt;&lt;/p&gt;
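&lt;p&gt;The correlation computed below is the standard Pearson coefficient, i.e. &lt;code&gt;cov(X, Y) / (stddev(X) * stddev(Y))&lt;/code&gt;, with the covariance and the standard deviations estimated as plain averages over the records in the window.&lt;/p&gt;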
&lt;div class=&quot;codetabs&quot;&gt;
&lt;div data-lang=&quot;scala&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-scala&quot; data-lang=&quot;scala&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Join warnings and parsed tweets&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsAndWarning&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warningsPerStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tweetsPerStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;onWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;equalTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollingCorrelation&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsAndWarning&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;computeCorrelation&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rollingCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute rolling correlation&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;computeCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;, &lt;span class=&quot;kt&quot;&gt;Int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nonEmpty&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;var1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;var2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;input&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;_&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d1&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;val&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;average&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;=&amp;gt;&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;div data-lang=&quot;java7&quot;&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot; data-lang=&quot;java&quot;&gt;&lt;span class=&quot;c1&quot;&gt;//Join warnings and parsed tweets&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsAndWarning&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warningsPerStock&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tweetsPerStock&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;onWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;equalTo&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;symbol&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;with&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JoinFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Count&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;first&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;second&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;});&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//Compute rolling correlation&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataStream&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rollingCorrelation&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tweetsAndWarning&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;window&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TimeUnit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;SECONDS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;WindowCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rollingCorrelation&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;WindowCorrelation&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WindowMapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;private&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nd&quot;&gt;@Override&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;mapWindow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Iterable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Double&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//compute mean for both sides, save count&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightSum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftMean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;doubleValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightMean&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightSum&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;doubleValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//compute covariance &amp;amp; std. deviations&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Integer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;leftMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;pow&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pair&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;f1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightMean&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Math&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;sqrt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;collect&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cov&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;leftSd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rightSd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;other-things-to-try&quot;&gt;Other things to try&lt;/h2&gt;
&lt;p&gt;For a full feature overview, please check the &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html&quot;&gt;Streaming Guide&lt;/a&gt;, which describes all the available API features.
You are very welcome to try out our features for different use cases; we are looking forward to hearing about your experiences. Feel free to &lt;a href=&quot;http://flink.apache.org/community.html#mailing-lists&quot;&gt;contact us&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;upcoming-for-streaming&quot;&gt;Upcoming for streaming&lt;/h2&gt;
&lt;p&gt;There are some aspects of Flink Streaming that are subject to change in the next release, which will make this application look even nicer.&lt;/p&gt;
&lt;p&gt;Stay tuned for later blog posts on how Flink Streaming works
internally, fault tolerance, and performance measurements!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;#top&quot;&gt;Back to top&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>Mon, 09 Feb 2015 13:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/02/09/streaming-example.html</link>
<guid isPermaLink="true">/news/2015/02/09/streaming-example.html</guid>
</item>
<item>
<title>January 2015 in the Flink community</title>
<description>&lt;p&gt;Happy 2015! Here is a (hopefully digestible) summary of what happened last month in the Flink community.&lt;/p&gt;
&lt;h3 id=&quot;release&quot;&gt;0.8.0 release&lt;/h3&gt;
&lt;p&gt;Flink 0.8.0 was released. See &lt;a href=&quot;http://flink.apache.org/news/2015/01/21/release-0.8.html&quot;&gt;here&lt;/a&gt; for the release notes.&lt;/p&gt;
&lt;h3 id=&quot;flink-roadmap&quot;&gt;Flink roadmap&lt;/h3&gt;
&lt;p&gt;The community has published a &lt;a href=&quot;https://cwiki.apache.org/confluence/display/FLINK/Flink+Roadmap&quot;&gt;roadmap for 2015&lt;/a&gt; on the Flink wiki. Check it out to see what is coming up in Flink, and pick up an issue to contribute!&lt;/p&gt;
&lt;h3 id=&quot;scaling-als&quot;&gt;Scaling ALS&lt;/h3&gt;
&lt;p&gt;Flink committers employed at &lt;a href=&quot;http://data-artisans.com&quot;&gt;data Artisans&lt;/a&gt; published a &lt;a href=&quot;http://data-artisans.com/computing-recommendations-with-flink.html&quot;&gt;blog post&lt;/a&gt; on how they scaled matrix factorization with Flink and Google Compute Engine to matrices with 28 billion elements.&lt;/p&gt;
&lt;h3 id=&quot;articles-in-the-press&quot;&gt;Articles in the press&lt;/h3&gt;
&lt;p&gt;The Apache Software Foundation &lt;a href=&quot;https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69&quot;&gt;announced&lt;/a&gt; Flink as a Top-Level Project. The announcement was picked up by the media, e.g., &lt;a href=&quot;http://sdtimes.com/inside-apache-software-foundations-newest-top-level-project-apache-flink/?utm_content=11232092&amp;amp;utm_medium=social&amp;amp;utm_source=twitter&quot;&gt;here&lt;/a&gt;, &lt;a href=&quot;http://www.datanami.com/2015/01/12/apache-flink-takes-route-distributed-data-processing/&quot;&gt;here&lt;/a&gt;, and &lt;a href=&quot;http://i-programmer.info/news/197-data-mining/8176-flink-reaches-top-level-status.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;hadoop-summit&quot;&gt;Hadoop Summit&lt;/h3&gt;
&lt;p&gt;A submitted abstract on Flink Streaming &lt;a href=&quot;http://2015.hadoopsummit.org/amsterdam-blog/announcing-the-community-vote-session-winners-for-the-2015-hadoop-summit-europe/&quot;&gt;won the community&lt;/a&gt; vote in the “The Future of Hadoop” track.&lt;/p&gt;
&lt;h3 id=&quot;meetups-and-talks&quot;&gt;Meetups and talks&lt;/h3&gt;
&lt;p&gt;Flink was presented at the &lt;a href=&quot;http://www.meetup.com/Hadoop-User-Group-France/events/219778022/&quot;&gt;Paris Hadoop User Group&lt;/a&gt;, the &lt;a href=&quot;http://www.meetup.com/hadoop/events/167785202/&quot;&gt;Bay Area Hadoop User Group&lt;/a&gt;, the &lt;a href=&quot;http://www.meetup.com/Apache-Tez-User-Group/events/219302692/&quot;&gt;Apache Tez User Group&lt;/a&gt;, and &lt;a href=&quot;https://fosdem.org/2015/schedule/track/graph_processing/&quot;&gt;FOSDEM 2015&lt;/a&gt;. The January &lt;a href=&quot;http://www.meetup.com/Apache-Flink-Meetup/events/219639984/&quot;&gt;Flink meetup in Berlin&lt;/a&gt; had talks on recent community updates and new features.&lt;/p&gt;
&lt;h2 id=&quot;notable-code-contributions&quot;&gt;Notable code contributions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Code contributions listed here may not be part of a release or even the Flink master repository yet.&lt;/p&gt;
&lt;h3 id=&quot;using-off-heap-memoryhttpsgithubcomapacheflinkpull290&quot;&gt;&lt;a href=&quot;https://github.com/apache/flink/pull/290&quot;&gt;Using off-heap memory&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This pull request enables Flink to use off-heap memory for its internal memory uses (sort, hash, caching of intermediate data sets).&lt;/p&gt;
&lt;h3 id=&quot;gelly-flinks-graph-apihttpsgithubcomapacheflinkpull335&quot;&gt;&lt;a href=&quot;https://github.com/apache/flink/pull/335&quot;&gt;Gelly, Flink’s Graph API&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This pull request introduces Gelly, Flink’s brand new Graph API. Gelly offers a native graph programming abstraction with functionality for vertex-centric programming, as well as available graph algorithms. See &lt;a href=&quot;http://www.slideshare.net/vkalavri/largescale-graph-processing-with-apache-flink-graphdevroom-fosdem15&quot;&gt;this slide set&lt;/a&gt; for an overview of Gelly.&lt;/p&gt;
&lt;h3 id=&quot;semantic-annotationshttpsgithubcomapacheflinkpull311&quot;&gt;&lt;a href=&quot;https://github.com/apache/flink/pull/311&quot;&gt;Semantic annotations&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Semantic annotations are a powerful mechanism to expose information about the behavior of Flink functions to Flink’s optimizer. The optimizer can leverage this information to generate more efficient execution plans. For example, the output of a Reduce operator that groups on the second field of a tuple is still partitioned on that field if the Reduce function does not modify the value of the second field. With this information, the optimizer can generate plans that avoid expensive data shuffling and reuse the partitioned output of Reduce. Semantic annotations can be defined for most data types, including (nested) tuples and POJOs. See the snapshot documentation for details (not online yet).&lt;/p&gt;
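&lt;p&gt;Below is a minimal sketch of what such an annotation could look like for the Reduce example above, assuming the &lt;code&gt;ForwardedFields&lt;/code&gt; annotation from this pull request (the exact annotation names are described in the snapshot documentation):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Sketch only: declares that field 1 (the grouping key) passes through unchanged,
// so the optimizer can reuse an existing partitioning on that field.
// Assumes org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields.
@ForwardedFields(&amp;quot;f1&amp;quot;)
public static class SumOnFirstField implements ReduceFunction&amp;lt;Tuple2&amp;lt;Integer, String&amp;gt;&amp;gt; {
  @Override
  public Tuple2&amp;lt;Integer, String&amp;gt; reduce(Tuple2&amp;lt;Integer, String&amp;gt; a, Tuple2&amp;lt;Integer, String&amp;gt; b) {
    // field 0 is aggregated, field 1 is forwarded as-is
    return new Tuple2&amp;lt;Integer, String&amp;gt;(a.f0 + b.f0, a.f1);
  }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;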
&lt;h3 id=&quot;new-yarn-clienthttpsgithubcomapacheflinkpull292&quot;&gt;&lt;a href=&quot;https://github.com/apache/flink/pull/292&quot;&gt;New YARN client&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The improved YARN client of Flink now allows users to deploy Flink on YARN for executing a single job. Older versions only supported a long-running YARN session. The code of the YARN client has been refactored to provide an (internal) Java API for controlling YARN clusters more easily.&lt;/p&gt;
</description>
<pubDate>Wed, 04 Feb 2015 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/02/04/january-in-flink.html</link>
<guid isPermaLink="true">/news/2015/02/04/january-in-flink.html</guid>
</item>
<item>
<title>Apache Flink 0.8.0 available</title>
<description>&lt;p&gt;We are pleased to announce the availability of Flink 0.8.0. This release includes new user-facing features as well as performance and bug fixes, extends the support for filesystems and introduces the Scala API and flexible windowing semantics for Flink Streaming. A total of 33 people have contributed to this release, a big thanks to all of them!&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://www.apache.org/dyn/closer.cgi/flink/flink-0.8.0/flink-0.8.0-bin-hadoop2.tgz&quot;&gt;Download Flink 0.8.0&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&amp;amp;version=12328699&quot;&gt;See the release changelog&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;overview-of-major-new-features&quot;&gt;Overview of major new features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Extended filesystem support&lt;/strong&gt;: The former &lt;code&gt;DistributedFileSystem&lt;/code&gt; interface has been generalized to &lt;code&gt;HadoopFileSystem&lt;/code&gt;, now supporting all subclasses of &lt;code&gt;org.apache.hadoop.fs.FileSystem&lt;/code&gt;. This allows users to use all file systems supported by Hadoop with Apache Flink.
&lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/example_connectors.html&quot;&gt;See connecting to other systems&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming Scala API&lt;/strong&gt;: As an alternative to the existing Java API, streaming programs can now also be written in Scala. The Java and Scala APIs now have the same syntax and transformations and will be kept in sync in every future release.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streaming windowing semantics&lt;/strong&gt;: The new windowing API offers an expressive way to define custom logic for triggering the execution of a stream window and for removing elements. The new features include, among others, out-of-the-box support for windows based on logical or physical time and on data-driven properties of the events themselves (see the sketch after this list). &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.8/streaming_guide.html#window-operators&quot;&gt;Read more here&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Mutable and immutable objects in runtime&lt;/strong&gt;: All Flink versions before 0.8.0 always passed the same objects to functions written by users. This is a common performance optimization, also used in other systems such as Hadoop.
However, this is error-prone for new users because one has to carefully check that references to the object are not kept in the user function. Starting from 0.8.0, Flink allows users to configure a mode that disables this mechanism.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Performance and usability improvements&lt;/strong&gt;: The new Apache Flink 0.8.0 release brings several new features which will significantly improve the performance and the usability of the system. Amongst others, these features include:
&lt;ul&gt;
&lt;li&gt;Improved input split assignment which maximizes computation locality&lt;/li&gt;
&lt;li&gt;Smart broadcasting mechanism which minimizes network I/O&lt;/li&gt;
&lt;li&gt;Custom partitioners which let the user control how the data is partitioned within the cluster. This helps to prevent data skew and allows users to implement highly efficient algorithms.&lt;/li&gt;
&lt;li&gt;coGroup operator now supports group sorting for its inputs&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Kryo is the new fallback serializer&lt;/strong&gt;: Apache Flink has a sophisticated type analysis and serialization framework that is able to handle commonly used types very efficiently.
In addition to that, there is a fallback serializer for types which are not supported. Older versions of Flink used the reflective &lt;a href=&quot;http://avro.apache.org/&quot;&gt;Avro&lt;/a&gt; serializer for that purpose. With this release, Flink uses the powerful &lt;a href=&quot;https://github.com/EsotericSoftware/kryo&quot;&gt;Kryo&lt;/a&gt; and twitter-chill libraries to support types such as Java Collections and Scala-specific types.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hadoop 2.2.0+ is now the default Hadoop dependency&lt;/strong&gt;: With Flink 0.8.0 we made the “hadoop2” build profile the default build for Flink. This means that all users using Hadoop 1 (0.2X or 1.2.X versions) have to specify version “0.8.0-hadoop1” in their pom files.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HBase module updated&lt;/strong&gt;: The HBase version has been updated to 0.98.6.1. Also, HBase is now available in both the Hadoop1 and Hadoop2 profiles of Flink.&lt;/li&gt;
&lt;/ul&gt;
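&lt;p&gt;As a small illustration of the new windowing semantics, here is a minimal sketch that counts the elements arriving within a 30 second window, using the streaming API’s &lt;code&gt;window(Time.of(...))&lt;/code&gt; and &lt;code&gt;mapWindow(...)&lt;/code&gt; operators (the input stream &lt;code&gt;events&lt;/code&gt; is hypothetical):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Sketch only: count elements per 30 second window.
// Assumes a DataStream&amp;lt;Tuple2&amp;lt;String, Integer&amp;gt;&amp;gt; named events.
DataStream&amp;lt;Integer&amp;gt; countsPerWindow = events
  .window(Time.of(30, TimeUnit.SECONDS))
  .mapWindow(new WindowMapFunction&amp;lt;Tuple2&amp;lt;String, Integer&amp;gt;, Integer&amp;gt;() {
    @Override
    public void mapWindow(Iterable&amp;lt;Tuple2&amp;lt;String, Integer&amp;gt;&amp;gt; values, Collector&amp;lt;Integer&amp;gt; out)
        throws Exception {
      int count = 0;
      for (Tuple2&amp;lt;String, Integer&amp;gt; ignored : values) {
        count++;
      }
      out.collect(count);
    }
  });&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;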
&lt;h2 id=&quot;contributors&quot;&gt;Contributors&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Marton Balassi&lt;/li&gt;
&lt;li&gt;Daniel Bali&lt;/li&gt;
&lt;li&gt;Carsten Brandt&lt;/li&gt;
&lt;li&gt;Moritz Borgmann&lt;/li&gt;
&lt;li&gt;Stefan Bunk&lt;/li&gt;
&lt;li&gt;Paris Carbone&lt;/li&gt;
&lt;li&gt;Ufuk Celebi&lt;/li&gt;
&lt;li&gt;Nils Engelbach&lt;/li&gt;
&lt;li&gt;Stephan Ewen&lt;/li&gt;
&lt;li&gt;Gyula Fora&lt;/li&gt;
&lt;li&gt;Gabor Hermann&lt;/li&gt;
&lt;li&gt;Fabian Hueske&lt;/li&gt;
&lt;li&gt;Vasiliki Kalavri&lt;/li&gt;
&lt;li&gt;Johannes Kirschnick&lt;/li&gt;
&lt;li&gt;Aljoscha Krettek&lt;/li&gt;
&lt;li&gt;Suneel Marthi&lt;/li&gt;
&lt;li&gt;Robert Metzger&lt;/li&gt;
&lt;li&gt;Felix Neutatz&lt;/li&gt;
&lt;li&gt;Chiwan Park&lt;/li&gt;
&lt;li&gt;Flavio Pompermaier&lt;/li&gt;
&lt;li&gt;Mingliang Qi&lt;/li&gt;
&lt;li&gt;Shiva Teja Reddy&lt;/li&gt;
&lt;li&gt;Till Rohrmann&lt;/li&gt;
&lt;li&gt;Henry Saputra&lt;/li&gt;
&lt;li&gt;Kousuke Saruta&lt;/li&gt;
&lt;li&gt;Chesney Schepler&lt;/li&gt;
&lt;li&gt;Erich Schubert&lt;/li&gt;
&lt;li&gt;Peter Szabo&lt;/li&gt;
&lt;li&gt;Jonas Traub&lt;/li&gt;
&lt;li&gt;Kostas Tzoumas&lt;/li&gt;
&lt;li&gt;Timo Walther&lt;/li&gt;
&lt;li&gt;Daniel Warneke&lt;/li&gt;
&lt;li&gt;Chen Xu&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Wed, 21 Jan 2015 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/01/21/release-0.8.html</link>
<guid isPermaLink="true">/news/2015/01/21/release-0.8.html</guid>
</item>
<item>
<title>December 2014 in the Flink community</title>
<description>&lt;p&gt;This is the first blog post of a “newsletter” like series where we give a summary of the monthly activity in the Flink community. As the Flink project grows, this can serve as a “tl;dr” for people that are not following the Flink dev and user mailing lists, or those that are simply overwhelmed by the traffic.&lt;/p&gt;
&lt;h3 id=&quot;flink-graduation&quot;&gt;Flink graduation&lt;/h3&gt;
&lt;p&gt;The biggest news is that the Apache board approved Flink as a top-level Apache project! The Flink team is working closely with the Apache press team for an official announcement, so stay tuned for details!&lt;/p&gt;
&lt;h3 id=&quot;new-flink-website&quot;&gt;New Flink website&lt;/h3&gt;
&lt;p&gt;The &lt;a href=&quot;http://flink.apache.org&quot;&gt;Flink website&lt;/a&gt; got a total make-over, both in terms of appearance and content.&lt;/p&gt;
&lt;h3 id=&quot;flink-irc-channel&quot;&gt;Flink IRC channel&lt;/h3&gt;
&lt;p&gt;A new IRC channel called #flink was created at irc.freenode.org. An easy way to access the IRC channel is through the &lt;a href=&quot;http://webchat.freenode.net/&quot;&gt;web client&lt;/a&gt;. Feel free to stop by to ask anything or share your ideas about Apache Flink!&lt;/p&gt;
&lt;h3 id=&quot;meetups-and-talks&quot;&gt;Meetups and Talks&lt;/h3&gt;
&lt;p&gt;Apache Flink was presented at the &lt;a href=&quot;http://www.meetup.com/Netherlands-Hadoop-User-Group/events/218635152&quot;&gt;Amsterdam Hadoop User Group&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;notable-code-contributions&quot;&gt;Notable code contributions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Code contributions listed here may not be part of a release or even the current snapshot yet.&lt;/p&gt;
&lt;h3 id=&quot;streaming-scala-apihttpsgithubcomapacheincubator-flinkpull275&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/275&quot;&gt;Streaming Scala API&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;The Flink Streaming Java API recently got its Scala counterpart. Once merged, Flink Streaming users can use both Scala and Java for their development. The Flink Streaming Scala API is built as a thin layer on top of the Java API, making sure that the APIs are kept easily in sync.&lt;/p&gt;
&lt;h3 id=&quot;intermediate-datasetshttpsgithubcomapacheincubator-flinkpull254&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/254&quot;&gt;Intermediate datasets&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This pull request introduces a major change in the Flink runtime. Currently, the Flink runtime is based on the notion of operators that exchange data through channels. With the PR, intermediate data sets that are produced by operators become first-class citizens in the runtime. While this does not have any user-facing impact yet, it lays the groundwork for a slew of future features such as blocking execution, fine-grained fault-tolerance, and more efficient data sharing between cluster and client.&lt;/p&gt;
&lt;h3 id=&quot;configurable-execution-modehttpsgithubcomapacheincubator-flinkpull259&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/259&quot;&gt;Configurable execution mode&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This pull request allows the user to change the object-reuse behaviour. Before this pull request, some operations would reuse objects passed to the user function while others would always create new objects. This change introduces a system-wide switch and changes all operators to either reuse objects or not.&lt;/p&gt;
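&lt;p&gt;A minimal usage sketch, assuming the switch ends up as an &lt;code&gt;enableObjectReuse()&lt;/code&gt;/&lt;code&gt;disableObjectReuse()&lt;/code&gt; pair on the execution configuration (the exact method names may differ once the pull request is merged):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Sketch only: opt into object reuse for better performance, accepting that
// the same object instances may be handed to user functions repeatedly.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.getConfig().enableObjectReuse();
// env.getConfig().disableObjectReuse(); // always hand out fresh objects&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;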
&lt;h3 id=&quot;distributed-coordination-via-akkahttpsgithubcomapacheincubator-flinkpull149&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/149&quot;&gt;Distributed Coordination via Akka&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Another major change is a complete rewrite of the JobManager / TaskManager components in Scala. In addition to that, the old RPC service was replaced by Actors, using the Akka framework.&lt;/p&gt;
&lt;h3 id=&quot;sorting-of-very-large-recordshttpsgithubcomapacheincubator-flinkpull249-&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/249&quot;&gt;Sorting of very large records&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Flink’s internal sort algorithms were improved to better handle large records (multiple 100s of megabytes or larger). Previously, the system would in some cases hold instances of multiple large records in memory, resulting in high memory consumption and JVM heap thrashing. Through this fix, large records are streamed through the operators, reducing the memory consumption and GC pressure. The system now requires much less memory to support algorithms that work on such large records.&lt;/p&gt;
&lt;h3 id=&quot;kryo-serialization-as-the-new-default-fallbackhttpsgithubcomapacheincubator-flinkpull271&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/271&quot;&gt;Kryo Serialization as the new default fallback&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Flink’s built-in type serialization framework handles all common types very efficiently. Prior versions used Avro to serialize types that the built-in framework could not handle.
Flink’s serialization system has improved a lot over time and by now surpasses the capabilities of Avro in many cases. Kryo now serves as the default fallback serialization framework, supporting a much broader range of types.&lt;/p&gt;
&lt;h3 id=&quot;hadoop-filesystem-supporthttpsgithubcomapacheincubator-flinkpull268&quot;&gt;&lt;a href=&quot;https://github.com/apache/incubator-flink/pull/268&quot;&gt;Hadoop FileSystem support&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This change permits users to use all file systems supported by Hadoop with Flink. In practice this means that users can use Flink with Tachyon, Google Cloud Storage (which also gives Flink out-of-the-box YARN support on Google Compute Engine), FTP, and all the other file system implementations for Hadoop.&lt;/p&gt;
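&lt;p&gt;In practice, such a file system is addressed simply by its URI scheme in sources and sinks. A minimal sketch (the bucket and paths below are hypothetical, and the corresponding Hadoop file system implementations must be on the classpath and configured):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Sketch only: read from Google Cloud Storage and write to an FTP server,
// both backed by Hadoop FileSystem implementations.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet&amp;lt;String&amp;gt; lines = env.readTextFile(&amp;quot;gs://my-bucket/input/data.txt&amp;quot;);
lines.writeAsText(&amp;quot;ftp://user@host/output/data.txt&amp;quot;);
env.execute(&amp;quot;Hadoop FileSystem example&amp;quot;);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;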
&lt;h2 id=&quot;heading-to-the-080-release&quot;&gt;Heading to the 0.8.0 release&lt;/h2&gt;
&lt;p&gt;The community is working hard together with the Apache infra team to migrate the Flink infrastructure to a top-level project. At the same time, the Flink community is working on the Flink 0.8.0 release which should be out very soon.&lt;/p&gt;
</description>
<pubDate>Tue, 06 Jan 2015 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2015/01/06/december-in-flink.html</link>
<guid isPermaLink="true">/news/2015/01/06/december-in-flink.html</guid>
</item>
<item>
<title>Hadoop Compatibility in Flink</title>
<description>&lt;p&gt;&lt;a href=&quot;http://hadoop.apache.org&quot;&gt;Apache Hadoop&lt;/a&gt; is an industry standard for scalable analytical data processing. Many data analysis applications have been implemented as Hadoop MapReduce jobs and run in clusters around the world. Apache Flink can be an alternative to MapReduce and improves it in many dimensions. Among other features, Flink provides much better performance and offers APIs in Java and Scala, which are very easy to use. Similar to Hadoop, Flink’s APIs provide interfaces for Mapper and Reducer functions, as well as Input- and OutputFormats along with many more operators. While being conceptually equivalent, Hadoop’s MapReduce and Flink’s interfaces for these functions are unfortunately not source compatible.&lt;/p&gt;
&lt;h2 id=&quot;flinks-hadoop-compatibility-package&quot;&gt;Flink’s Hadoop Compatibility Package&lt;/h2&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/hcompat-logos.png&quot; style=&quot;width:30%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;p&gt;To close this gap, Flink provides a Hadoop Compatibility package to wrap functions implemented against Hadoop’s MapReduce interfaces and embed them in Flink programs. This package was developed as part of a &lt;a href=&quot;https://developers.google.com/open-source/soc/&quot;&gt;Google Summer of Code&lt;/a&gt; 2014 project.&lt;/p&gt;
&lt;p&gt;With the Hadoop Compatibility package, you can reuse all your Hadoop&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;InputFormats&lt;/code&gt; (mapred and mapreduce APIs)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OutputFormats&lt;/code&gt; (mapred and mapreduce APIs)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Mappers&lt;/code&gt; (mapred API)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Reducers&lt;/code&gt; (mapred API)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;in Flink programs without changing a line of code. Moreover, Flink also natively supports all Hadoop data types (&lt;code&gt;Writables&lt;/code&gt; and &lt;code&gt;WritableComparable&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;The following code snippet shows a simple Flink WordCount program that solely uses Hadoop data types, InputFormat, OutputFormat, Mapper, and Reducer functions.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// Definition of Hadoop Mapper function&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Tokenizer&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Mapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Definition of Hadoop Reducer function&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Counter&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;implements&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Reducer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inputPath&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outputPath&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ExecutionEnvironment&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getExecutionEnvironment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Setup Hadoop’s TextInputFormat&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;HadoopInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hadoopInputFormat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HadoopInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;TextInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;JobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TextInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addInputPath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hadoopInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getJobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputPath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Read a DataSet with the Hadoop InputFormat&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;createInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hadoopInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DataSet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;words&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Wrap Tokenizer Mapper function&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;flatMap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HadoopMapFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Tokenizer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()))&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;groupBy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Wrap Counter Reducer function (used as Reducer and Combiner)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;reduceGroup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HadoopReduceCombineFunction&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;()));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Setup Hadoop’s TextOutputFormat&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;HadoopOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hadoopOutputFormat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HadoopOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TextOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Text&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LongWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;JobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;hadoopOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getJobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;mapred.textoutputformat.separator&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TextOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;setOutputPath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hadoopOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getJobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outputPath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Output &amp;amp; Execute&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;words&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hadoopOutputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;execute&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Hadoop Compat WordCount&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;As you can see, Flink represents Hadoop key-value pairs as &lt;code&gt;Tuple2&amp;lt;key, value&amp;gt;&lt;/code&gt; tuples. Note that the program uses Flink’s &lt;code&gt;groupBy()&lt;/code&gt; transformation to group data on the key field (field 0 of the &lt;code&gt;Tuple2&amp;lt;key, value&amp;gt;&lt;/code&gt;) before it is given to the Reducer function. At the moment, the compatibility package does not evaluate custom Hadoop partitioners, sorting comparators, or grouping comparators.&lt;/p&gt;
&lt;p&gt;Hadoop functions can be used at any position within a Flink program and of course also be mixed with native Flink functions. This means that instead of assembling a workflow of Hadoop jobs in an external driver method or using a workflow scheduler such as &lt;a href=&quot;http://oozie.apache.org&quot;&gt;Apache Oozie&lt;/a&gt;, you can implement an arbitrarily complex Flink program consisting of multiple Hadoop Input- and OutputFormats, Mapper and Reducer functions. When executing such a Flink program, data will be pipelined between your Hadoop functions and will not be written to HDFS just for the purpose of data exchange.&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;/img/blog/hcompat-flow.png&quot; style=&quot;width:100%;margin:15px&quot; /&gt;
&lt;/center&gt;
&lt;h2 id=&quot;what-comes-next&quot;&gt;What comes next?&lt;/h2&gt;
&lt;p&gt;While the Hadoop compatibility package is already very useful, we are currently working on a dedicated Hadoop Job operation to embed and execute Hadoop jobs as a whole in Flink programs, including their custom partitioning, sorting, and grouping code. With this feature, you will be able to chain multiple Hadoop jobs, mix them with Flink functions, and other operations such as &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.7/spargel_guide.html&quot;&gt;Spargel&lt;/a&gt; operations (Pregel/Giraph-style jobs).&lt;/p&gt;
&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
&lt;p&gt;Flink lets you reuse a lot of the code you wrote for Hadoop MapReduce, including all data types, all Input- and OutputFormats, and the Mappers and Reducers of the mapred-API. Hadoop functions can be used within Flink programs and mixed with all other Flink functions. Due to Flink’s pipelined execution, Hadoop functions can be arbitrarily assembled without data exchange via HDFS. Moreover, the Flink community is currently working on a dedicated Hadoop Job operation to support the execution of Hadoop jobs as a whole.&lt;/p&gt;
&lt;p&gt;If you want to use Flink’s Hadoop compatibility package, check out our &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.7/hadoop_compatibility.html&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
</description>
<pubDate>Tue, 18 Nov 2014 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2014/11/18/hadoop-compatibility.html</link>
<guid isPermaLink="true">/news/2014/11/18/hadoop-compatibility.html</guid>
</item>
<item>
<title>Apache Flink 0.7.0 available</title>
<description>&lt;p&gt;We are pleased to announce the availability of Flink 0.7.0. This release includes new user-facing features as well as performance and bug fixes, brings the Scala and Java APIs in sync, and introduces Flink Streaming. A total of 34 people have contributed to this release, a big thanks to all of them!&lt;/p&gt;
&lt;p&gt;Download Flink 0.7.0 &lt;a href=&quot;http://flink.incubator.apache.org/downloads.html&quot;&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;See the release changelog &lt;a href=&quot;https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&amp;amp;version=12327648&quot;&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&quot;overview-of-major-new-features&quot;&gt;Overview of major new features&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Flink Streaming:&lt;/strong&gt; The gem of the 0.7.0 release is undoubtedly Flink Streaming. Available currently in alpha, Flink Streaming provides a Java API on top of Apache Flink that can consume streaming data sources (e.g., from Apache Kafka, Apache Flume, and others) and process them in real time. A dedicated blog post on Flink Streaming and its performance is coming up here soon. You can check out the Streaming programming guide &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.7/streaming_guide.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;New Scala API:&lt;/strong&gt; The Scala API has been completely rewritten. The Java and Scala APIs have now the same syntax and transformations and will be kept from now on in sync in every future release. See the new Scala API &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.7/programming_guide.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Logical key expressions:&lt;/strong&gt; You can now specify grouping and joining keys with logical names for member variables of POJO data types. For example, you can join two data sets as &lt;code&gt;persons.join(cities).where(&amp;quot;zip&amp;quot;).equalTo(&amp;quot;zipcode&amp;quot;)&lt;/code&gt;. Read more &lt;a href=&quot;http://ci.apache.org/projects/flink/flink-docs-release-0.7/programming_guide.html#specifying-keys&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
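&lt;p&gt;As a short sketch of how such field expressions look in context (the &lt;code&gt;Person&lt;/code&gt; and &lt;code&gt;City&lt;/code&gt; types below are made up for illustration; POJOs need public fields or getters/setters and a no-argument constructor):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Hypothetical POJO types, not taken from the release notes.
public class Person { public String name; public String zip; }
public class City { public String zipcode; public String cityName; }

// Join on logical field names instead of tuple positions.
// persons is a DataSet&amp;lt;Person&amp;gt;, cities is a DataSet&amp;lt;City&amp;gt;.
DataSet&amp;lt;Tuple2&amp;lt;Person, City&amp;gt;&amp;gt; joined =
    persons.join(cities).where(&amp;quot;zip&amp;quot;).equalTo(&amp;quot;zipcode&amp;quot;);&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;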
&lt;p&gt;&lt;strong&gt;Hadoop MapReduce compatibility:&lt;/strong&gt; You can run unmodified Hadoop Mappers and Reducers (mapred API) in Flink, use all Hadoop data types, and read data with all Hadoop InputFormats.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Collection-based execution backend:&lt;/strong&gt; The collection-based execution backend enables you to execute a Flink job as a simple Java collections program, bypassing completely the Flink runtime and optimizer. This feature is extremely useful for prototyping, and embedding Flink jobs in projects in a very lightweight manner.&lt;/p&gt;
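&lt;p&gt;A minimal sketch of how this can be used, assuming the &lt;code&gt;createCollectionsEnvironment()&lt;/code&gt; factory method of the Java API (in this API version, &lt;code&gt;print()&lt;/code&gt; is a sink and &lt;code&gt;execute()&lt;/code&gt; triggers the program):&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class CollectionBackendExample {
    public static void main(String[] args) throws Exception {
        // Collection-based backend: the program runs as a plain Java
        // collections program inside the local JVM, bypassing the Flink
        // runtime and optimizer entirely.
        ExecutionEnvironment env = ExecutionEnvironment.createCollectionsEnvironment();

        DataSet&amp;lt;String&amp;gt; words = env.fromElements(&amp;quot;to&amp;quot;, &amp;quot;be&amp;quot;, &amp;quot;or&amp;quot;, &amp;quot;not&amp;quot;, &amp;quot;to&amp;quot;, &amp;quot;be&amp;quot;);
        words.print();
        env.execute(&amp;quot;Collection backend example&amp;quot;);
    }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;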
&lt;p&gt;&lt;strong&gt;Record API deprecated:&lt;/strong&gt; The (old) Stratosphere Record API has been marked as deprecated and is planned for removal in the 0.9.0 release.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;BLOB service:&lt;/strong&gt; This release contains a new service to distribute jar files and other binary data among the JobManager, TaskManagers and the client.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intermediate data sets:&lt;/strong&gt; A major rewrite of the system internals introduces intermediate data sets as first class citizens. The internal state machine that tracks the distributed tasks has also been completely rewritten for scalability. While this is not visible as a user-facing feature yet, it is the foundation for several upcoming exciting features.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Currently, there is limited support for Java 8 lambdas when compiling and running from an IDE. The problem is due to type erasure and whether Java compilers retain type information. We are currently working with the Eclipse and OpenJDK communities to resolve this.&lt;/p&gt;
&lt;h2 id=&quot;contributors&quot;&gt;Contributors&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tamas Ambrus&lt;/li&gt;
&lt;li&gt;Mariem Ayadi&lt;/li&gt;
&lt;li&gt;Marton Balassi&lt;/li&gt;
&lt;li&gt;Daniel Bali&lt;/li&gt;
&lt;li&gt;Ufuk Celebi&lt;/li&gt;
&lt;li&gt;Hung Chang&lt;/li&gt;
&lt;li&gt;David Eszes&lt;/li&gt;
&lt;li&gt;Stephan Ewen&lt;/li&gt;
&lt;li&gt;Judit Feher&lt;/li&gt;
&lt;li&gt;Gyula Fora&lt;/li&gt;
&lt;li&gt;Gabor Hermann&lt;/li&gt;
&lt;li&gt;Fabian Hueske&lt;/li&gt;
&lt;li&gt;Vasiliki Kalavri&lt;/li&gt;
&lt;li&gt;Kristof Kovacs&lt;/li&gt;
&lt;li&gt;Aljoscha Krettek&lt;/li&gt;
&lt;li&gt;Sebastian Kruse&lt;/li&gt;
&lt;li&gt;Sebastian Kunert&lt;/li&gt;
&lt;li&gt;Matyas Manninger&lt;/li&gt;
&lt;li&gt;Robert Metzger&lt;/li&gt;
&lt;li&gt;Mingliang Qi&lt;/li&gt;
&lt;li&gt;Till Rohrmann&lt;/li&gt;
&lt;li&gt;Henry Saputra&lt;/li&gt;
&lt;li&gt;Chesnay Schepler&lt;/li&gt;
&lt;li&gt;Moritz Schubotz&lt;/li&gt;
&lt;li&gt;Hung Sendoh Chang&lt;/li&gt;
&lt;li&gt;Peter Szabo&lt;/li&gt;
&lt;li&gt;Jonas Traub&lt;/li&gt;
&lt;li&gt;Fabian Tschirschnitz&lt;/li&gt;
&lt;li&gt;Artem Tsikiridis&lt;/li&gt;
&lt;li&gt;Kostas Tzoumas&lt;/li&gt;
&lt;li&gt;Timo Walther&lt;/li&gt;
&lt;li&gt;Daniel Warneke&lt;/li&gt;
&lt;li&gt;Tobias Wiens&lt;/li&gt;
&lt;li&gt;Yingjun Wu&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Tue, 04 Nov 2014 11:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2014/11/04/release-0.7.0.html</link>
<guid isPermaLink="true">/news/2014/11/04/release-0.7.0.html</guid>
</item>
<item>
<title>Upcoming Events</title>
<description>&lt;p&gt;We are happy to announce several upcoming Flink events both in Europe and the US. Starting with a &lt;strong&gt;Flink hackathon in Stockholm&lt;/strong&gt; (Oct 8-9) and a talk about Flink at the &lt;strong&gt;Stockholm Hadoop User Group&lt;/strong&gt; (Oct 8). This is followed by the very first &lt;strong&gt;Flink Meetup in Berlin&lt;/strong&gt; (Oct 15). In the US, there will be two Flink Meetup talks: the first one at the &lt;strong&gt;Pasadena Big Data User Group&lt;/strong&gt; (Oct 29) and the second one at &lt;strong&gt;Silicon Valley Hands On Programming Events&lt;/strong&gt; (Nov 4).&lt;/p&gt;
&lt;p&gt;We are looking forward to seeing you at any of these events. The following is an overview of each event and links to the respective Meetup pages.&lt;/p&gt;
&lt;h3 id=&quot;flink-hackathon-stockholm-oct-8-9&quot;&gt;Flink Hackathon, Stockholm (Oct 8-9)&lt;/h3&gt;
&lt;p&gt;The hackathon will take place at KTH/SICS from Oct 8th-9th. You can sign up here: https://docs.google.com/spreadsheet/viewform?formkey=dDZnMlRtZHJ3Z0hVTlFZVjU2MWtoX0E6MA.&lt;/p&gt;
&lt;p&gt;Here is a rough agenda and a list of topics to work upon or look into. Suggestions and more topics are welcome.&lt;/p&gt;
&lt;h4 id=&quot;wednesday-8th&quot;&gt;Wednesday (8th)&lt;/h4&gt;
&lt;p&gt;9:00 - 10:00 Introduction to Apache Flink, System overview, and Dev
environment (by Stephan)&lt;/p&gt;
&lt;p&gt;10:15 - 11:00 Introduction to the topics (Streaming API and system by Gyula
&amp;amp; Marton), (Graphs by Vasia / Martin / Stephan)&lt;/p&gt;
&lt;p&gt;11:00 - 12:30 Happy hacking (part 1)&lt;/p&gt;
&lt;p&gt;12:30 - Lunch (Food will be provided by KTH / SICS. A big thank you to them
and also to Paris, for organizing that)&lt;/p&gt;
&lt;p&gt;13:xx - Happy hacking (part 2)&lt;/p&gt;
&lt;h4 id=&quot;thursday-9th&quot;&gt;Thursday (9th)&lt;/h4&gt;
&lt;p&gt;Happy hacking (continued)&lt;/p&gt;
&lt;h4 id=&quot;suggestions-for-topics&quot;&gt;Suggestions for topics&lt;/h4&gt;
&lt;h5 id=&quot;streaming&quot;&gt;Streaming&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Sample streaming applications (e.g. continuous heavy hitters and topics
on the twitter stream)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implement a simple SQL to Streaming program parser. Possibly using
Apache Calcite (http://optiq.incubator.apache.org/)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implement different windowing methods (count-based, time-based, …)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Implement different windowed operations (windowed-stream-join,
windowed-stream-co-group)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Streaming state, and interaction with other programs (that access state
of a stream program)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&quot;graph-analysis&quot;&gt;Graph Analysis&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Prototype a Graph DSL (simple graph building, filters, graph
properties, some algorithms)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Prototype abstractions for different Graph processing paradigms
(vertex-centric, partition-centric).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generalize the delta iterations, allow flexible state access.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;meetup-hadoop-user-group-talk-stockholm-oct-8&quot;&gt;Meetup: Hadoop User Group Talk, Stockholm (Oct 8)&lt;/h3&gt;
&lt;p&gt;Hosted by Spotify, opens at 6 PM.&lt;/p&gt;
&lt;p&gt;http://www.meetup.com/stockholm-hug/events/207323222/&lt;/p&gt;
&lt;h3 id=&quot;st-flink-meetup-berlin-oct-15&quot;&gt;1st Flink Meetup, Berlin (Oct 15)&lt;/h3&gt;
&lt;p&gt;We are happy to announce the first Flink meetup in Berlin. You are very welcome to sign up and attend. The event will be held in Betahaus Cafe.&lt;/p&gt;
&lt;p&gt;http://www.meetup.com/Apache-Flink-Meetup/events/208227422/&lt;/p&gt;
&lt;h3 id=&quot;meetup-pasadena-big-data-user-group-oct-29&quot;&gt;Meetup: Pasadena Big Data User Group (Oct 29)&lt;/h3&gt;
&lt;p&gt;http://www.meetup.com/Pasadena-Big-Data-Users-Group/&lt;/p&gt;
&lt;h3 id=&quot;meetup-silicon-valley-hands-on-programming-events-nov-4&quot;&gt;Meetup: Silicon Valley Hands On Programming Events (Nov 4)&lt;/h3&gt;
&lt;p&gt;http://www.meetup.com/HandsOnProgrammingEvents/events/210504392/&lt;/p&gt;
</description>
<pubDate>Fri, 03 Oct 2014 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2014/10/03/upcoming_events.html</link>
<guid isPermaLink="true">/news/2014/10/03/upcoming_events.html</guid>
</item>
<item>
<title>Apache Flink 0.6.1 available</title>
<description>&lt;p&gt;We are happy to announce the availability of Flink 0.6.1.&lt;/p&gt;
&lt;p&gt;0.6.1 is a maintenance release, which includes minor fixes across several parts
of the system. We recommend that all Flink users upgrade to this newest version.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/downloads.html&quot;&gt;Download&lt;/a&gt; the release today.&lt;/p&gt;
</description>
<pubDate>Fri, 26 Sep 2014 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2014/09/26/release-0.6.1.html</link>
<guid isPermaLink="true">/news/2014/09/26/release-0.6.1.html</guid>
</item>
<item>
<title>Apache Flink 0.6 available</title>
<description>&lt;p&gt;We are happy to announce the availability of Flink 0.6. This is the
first release of the system inside the Apache Incubator and under the
name Flink. Releases up to 0.5 were under the name Stratosphere, the
academic and open source project that Flink originates from.&lt;/p&gt;
&lt;h2 id=&quot;what-is-flink&quot;&gt;What is Flink?&lt;/h2&gt;
&lt;p&gt;Apache Flink is a general-purpose data processing engine for
clusters. It runs on YARN clusters on top of data stored in Hadoop, as
well as stand-alone. Flink currently has programming APIs in Java and
Scala. Jobs are executed via Flink’s own runtime engine. Flink
features:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Robust in-memory and out-of-core processing:&lt;/strong&gt; once read, data stays
in memory as much as possible, and is gracefully de-staged to disk in
the presence of memory pressure from limited memory or other
applications. The runtime is designed to perform very well both in
setups with abundant memory and in setups where memory is scarce.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;POJO-based APIs:&lt;/strong&gt; when programming, you do not have to pack your
data into key-value pairs or some other framework-specific data
model. Rather, you can use arbitrary Java and Scala types to model
your data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Efficient iterative processing:&lt;/strong&gt; Flink contains explicit “iterate” operators
that enable very efficient loops over data sets, e.g., for machine
learning and graph applications.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A modular system stack:&lt;/strong&gt; Flink is not a direct implementation of its
APIs but a layered system. All programming APIs are translated to an
intermediate program representation that is compiled and optimized
via a cost-based optimizer. Lower-level layers of Flink also expose
programming APIs for extending the system.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Data pipelining/streaming:&lt;/strong&gt; Flink’s runtime is designed as a
pipelined data processing engine rather than a batch processing
engine. Operators do not wait for their predecessors to finish in
order to start processing data. This results in very efficient
handling of large data sets.&lt;/p&gt;
&lt;h2 id=&quot;release-06&quot;&gt;Release 0.6&lt;/h2&gt;
&lt;p&gt;Flink 0.6 builds on the latest Stratosphere 0.5 release. It includes
many bug fixes and improvements that make the system more stable and
robust, as well as breaking API changes.&lt;/p&gt;
&lt;p&gt;The full release notes are available &lt;a href=&quot;https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&amp;amp;version=12327101&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Download the release &lt;a href=&quot;http://flink.incubator.apache.org/downloads.html&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;contributors&quot;&gt;Contributors&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Wilson Cao&lt;/li&gt;
&lt;li&gt;Ufuk Celebi&lt;/li&gt;
&lt;li&gt;Stephan Ewen&lt;/li&gt;
&lt;li&gt;Jonathan Hasenburg&lt;/li&gt;
&lt;li&gt;Markus Holzemer&lt;/li&gt;
&lt;li&gt;Fabian Hueske&lt;/li&gt;
&lt;li&gt;Sebastian Kunert&lt;/li&gt;
&lt;li&gt;Vikhyat Korrapati&lt;/li&gt;
&lt;li&gt;Aljoscha Krettek&lt;/li&gt;
&lt;li&gt;Sebastian Kruse&lt;/li&gt;
&lt;li&gt;Raymond Liu&lt;/li&gt;
&lt;li&gt;Robert Metzger&lt;/li&gt;
&lt;li&gt;Mingliang Qi&lt;/li&gt;
&lt;li&gt;Till Rohrmann&lt;/li&gt;
&lt;li&gt;Henry Saputra&lt;/li&gt;
&lt;li&gt;Chesnay Schepler&lt;/li&gt;
&lt;li&gt;Kostas Tzoumas&lt;/li&gt;
&lt;li&gt;Robert Waury&lt;/li&gt;
&lt;li&gt;Timo Walther&lt;/li&gt;
&lt;li&gt;Daniel Warneke&lt;/li&gt;
&lt;li&gt;Tobias Wiens&lt;/li&gt;
&lt;/ul&gt;
</description>
<pubDate>Tue, 26 Aug 2014 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2014/08/26/release-0.6.html</link>
<guid isPermaLink="true">/news/2014/08/26/release-0.6.html</guid>
</item>
<item>
<title>Stratosphere version 0.5 available</title>
<description>&lt;p&gt;We are happy to announce a new major Stratosphere release, version 0.5. This release adds many new features and improves the interoperability, stability, and performance of the system. The major theme of the release is the completely new Java API that makes it easy to write powerful distributed programs.&lt;/p&gt;
&lt;p&gt;The release can be downloaded from the &lt;a href=&quot;http://stratosphere.eu/downloads/&quot;&gt;Stratosphere website&lt;/a&gt; and from &lt;a href=&quot;https://github.com/stratosphere/stratosphere/releases/tag/release-0.5&quot;&gt;GitHub&lt;/a&gt;. All components are available as Apache Maven dependencies, making it simple to include Stratosphere in other projects. The website provides &lt;a href=&quot;http://stratosphere.eu/docs/0.5/&quot;&gt;extensive documentation&lt;/a&gt; of the system and the new features.&lt;/p&gt;
&lt;h2 id=&quot;shortlist-of-new-features&quot;&gt;Shortlist of new Features&lt;/h2&gt;
&lt;p&gt;Below is a short list of the most important additions to the Stratosphere system.&lt;/p&gt;
&lt;h4 id=&quot;new-java-api&quot;&gt;New Java API&lt;/h4&gt;
&lt;p&gt;This release introduces a completely new &lt;strong&gt;data set-centric Java API&lt;/strong&gt;. This programming model significantly eases the development of Stratosphere programs, supports flexible use of regular Java classes as data types, and adds many new built-in operators to simplify the writing of powerful programs. The result are programs that need less code, are more readable, interoperate better with existing code, and execute faster.&lt;/p&gt;
&lt;p&gt;Take a look at the &lt;a href=&quot;http://stratosphere.eu/docs/0.5/programming_guides/examples_java.html&quot;&gt;examples&lt;/a&gt; to get a feel for the API.&lt;/p&gt;
&lt;h4 id=&quot;general-api-improvements&quot;&gt;General API Improvements&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Broadcast Variables:&lt;/strong&gt; Publish a data set to all instances of another operator. This is handy if your operator depends on the result of another computation, e.g., filtering all values smaller than the average.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distributed Cache:&lt;/strong&gt; Make (local and HDFS) files locally available on each machine processing a task.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Iteration Termination Improvements:&lt;/strong&gt; Iterative algorithms can now terminate based on intermediate data sets, not only through aggregated statistics.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Collection data sources and sinks:&lt;/strong&gt; Speed up the development and testing of Stratosphere programs by reading data from regular Java collections and writing results back into them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;JDBC data sources and sinks:&lt;/strong&gt; Read data from and write data to relational databases using a JDBC driver.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hadoop input format and output format support:&lt;/strong&gt; Read and write data with any Hadoop input or output format.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Support for Avro encoded data:&lt;/strong&gt; Read data that has been materialized using Avro.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Deflate Files:&lt;/strong&gt; Stratosphere now transparently reads &lt;code&gt;.deflate&lt;/code&gt; compressed files.&lt;/p&gt;
&lt;h4 id=&quot;runtime-and-optimizer-improvements&quot;&gt;Runtime and Optimizer Improvements&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;DAG Runtime Streaming:&lt;/strong&gt; Detection and resolution of streaming data flow deadlocks in the data flow optimizer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Intermediate results across iteration boundaries:&lt;/strong&gt; Intermediate results computed outside iterative parts can be used inside iterative parts of the program.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Stability fixes:&lt;/strong&gt; Various stability fixes in both optimizer and runtime.&lt;/p&gt;
&lt;h4 id=&quot;setup--tooling&quot;&gt;Setup &amp;amp; Tooling&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Improved YARN support:&lt;/strong&gt; Many improvements based on user-feedback: Packaging, Permissions, Error handling.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Java 8 compatibility&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&quot;contributors&quot;&gt;Contributors&lt;/h2&gt;
&lt;p&gt;In total, 26 people have contributed to Stratosphere since the last release. Thank you for making this project possible!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Alexander Alexandrov&lt;/li&gt;
&lt;li&gt;Jesus Camacho&lt;/li&gt;
&lt;li&gt;Ufuk Celebi&lt;/li&gt;
&lt;li&gt;Mikhail Erofeev&lt;/li&gt;
&lt;li&gt;Stephan Ewen&lt;/li&gt;
&lt;li&gt;Alexandr Ferodov&lt;/li&gt;
&lt;li&gt;Filip Haase&lt;/li&gt;
&lt;li&gt;Jonathan Hasenberg&lt;/li&gt;
&lt;li&gt;Markus Holzemer&lt;/li&gt;
&lt;li&gt;Fabian Hueske&lt;/li&gt;
&lt;li&gt;Vasia Kalavri&lt;/li&gt;
&lt;li&gt;Aljoscha Krettek&lt;/li&gt;
&lt;li&gt;Rajika Kumarasiri&lt;/li&gt;
&lt;li&gt;Sebastian Kunert&lt;/li&gt;
&lt;li&gt;Aaron Lam&lt;/li&gt;
&lt;li&gt;Robert Metzger&lt;/li&gt;
&lt;li&gt;Faisal Moeen&lt;/li&gt;
&lt;li&gt;Martin Neumann&lt;/li&gt;
&lt;li&gt;Mingliang Qi&lt;/li&gt;
&lt;li&gt;Till Rohrmann&lt;/li&gt;
&lt;li&gt;Chesnay Schepler&lt;/li&gt;
&lt;li&gt;Vyachislav Soludev&lt;/li&gt;
&lt;li&gt;Tuan Trieu&lt;/li&gt;
&lt;li&gt;Artem Tsikiridis&lt;/li&gt;
&lt;li&gt;Timo Walther&lt;/li&gt;
&lt;li&gt;Robert Waury&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;stratosphere-is-going-apache&quot;&gt;Stratosphere is going Apache&lt;/h2&gt;
&lt;p&gt;The Stratosphere project has been accepted to the Apache Incubator and will continue its work under the umbrella of the Apache Software Foundation. Due to a name conflict, we are switching the name of the project. We will make future releases of Stratosphere through the Apache foundation under a new name.&lt;/p&gt;
</description>
<pubDate>Sat, 31 May 2014 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2014/05/31/release-0.5.html</link>
<guid isPermaLink="true">/news/2014/05/31/release-0.5.html</guid>
</item>
<item>
<title>Stratosphere accepted as Apache Incubator Project</title>
<description>&lt;p&gt;We are happy to announce that Stratosphere has been accepted as a project for the &lt;a href=&quot;https://incubator.apache.org/&quot;&gt;Apache Incubator&lt;/a&gt;. The &lt;a href=&quot;https://wiki.apache.org/incubator/StratosphereProposal&quot;&gt;proposal&lt;/a&gt; has been accepted by the Incubator PMC members earlier this week. The Apache Incubator is the first step in the process of giving a project to the &lt;a href=&quot;http://apache.org&quot;&gt;Apache Software Foundation&lt;/a&gt;. While under incubation, the project will move to the Apache infrastructure and adopt the community-driven development principles of the Apache Foundation. Projects can graduate from incubation to become top-level projects if they show activity, a healthy community dynamic, and releases.&lt;/p&gt;
&lt;p&gt;We are glad to have Alan Gates as champion on board, as well as a set of great mentors, including Sean Owen, Ted Dunning, Owen O’Malley, Henry Saputra, and Ashutosh Chauhan. We are confident that we will make this a great open source effort.&lt;/p&gt;
</description>
<pubDate>Wed, 16 Apr 2014 12:00:00 +0200</pubDate>
<link>http://flink.apache.org/news/2014/04/16/stratosphere-goes-apache-incubator.html</link>
<guid isPermaLink="true">/news/2014/04/16/stratosphere-goes-apache-incubator.html</guid>
</item>
<item>
<title>Stratosphere got accepted for Google Summer of Code 2014</title>
<description>&lt;div class=&quot;lead&quot;&gt;Students: Apply now for exciting summer projects in the Big Data / Analytics field&lt;/div&gt;
&lt;p&gt;We are pleased to announce that Stratosphere got accepted to &lt;a href=&quot;http://www.google-melange.com/gsoc/homepage/google/gsoc2014&quot;&gt;Google Summer of Code 2014&lt;/a&gt; as a mentoring organization. This means that we will host a bunch of students to conduct projects within Stratosphere over the summer. &lt;a href=&quot;http://en.flossmanuals.net/GSoCStudentGuide/&quot;&gt;Read more on the GSoC manual for students&lt;/a&gt; and the &lt;a href=&quot;http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2014/help_page&quot;&gt;official FAQ&lt;/a&gt;. Students can improve their coding skills, learn to work with open-source projects, improve their CV and get a nice paycheck from Google.&lt;/p&gt;
&lt;p&gt;If you are an interested student, check out our &lt;a href=&quot;https://github.com/stratosphere/stratosphere/wiki/Google-Summer-of-Code-2014&quot;&gt;idea list&lt;/a&gt; in the wiki. It contains different projects with varying ranges of difficulty and requirement profiles. Students can also suggest their own projects.&lt;/p&gt;
&lt;p&gt;We welcome students to sign up at our &lt;a href=&quot;https://groups.google.com/forum/#!forum/stratosphere-dev&quot;&gt;developer mailing list&lt;/a&gt; to discuss their ideas.
Applying students can use our wiki (create a new page) to create a project proposal. We are happy to have a look at it.&lt;/p&gt;
</description>
<pubDate>Mon, 24 Feb 2014 21:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2014/02/24/stratosphere-google-summer-of-code-2014.html</link>
<guid isPermaLink="true">/news/2014/02/24/stratosphere-google-summer-of-code-2014.html</guid>
</item>
<item>
<title>Use Stratosphere with Amazon Elastic MapReduce</title>
<description>&lt;div class=&quot;lead&quot;&gt;Get started with Stratosphere within 10 minutes using Amazon Elastic MapReduce.&lt;/div&gt;
&lt;p&gt;This step-by-step tutorial will guide you through the setup of Stratosphere using Amazon Elastic MapReduce.&lt;/p&gt;
&lt;h3 id=&quot;background&quot;&gt;Background&lt;/h3&gt;
&lt;p&gt;&lt;a href=&quot;http://aws.amazon.com/elasticmapreduce/&quot;&gt;Amazon Elastic MapReduce&lt;/a&gt; (Amazon EMR) is part of Amazon Web Services. EMR allows you to create Hadoop clusters that analyze data stored in Amazon S3 (AWS’ cloud storage). Stratosphere runs on top of Hadoop using the &lt;a href=&quot;http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/releasenotes.html&quot;&gt;recently&lt;/a&gt; released cluster resource manager &lt;a href=&quot;http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html&quot;&gt;YARN&lt;/a&gt;. YARN allows you to use many different data analysis tools in your cluster side by side. Tools that run on YARN include, for example, &lt;a href=&quot;https://giraph.apache.org/&quot;&gt;Apache Giraph&lt;/a&gt;, &lt;a href=&quot;http://spark.incubator.apache.org/&quot;&gt;Spark&lt;/a&gt;, and &lt;a href=&quot;http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/&quot;&gt;HBase&lt;/a&gt;. Stratosphere also &lt;a href=&quot;/docs/0.4/setup/yarn.html&quot;&gt;runs on YARN&lt;/a&gt; and that’s the approach for this tutorial.&lt;/p&gt;
&lt;h3 id=&quot;step-login-to-aws-and-prepare-secure-access&quot;&gt;1. Step: Login to AWS and prepare secure access&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Log in to the &lt;a href=&quot;https://console.aws.amazon.com/console/home&quot;&gt;AWS Console&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You need to have SSH keys to access the Hadoop master node. If you do not have keys for your computer, generate them:&lt;/p&gt;
&lt;div class=&quot;row&quot; style=&quot;padding-top:15px&quot;&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;a data-lightbox=&quot;example-1&quot; href=&quot;/img/blog/emr-security.png&quot;&gt;&lt;img class=&quot;img-responsive&quot; src=&quot;/img/blog/emr-security.png&quot; /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;ul&gt;
&lt;li&gt;Select &lt;a href=&quot;https://console.aws.amazon.com/ec2/v2/home&quot;&gt;EC2&lt;/a&gt; and click on &quot;Key Pairs&quot; in the &quot;NETWORK &amp;amp; SECURITY&quot; section.&lt;/li&gt;
&lt;li&gt;Click on &quot;Create Key Pair&quot; and give it a name&lt;/li&gt;
&lt;li&gt;After pressing &quot;Yes&quot; it will download a .pem file.&lt;/li&gt;
&lt;li&gt;Change the permissions of the .pem file&lt;/li&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;chmod og-rwx ~/work-laptop.pem&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;h3 id=&quot;step-create-your-hadoop-cluster-in-the-cloud&quot;&gt;2. Step: Create your Hadoop Cluster in the cloud&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Select &lt;a href=&quot;https://console.aws.amazon.com/elasticmapreduce/vnext/&quot;&gt;Elastic MapReduce&lt;/a&gt; from the AWS console&lt;/li&gt;
&lt;li&gt;Click the blue “Create cluster” button.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;row&quot; style=&quot;padding-top:15px&quot;&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;a data-lightbox=&quot;example-1&quot; href=&quot;/img/blog/emr-hadoopversion.png&quot;&gt;&lt;img class=&quot;img-responsive&quot; src=&quot;/img/blog/emr-hadoopversion.png&quot; /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;ul&gt;
&lt;li&gt;Choose a Cluster name&lt;/li&gt;
&lt;li&gt;You can leave the other settings (termination protection, logging, debugging) unchanged&lt;/li&gt;
&lt;li&gt;For the Hadoop distribution, it is very important to choose one with YARN support. We use &lt;b&gt;3.0.3 (Hadoop 2.2.0)&lt;/b&gt; (the minor version might change over time)&lt;/li&gt;
&lt;li&gt;Remove all applications to be installed (unless you want to use them)&lt;/li&gt;
&lt;li&gt;Choose the instance types you want to start. Stratosphere runs fine with m1.large instances. Core and Task instances both run Stratosphere, but only core instances contain HDFS data nodes.&lt;/li&gt;
&lt;li&gt;Choose the &lt;b&gt;EC2 key pair&lt;/b&gt; you&#39;ve created in the previous step!&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;That’s it! You can now press the “Create cluster” button at the end of the form to boot it!&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;step-launch-stratosphere&quot;&gt;3. Step: Launch Stratosphere&lt;/h3&gt;
&lt;p&gt;You might need to wait a few minutes until Amazon has started your cluster. (You can monitor the progress of the instances in EC2.) Use the refresh button in the top right corner.&lt;/p&gt;
&lt;p&gt;The master is up once the field &lt;b&gt;Master public DNS&lt;/b&gt; contains a value (first line). Connect to it using SSH.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;ssh hadoop@&amp;lt;your master public DNS&amp;gt; -i &amp;lt;path to your .pem&amp;gt;
&lt;span class=&quot;c&quot;&gt;# for my example, it looks like this:&lt;/span&gt;
ssh hadoop@ec2-54-213-61-105.us-west-2.compute.amazonaws.com -i ~/Downloads/work-laptop.pem&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;(Windows users have to follow &lt;a href=&quot;http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-ssh.html&quot;&gt;these instructions&lt;/a&gt; to SSH into the machine running the master.)&lt;br /&gt;&lt;br /&gt;
Once connected to the master, download and start Stratosphere for YARN:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download and extract Stratosphere-YARN&lt;/li&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;wget http://stratosphere-bin.s3-website-us-east-1.amazonaws.com/stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz
&lt;span class=&quot;c&quot;&gt;# extract it&lt;/span&gt;
tar xvzf stratosphere-dist-0.5-SNAPSHOT-yarn.tar.gz&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;li&gt;Start Stratosphere in the cluster using Hadoop YARN&lt;/li&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;stratosphere-yarn-0.5-SNAPSHOT/
./bin/yarn-session.sh -n &lt;span class=&quot;m&quot;&gt;4&lt;/span&gt; -jm &lt;span class=&quot;m&quot;&gt;1024&lt;/span&gt; -tm 3000&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
The arguments have the following meaning:
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;-n&lt;/code&gt; number of TaskManagers (=workers). This number must not exceed the number of task instances&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-jm&lt;/code&gt; memory (heapspace) for the JobManager&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-tm&lt;/code&gt; memory for the TaskManagers&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;p&gt;Once the output has changed from&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;JobManager is now running on N/A:6123&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;to&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;JobManager is now running on ip-172-31-13-68.us-west-2.compute.internal:6123&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Stratosphere has started the JobManager. It will take a few seconds until the TaskManagers (workers) have connected to the JobManager. To see how many TaskManagers have connected, you have to access the JobManager’s web interface. Follow the steps below to do that …&lt;/p&gt;
&lt;h3&gt; 4. Step: Launch a Stratosphere Job&lt;/h3&gt;
&lt;p&gt;This step shows how to submit and monitor a Stratosphere Job in the Amazon Cloud.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt; Open an additional terminal and connect again to the master of your cluster. &lt;/li&gt;
We recommend creating a SOCKS proxy over SSH that allows you to easily connect into the cluster. (If you already have a VPN set up with EC2, you can probably use that as well.)
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;ssh -D localhost:2001 hadoop@&amp;lt;your master dns name&amp;gt; -i &amp;lt;your pem file&amp;gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
Notice the &lt;code&gt;-D localhost:2001&lt;/code&gt; argument: It opens a SOCKS proxy on your computer allowing any application to use it to communicate through the proxy via an SSH tunnel to the master node. This allows you to access all services in your EMR cluster, such as the HDFS NameNode or the YARN web interface.
&lt;li&gt;Configure a browser to use the SOCKS proxy. Open a browser with SOCKS proxy support (such as Firefox). Ideally, do not use your primary browser for this, since ALL traffic will be routed through Amazon.&lt;/li&gt;
&lt;div class=&quot;row&quot; style=&quot;padding-top:15px&quot;&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;a data-lightbox=&quot;example-1&quot; href=&quot;/img/blog/emr-firefoxsettings.png&quot;&gt;&lt;img class=&quot;img-responsive&quot; src=&quot;/img/blog/emr-firefoxsettings.png&quot; /&gt;&lt;/a&gt;
&lt;/div&gt;
&lt;div class=&quot;col-md-6&quot;&gt;
&lt;ul&gt;
&lt;li&gt;To configure the SOCKS proxy with Firefox, click on &quot;Edit&quot;, &quot;Preferences&quot;, choose the &quot;Advanced&quot; tab and press the &quot;Settings ...&quot; button.&lt;/li&gt;
&lt;li&gt;Enter the details of the SOCKS proxy &lt;b&gt;localhost:2001&lt;/b&gt;. Choose SOCKS v4.&lt;/li&gt;
&lt;li&gt;Close the settings, your browser is now talking to the master node of your cluster&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/ul&gt;
&lt;p&gt;Since you’re connected to the master now, you can open several web interfaces: &lt;br /&gt;
&lt;b&gt;YARN Resource Manager&lt;/b&gt;: &lt;code&gt;http://&amp;lt;masterIPAddress&amp;gt;:9026/&lt;/code&gt; &lt;br /&gt;
&lt;b&gt;HDFS NameNode&lt;/b&gt;: &lt;code&gt;http://&amp;lt;masterIPAddress&amp;gt;:9101/&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You find the &lt;code&gt;masterIPAddress&lt;/code&gt; by entering &lt;code&gt;ifconfig&lt;/code&gt; into the terminal:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;hadoop@ip-172-31-38-95 ~&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;ifconfig
eth0 Link encap:Ethernet HWaddr 02:CF:8E:CB:28:B2
inet addr:172.31.38.95 Bcast:172.31.47.255 Mask:255.255.240.0
inet6 addr: fe80::cf:8eff:fecb:28b2/64 Scope:Link
RX bytes:166314967 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;158.6 MiB&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; TX bytes:89319246 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;85.1 MiB&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Optional:&lt;/strong&gt; If you want to use the hostnames within your Firefox (that also makes the NameNode links work), you have to enable DNS resolution over the SOCKS proxy. Open the Firefox config &lt;code&gt;about:config&lt;/code&gt; and set &lt;code&gt;network.proxy.socks_remote_dns&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The YARN ResourceManager also allows you to connect to &lt;b&gt;Stratosphere’s JobManager web interface&lt;/b&gt;. Click the &lt;b&gt;ApplicationMaster&lt;/b&gt; link in the “Tracking UI” column.&lt;/p&gt;
&lt;p&gt;To run the Wordcount example, you have to upload some sample data.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# download a text&lt;/span&gt;
wget http://www.gnu.org/licenses/gpl.txt
&lt;span class=&quot;c&quot;&gt;# upload it to HDFS:&lt;/span&gt;
hadoop fs -copyFromLocal gpl.txt /input&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;To run a Job, enter the following command into the master’s command line:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;c&quot;&gt;# optional: go to the extracted directory&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;stratosphere-yarn-0.5-SNAPSHOT/
&lt;span class=&quot;c&quot;&gt;# run the wordcount example&lt;/span&gt;
./bin/stratosphere run -w -j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar -a &lt;span class=&quot;m&quot;&gt;16&lt;/span&gt; hdfs:///input hdfs:///output&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Make sure that the expected number of TaskManagers has connected to the JobManager.&lt;/p&gt;
&lt;p&gt;Let’s go through the command in detail:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;./bin/stratosphere&lt;/code&gt; is the standard launcher for Stratosphere jobs from the command line&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;-w&lt;/code&gt; flag stands for “wait”. It is very useful for tracking the progress of the job.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-j examples/stratosphere-java-examples-0.5-SNAPSHOT-WordCount.jar&lt;/code&gt; the &lt;code&gt;-j&lt;/code&gt; option sets the jar file containing the job. If you have your own application, place your jar file here.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-a 16 hdfs:///input hdfs:///output&lt;/code&gt; the &lt;code&gt;-a&lt;/code&gt; option specifies the job-specific arguments. In this case, the wordcount expects the following arguments: &lt;code&gt;&amp;lt;numSubTasks&amp;gt; &amp;lt;input&amp;gt; &amp;lt;output&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can monitor the progress of your job in the JobManager web interface. Once the job has finished (which should be the case after less than 10 seconds), you can analyze it there.
Inspect the result in HDFS using:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;hadoop fs -tail /output&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;If you want to shut down the whole cluster in the cloud, use Amazon’s web interface and click on “Terminate cluster”. If you just want to stop the YARN session, press CTRL+C in the terminal. The Stratosphere instances will be killed by YARN.&lt;/p&gt;
</description>
<pubDate>Tue, 18 Feb 2014 20:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html</link>
<guid isPermaLink="true">/news/2014/02/18/amazon-elastic-mapreduce-cloud-yarn.html</guid>
</item>
<item>
<title>Accessing Data Stored in MongoDB with Stratosphere</title>
<description>&lt;p&gt;We recently merged a &lt;a href=&quot;https://github.com/stratosphere/stratosphere/pull/437&quot;&gt;pull request&lt;/a&gt; that allows you to use any existing Hadoop &lt;a href=&quot;http://developer.yahoo.com/hadoop/tutorial/module5.html#inputformat&quot;&gt;InputFormat&lt;/a&gt; with Stratosphere. So you can now (in the &lt;code&gt;0.5-SNAPSHOT&lt;/code&gt; and upwards versions) define a Hadoop-based data source:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;n&quot;&gt;HadoopDataSource&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;source&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;HadoopDataSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;TextInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;JobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Input Lines&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TextInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;addInputPath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getJobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;Path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataInput&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In the following article, we describe how to access data stored in &lt;a href=&quot;http://www.mongodb.org/&quot;&gt;MongoDB&lt;/a&gt; with Stratosphere. This allows users to join data from multiple sources (e.g. MongoDB and HDFS) or perform machine learning with the documents stored in MongoDB.&lt;/p&gt;
&lt;p&gt;The approach here is to use the &lt;code&gt;MongoInputFormat&lt;/code&gt; that was developed for Apache Hadoop but now also runs with Stratosphere.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;n&quot;&gt;JobConf&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;JobConf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;mongo.input.uri&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;mongodb://localhost:27017/enron_mail.messages&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;HadoopDataSource&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;HadoopDataSource&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;MongoInputFormat&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conf&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;quot;Read from Mongodb&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;new&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;WritableWrapperConverter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;());&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3 id=&quot;example-program&quot;&gt;Example Program&lt;/h3&gt;
&lt;p&gt;The example program reads data from the &lt;a href=&quot;http://www.cs.cmu.edu/~enron/&quot;&gt;enron dataset&lt;/a&gt; that contains about 500k internal e-mails. The data is stored in MongoDB and the Stratosphere program counts the number of e-mails per day.&lt;/p&gt;
&lt;p&gt;The complete code of this sample program is available on &lt;a href=&quot;https://github.com/stratosphere/stratosphere-mongodb-example&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&quot;prepare-mongodb-and-the-data&quot;&gt;Prepare MongoDB and the Data&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Install MongoDB&lt;/li&gt;
&lt;li&gt;Download the enron dataset from &lt;a href=&quot;http://mongodb-enron-email.s3-website-us-east-1.amazonaws.com/&quot;&gt;their website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Unpack and load it&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;bunzip2 enron_mongo.tar.bz2
tar xvf enron_mongo.tar
mongorestore dump/enron_mail/messages.bson&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;We used &lt;a href=&quot;http://robomongo.org/&quot;&gt;Robomongo&lt;/a&gt; to visually examine the dataset stored in MongoDB.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog/robomongo.png&quot; style=&quot;width:90%;margin:15px&quot; /&gt;&lt;/p&gt;
&lt;h4 id=&quot;build-mongoinputformat&quot;&gt;Build &lt;code&gt;MongoInputFormat&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;MongoDB offers an InputFormat for Hadoop on their &lt;a href=&quot;https://github.com/mongodb/mongo-hadoop&quot;&gt;GitHub page&lt;/a&gt;. The code is not available in any Maven repository, so we have to build the jar file on our own.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Check out the repository&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code&gt;git clone https://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Set the appropriate Hadoop version in the &lt;code&gt;build.sbt&lt;/code&gt;; we used &lt;code&gt;1.1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;hadoopRelease in ThisBuild :&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&amp;quot;1.1&amp;quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Build the input format&lt;/li&gt;
&lt;/ul&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;./sbt package&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The jar-file is now located in &lt;code&gt;core/target&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id=&quot;the-stratosphere-program&quot;&gt;The Stratosphere Program&lt;/h4&gt;
&lt;p&gt;Now we have everything prepared to run the Stratosphere program. I only ran it on my local computer, out of Eclipse. To do that, check out the code …&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;git clone https://github.com/stratosphere/stratosphere-mongodb-example.git&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;… and import it as a Maven project into your Eclipse. You have to manually add the previously built mongo-hadoop jar-file as a dependency.
You can now press the “Run” button and see how Stratosphere executes the little program. It ran for about 8 seconds on the 1.5 GB dataset.&lt;/p&gt;
&lt;p&gt;The result (located in &lt;code&gt;/tmp/enronCountByDay&lt;/code&gt;) now looks like this.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code&gt;11,Fri Sep 26 10:00:00 CEST 1997
154,Tue Jun 29 10:56:00 CEST 1999
292,Tue Aug 10 12:11:00 CEST 1999
185,Thu Aug 12 18:35:00 CEST 1999
26,Fri Mar 19 12:33:00 CET 1999
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;There is one thing left I want to point out here. MongoDB represents objects stored in the database as JSON documents. Since Stratosphere’s standard types do not support JSON documents, I used the &lt;code&gt;WritableWrapper&lt;/code&gt; here. This wrapper allows you to use any Hadoop data type with Stratosphere.&lt;/p&gt;
&lt;p&gt;The following code example shows how the JSON-documents are accessed in Stratosphere.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Record&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Collector&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;throws&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Exception&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Writable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;valWr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;WritableWrapper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;class&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BSONWritable&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BSONWritable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;valWr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Object&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;headers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getDoc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;headers&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BasicDBObject&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;headerOb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BasicDBObject&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;date&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;headerOb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Date&amp;quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// further date processing&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Please use the comments if you have questions or if you want to showcase your own MongoDB-Stratosphere integration.&lt;/p&gt;
</description>
<pubDate>Tue, 28 Jan 2014 10:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2014/01/28/querying_mongodb.html</link>
<guid isPermaLink="true">/news/2014/01/28/querying_mongodb.html</guid>
</item>
<item>
<title>Optimizer Plan Visualization Tool</title>
<description>&lt;p&gt;Stratosphere’s hybrid approach combines &lt;strong&gt;MapReduce&lt;/strong&gt; and &lt;strong&gt;MPP database&lt;/strong&gt; techniques. One central part of this approach is to have a &lt;strong&gt;separation between the programming (API) and the way programs are executed&lt;/strong&gt; &lt;em&gt;(execution plans)&lt;/em&gt;. The &lt;strong&gt;compiler/optimizer&lt;/strong&gt; decides the details concerning caching or when to partition/broadcast with a holistic view of the program. The same program may actually be executed differently in different scenarios (input data of different sizes, different number of machines).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you want to know how exactly the system executes your program, you can find it out in two ways&lt;/strong&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;The &lt;strong&gt;browser-based webclient UI&lt;/strong&gt;, which takes programs packaged into JARs and draws the execution plan as a visual data flow (check out the &lt;a href=&quot;http://stratosphere.eu/docs/0.4/program_execution/web_interface.html&quot;&gt;documentation&lt;/a&gt; for details).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For &lt;strong&gt;programs using the &lt;a href=&quot;http://stratosphere.eu/docs/0.4/program_execution/local_executor.html&quot;&gt;Local-&lt;/a&gt; or &lt;a href=&quot;http://stratosphere.eu/docs/0.4/program_execution/remote_executor.html&quot;&gt;Remote Executor&lt;/a&gt;&lt;/strong&gt;, you can get the optimizer plan using the method &lt;code&gt;LocalExecutor.optimizerPlanAsJSON(plan)&lt;/code&gt; (see the sketch after this list). The &lt;strong&gt;resulting JSON&lt;/strong&gt; string describes the execution strategies chosen by the optimizer. Naturally, you do not want to parse that yourself, especially for longer programs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
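&lt;p&gt;For illustration, here is a small sketch of the second option. Only the &lt;code&gt;LocalExecutor.optimizerPlanAsJSON()&lt;/code&gt; call is taken from the description above; the package names and the surrounding helper are assumptions for the 0.5-SNAPSHOT builds.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Assumed package names for the 0.5-SNAPSHOT line.
import eu.stratosphere.api.common.Plan;
import eu.stratosphere.client.LocalExecutor;

import java.io.FileWriter;

public class PlanDumper {
    // Writes the optimizer plan of a program to a JSON file that can be
    // pasted into tools/planVisualizer.html.
    public static void dumpPlan(Plan plan, String targetFile) throws Exception {
        String json = LocalExecutor.optimizerPlanAsJSON(plan);
        FileWriter out = new FileWriter(targetFile);
        out.write(json);
        out.close();
    }
}&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;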
&lt;p&gt;The builds &lt;em&gt;0.5-SNAPSHOT&lt;/em&gt; and later come with a &lt;strong&gt;tool that visualizes the JSON&lt;/strong&gt; string. It is a standalone version of the webclient’s visualization, packaged as an HTML document &lt;code&gt;tools/planVisualizer.html&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;If you open it in a browser (for example &lt;code&gt;chromium-browser tools/planVisualizer.html&lt;/code&gt;) it shows a text area where you can paste the JSON string and it renders that string as a dataflow plan (assuming it was a valid JSON string and plan). The pictures below show how that looks for the &lt;a href=&quot;https://github.com/stratosphere/stratosphere/blob/release-0.4/stratosphere-examples/stratosphere-java-examples/src/main/java/eu/stratosphere/example/java/record/connectedcomponents/WorksetConnectedComponents.java?source=cc&quot;&gt;included sample program&lt;/a&gt; that uses delta iterations to compute the connected components of a graph.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog/plan_visualizer1.png&quot; style=&quot;width:100%;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/img/blog/plan_visualizer2.png&quot; style=&quot;width:100%;&quot; /&gt;&lt;/p&gt;
</description>
<pubDate>Sun, 26 Jan 2014 10:00:00 +0100</pubDate>
<link>http://flink.apache.org/news/2014/01/26/optimizer_plan_visualization_tool.html</link>
<guid isPermaLink="true">/news/2014/01/26/optimizer_plan_visualization_tool.html</guid>
</item>
<item>
<title>Stratosphere 0.4 Released</title>
<description>&lt;p&gt;We are pleased to announce that version 0.4 of the Stratosphere system has been released.&lt;/p&gt;
&lt;p&gt;Our team has been working hard during the last few months to create an improved and stable Stratosphere version. The new version comes with many new features and with usability and performance improvements at all levels, including a new Scala API for the concise specification of programs, a Pregel-like API, support for YARN clusters, and major performance improvements. The system now features first-class support for iterative programs and thus covers traditional analytical use cases as well as data mining and graph processing use cases with great performance.&lt;/p&gt;
&lt;p&gt;In the course of the transition from v0.2 to v0.4 of the system, we have changed pre-existing APIs based on valuable user feedback. This means that, in the interest of easier programming, we have broken backwards compatibility and existing jobs must be adapted, as described in &lt;a href=&quot;/blog/tutorial/2014/01/12/0.4-migration-guide.html&quot;&gt;the migration guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This article will guide you through the feature list of the new release.&lt;/p&gt;
&lt;h3 id=&quot;scala-programming-interface&quot;&gt;Scala Programming Interface&lt;/h3&gt;
&lt;p&gt;The new Stratosphere version comes with a new programming API in Scala that supports very fluent and efficient programs that can be expressed with very few lines of code. The API uses Scala’s native type system (no special boxed data types) and supports grouping and joining on types beyond key/value pairs. We use code analysis and code generation to transform Scala’s data model to the Stratosphere runtime. Stratosphere Scala programs are optimized before execution by Stratosphere’s optimizer just like Stratosphere Java programs.&lt;/p&gt;
&lt;p&gt;Learn more about the Scala API at the &lt;a href=&quot;/docs/0.4/programming_guides/scala.html&quot;&gt;Scala Programming Guide&lt;/a&gt;&lt;/p&gt;
&lt;h3 id=&quot;iterations&quot;&gt;Iterations&lt;/h3&gt;
&lt;p&gt;Stratosphere v0.4 introduces deep support for iterative algorithms, required by a large class of advanced analysis algorithms. In contrast to most other systems, “looping over the data” is done inside the system’s runtime, rather than in the client. Individual iterations (supersteps) can be as fast as sub-second times. Loop-invariant data is automatically cached in memory.&lt;/p&gt;
&lt;p&gt;We support a special form of iterations called “delta iterations” that selectively modify only some elements of the intermediate solution in each iteration. These are applicable to a variety of applications, e.g., use cases of Apache Giraph. We have observed speedups of 70x when using delta iterations instead of regular iterations.&lt;/p&gt;
&lt;p&gt;Read more about the new iteration feature in &lt;a href=&quot;/docs/0.4/programming_guides/iterations.html&quot;&gt;the documentation&lt;/a&gt;&lt;/p&gt;
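&lt;p&gt;To make the solution set / workset idea concrete, here is a small, self-contained Java sketch of delta-iteration semantics for connected components. It deliberately does not use the Stratosphere API; it only mimics, in a single JVM, how each superstep touches only the elements that changed in the previous one.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Conceptual illustration of delta iterations (not the Stratosphere API).
public class DeltaIterationSketch {
    public static void main(String[] args) {
        // Tiny graph as an edge list of {source, target} pairs.
        long[][] edges = { {0, 1}, {1, 2}, {3, 4} };
        // Solution set: the current component id of each vertex.
        long[] component = { 0, 1, 2, 3, 4 };
        // Workset: the vertices that changed and still need processing.
        boolean[] workset = { true, true, true, true, true };

        boolean anyActive = true;
        while (anyActive) {                       // one loop pass = one superstep
            boolean[] next = new boolean[component.length];
            for (long[] e : edges) {
                int src = (int) e[0], dst = (int) e[1];
                if (workset[src] || workset[dst]) {   // only active vertices do work
                    long min = Math.min(component[src], component[dst]);
                    if (component[dst] != min) { component[dst] = min; next[dst] = true; }
                    if (component[src] != min) { component[src] = min; next[src] = true; }
                }
            }
            workset = next;                       // the delta becomes the next workset
            anyActive = false;
            for (boolean b : workset) { anyActive = anyActive || b; }
        }
        System.out.println(java.util.Arrays.toString(component));
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;In Stratosphere the same pattern runs distributed, and the shrinking workset is exactly where the large speedups over regular iterations come from.&lt;/p&gt;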
&lt;h3 id=&quot;hadoop-yarn-support&quot;&gt;Hadoop YARN Support&lt;/h3&gt;
&lt;p&gt;YARN (Yet Another Resource Negotiator) is the major new feature of the recently announced &lt;a href=&quot;http://hadoop.apache.org/docs/r2.2.0/&quot;&gt;Hadoop 2.2&lt;/a&gt;. It allows existing clusters to be shared among different runtimes, so you can run MapReduce alongside Storm and other frameworks. With the 0.4 release, Stratosphere supports YARN.
Follow &lt;a href=&quot;/docs/0.4/setup/yarn.html&quot;&gt;our guide&lt;/a&gt; on how to start a Stratosphere YARN session.&lt;/p&gt;
&lt;h3 id=&quot;improved-scripting-language-meteor&quot;&gt;Improved Scripting Language Meteor&lt;/h3&gt;
&lt;p&gt;The high-level language Meteor now natively serializes JSON trees for greater performance and offers additional operators and file formats. We made it easier to write concise scripts by adding second-order functions, multi-output operators, and other syntactic sugar. For developers of Meteor packages, the API is much more comprehensive and allows defining custom data types that can be easily embedded in JSON trees through ad-hoc byte code generation.&lt;/p&gt;
&lt;h3 id=&quot;spargel-pregel-inspired-graph-processing&quot;&gt;Spargel: Pregel Inspired Graph Processing&lt;/h3&gt;
&lt;p&gt;Spargel is a vertex-centric API similar to the interface proposed in Google’s Pregel paper and implemented in Apache Giraph. Spargel is implemented in 500 lines of code (including comments) on top of Stratosphere’s delta iterations feature. This confirms the flexibility of Stratosphere’s architecture.&lt;/p&gt;
&lt;h3 id=&quot;web-frontend&quot;&gt;Web Frontend&lt;/h3&gt;
&lt;p&gt;Using the new web frontend, you can monitor the progress of Stratosphere jobs. For finished jobs, the frontend shows a breakdown of the execution times for each operator. The webclient also visualizes the execution strategies chosen by the optimizer.&lt;/p&gt;
&lt;h3 id=&quot;accumulators&quot;&gt;Accumulators&lt;/h3&gt;
&lt;p&gt;Stratosphere’s accumulators allow program developers to compute simple statistics, such as counts, sums, min/max values, or histograms, as a side effect of the processing functions. An example application would be to count the total number of records/tuples processed by a function. Stratosphere will not launch additional tasks (reducers), but will compute the number “on the fly” as a side-product of the function’s application to the data. The concept is similar to Hadoop’s counters, but supports more types of aggregation.&lt;/p&gt;
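&lt;p&gt;The following toy Java sketch illustrates the idea. It does not use the Stratosphere accumulator classes; it only shows how a count can be produced as a side effect of the user function and merged afterwards, which is what the system does for you across the parallel instances of a function.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Conceptual sketch of the accumulator idea (not the Stratosphere API).
public class AccumulatorSketch {

    // Stand-in for a user-defined function that counts the records it sees.
    static class CountingMapper {
        long recordsSeen = 0;                      // local count of this parallel instance

        String map(String record) {
            recordsSeen++;                         // side effect, no extra reduce task needed
            return record.toUpperCase();
        }
    }

    public static void main(String[] args) {
        String[] partition1 = { &quot;a&quot;, &quot;b&quot; };
        String[] partition2 = { &quot;c&quot; };

        CountingMapper m1 = new CountingMapper();  // one instance per parallel task
        CountingMapper m2 = new CountingMapper();
        for (String r : partition1) { m1.map(r); }
        for (String r : partition2) { m2.map(r); }

        // The framework merges the partial counts for you; here we do it by hand.
        long total = m1.recordsSeen + m2.recordsSeen;
        System.out.println(&quot;records processed: &quot; + total);
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;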
&lt;h3 id=&quot;refactored-apis&quot;&gt;Refactored APIs&lt;/h3&gt;
&lt;p&gt;Based on valuable user feedback, we refactored the Java programming interface to make it more intuitive and easier to use. The basic concepts are still the same; however, the naming of most interfaces changed and the structure of the code was adapted. When updating to the 0.4 release, you will need to adapt your jobs and dependencies. A previous blog post has a guide to the necessary changes to adapt programs to Stratosphere 0.4.&lt;/p&gt;
&lt;h3 id=&quot;local-debugging&quot;&gt;Local Debugging&lt;/h3&gt;
&lt;p&gt;You can now test and debug Stratosphere jobs locally. The &lt;a href=&quot;/docs/0.4/program_execution/local_executor.html&quot;&gt;LocalExecutor&lt;/a&gt; allows executing Stratosphere jobs from within IDEs. The same code that runs on clusters also runs multi-threaded in a single JVM. This mode supports the full debugging capabilities known from regular applications (placing breakpoints and stepping through the program’s functions). An advanced mode supports simulating fully distributed operation locally.&lt;/p&gt;
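&lt;p&gt;A minimal sketch of such an IDE entry point is shown below. The class and method names are given as an illustration of the pattern (&lt;code&gt;MyJob&lt;/code&gt; stands in for your own program); please check the LocalExecutor documentation linked above for the exact API.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Sketch only: MyJob is a placeholder; check the LocalExecutor docs for the exact API.
import eu.stratosphere.api.common.Plan;
import eu.stratosphere.client.LocalExecutor;

public class DebugLocally {
    public static void main(String[] args) throws Exception {
        // Assemble the same plan you would otherwise submit to a cluster.
        Plan plan = new MyJob().getPlan(args);

        // Runs the plan multi-threaded inside this JVM, so breakpoints placed
        // in your functions are hit just like in any other Java application.
        LocalExecutor.execute(plan);
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;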
&lt;h3 id=&quot;miscellaneous&quot;&gt;Miscellaneous&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The configuration of Stratosphere has been changed to YAML&lt;/li&gt;
&lt;li&gt;HBase support&lt;/li&gt;
&lt;li&gt;JDBC Input format&lt;/li&gt;
&lt;li&gt;Improved Windows Compatibility: Batch-files to start Stratosphere on Windows and all unit tests passing on Windows.&lt;/li&gt;
&lt;li&gt;Stratosphere is available in Maven Central and Sonatype Snapshot Repository&lt;/li&gt;
&lt;li&gt;Improved build system that supports different Hadoop versions using Maven profiles&lt;/li&gt;
&lt;li&gt;Maven Archetypes for Stratosphere Jobs.&lt;/li&gt;
&lt;li&gt;Stability and Usability improvements with many bug fixes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;download-and-get-started-with-stratosphere-v04&quot;&gt;Download and get started with Stratosphere v0.4&lt;/h3&gt;
&lt;p&gt;There are several options for getting started with Stratosphere.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Download it on the &lt;a href=&quot;/downloads&quot;&gt;download page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Start your program with the &lt;a href=&quot;/quickstart/&quot;&gt;Quick-start guides&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Complete &lt;a href=&quot;/docs/0.4/&quot;&gt;documentation and set-up guides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&quot;tell-us-what-you-think&quot;&gt;Tell us what you think!&lt;/h3&gt;
&lt;p&gt;Are you using, or planning to use, Stratosphere? Sign up for our &lt;a href=&quot;https://groups.google.com/forum/#!forum/stratosphere-dev&quot;&gt;mailing list&lt;/a&gt; and drop us a line.&lt;/p&gt;
&lt;p&gt;Have you found a bug? &lt;a href=&quot;https://github.com/stratosphere/stratosphere&quot;&gt;Post an issue&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;p&gt;Follow us on &lt;a href=&quot;https://twitter.com/stratosphere_eu&quot;&gt;Twitter&lt;/a&gt; and &lt;a href=&quot;https://github.com/stratosphere/stratosphere&quot;&gt;GitHub&lt;/a&gt; to stay in touch with the latest news!&lt;/p&gt;
</description>
<pubDate>Mon, 13 Jan 2014 21:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2014/01/13/stratosphere-release-0.4.html</link>
<guid isPermaLink="true">/news/2014/01/13/stratosphere-release-0.4.html</guid>
</item>
<item>
<title>Stratosphere Version 0.4 Migration Guide</title>
<description>&lt;p&gt;This guide is intended to help users of previous Stratosphere versions to migrate their programs to the new API of v0.4.&lt;/p&gt;
&lt;p&gt;Versions &lt;code&gt;0.4-rc1&lt;/code&gt;, &lt;code&gt;0.4&lt;/code&gt;, and all newer versions have the new API. If you want the most recent version before the code change, please set the version to &lt;code&gt;0.4-alpha.3-SNAPSHOT&lt;/code&gt;. (Note that the &lt;code&gt;0.4-alpha&lt;/code&gt; versions are only available in the snapshot repository.)&lt;/p&gt;
&lt;h4 id=&quot;maven-dependencies&quot;&gt;Maven Dependencies&lt;/h4&gt;
&lt;p&gt;Since we also reorganized the Maven project structure, existing programs need to update the Maven dependencies to &lt;code&gt;stratosphere-java&lt;/code&gt; (and &lt;code&gt;stratosphere-clients&lt;/code&gt;, for examples and executors).&lt;/p&gt;
&lt;p&gt;The typical set of Maven dependencies for Stratosphere Java programs is:&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-diff&quot;&gt; &amp;lt;groupId&amp;gt;eu.stratosphere&amp;lt;/groupId&amp;gt;
&lt;span class=&quot;gd&quot;&gt;- &amp;lt;artifactId&amp;gt;pact-common&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;- &amp;lt;version&amp;gt;0.4-SNAPSHOT&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+ &amp;lt;artifactId&amp;gt;stratosphere-java&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+ &amp;lt;version&amp;gt;0.4&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;- &amp;lt;artifactId&amp;gt;pact-clients&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gd&quot;&gt;- &amp;lt;version&amp;gt;0.4-SNAPSHOT&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+ &amp;lt;artifactId&amp;gt;stratosphere-clients&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class=&quot;gi&quot;&gt;+ &amp;lt;version&amp;gt;0.4&amp;lt;/version&amp;gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h4 id=&quot;renamed-classes&quot;&gt;Renamed classes&lt;/h4&gt;
&lt;p&gt;We renamed many of the most commonly used classes to make their names more intuitive:&lt;/p&gt;
&lt;table class=&quot;table table-striped&quot;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old Name (before &lt;code&gt;0.4&lt;/code&gt;)&lt;/th&gt;
&lt;th&gt;New Name (&lt;code&gt;0.4&lt;/code&gt; and after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Contract&lt;/td&gt;
&lt;td&gt;Operator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MatchContract&lt;/td&gt;
&lt;td&gt;JoinOperator&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;[Map, Reduce, ...]Stub&lt;/td&gt;
&lt;td&gt;[Map, Reduce, ...]Function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MatchStub&lt;/td&gt;
&lt;td&gt;JoinFunction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pact[Integer, Double, ...]&lt;/td&gt;
&lt;td&gt;IntValue, DoubleValue, ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PactRecord&lt;/td&gt;
&lt;td&gt;Record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PlanAssembler&lt;/td&gt;
&lt;td&gt;Program&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PlanAssemblerDescription&lt;/td&gt;
&lt;td&gt;ProgramDescription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RecordOutputFormat&lt;/td&gt;
&lt;td&gt;CsvOutputFormat&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Package names have been adapted as well.
For a complete overview of the renamings, have a look at &lt;a href=&quot;https://github.com/stratosphere/stratosphere/issues/257&quot;&gt;issue #257 on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We suggest that Eclipse users adjust their programs as follows: delete all old Stratosphere imports, then rename the classes (&lt;code&gt;PactRecord&lt;/code&gt; to &lt;code&gt;Record&lt;/code&gt; and so on). Finally, use the “Organize Imports” function (&lt;code&gt;CTRL+SHIFT+O&lt;/code&gt;) to choose the right imports. The names should be unique, so always pick the classes that are in the &lt;code&gt;eu.stratosphere&lt;/code&gt; package.&lt;/p&gt;
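&lt;p&gt;As a small illustration of the renamings, consider a record with a single integer field. The commented-out lines show the old names, the live code the new ones; the import paths are illustrative, so let “Organize Imports” confirm them as described above.&lt;/p&gt;
&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;// Illustrative before/after for the 0.4 renamings; verify the imports in your IDE.
import eu.stratosphere.types.IntValue;
import eu.stratosphere.types.Record;

public class RenamedTypes {
    public static void main(String[] args) {
        // Before 0.4:
        // PactRecord record = new PactRecord();
        // record.setField(0, new PactInteger(42));

        // 0.4 and later:
        Record record = new Record();                    // was PactRecord
        record.setField(0, new IntValue(42));            // was PactInteger
        IntValue value = record.getField(0, IntValue.class);
        System.out.println(value.getValue());
    }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;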
&lt;p&gt;Please contact us in the comments below, on the mailing list or on GitHub if you have any issues migrating to the latest Stratosphere release.&lt;/p&gt;
</description>
<pubDate>Sun, 12 Jan 2014 20:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2014/01/12/0.4-migration-guide.html</link>
<guid isPermaLink="true">/news/2014/01/12/0.4-migration-guide.html</guid>
</item>
<item>
<title>Stratosphere got accepted to the Hadoop Summit Europe in Amsterdam</title>
<description>&lt;p&gt;The Stratosphere team is proud to announce that it is going to present at the &lt;a href=&quot;http://hadoopsummit.org/amsterdam/&quot;&gt;Hadoop Summit 2014 in Amsterdam&lt;/a&gt; on April 2-3. Our talk “Big Data looks tiny from Stratosphere” is part of the “Future of Hadoop” Track. The talk abstract already made it into the top 5 in the &lt;a href=&quot;https://hadoopsummit.uservoice.com/forums/196822-future-of-apache-hadoop/filters/top&quot;&gt;Community Vote&lt;/a&gt; that took place by the end of last year.&lt;/p&gt;
</description>
<pubDate>Fri, 10 Jan 2014 11:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2014/01/10/stratosphere-hadoop-summit.html</link>
<guid isPermaLink="true">/news/2014/01/10/stratosphere-hadoop-summit.html</guid>
</item>
<item>
<title>Stratosphere wins award at Humboldt Innovation Competition &quot;Big Data: Research meets Startups&quot;</title>
<description> &lt;p&gt; Stratosphere won second place in
the &lt;a href=&quot;http://www.humboldt-innovation.de/de/newsdetail/News/View/Forum%2BJunge%2BSpitzenforscher%2BBIG%2BData%2B%2BResearch%2Bmeets%2BStartups-123.html&quot;&gt;competition&lt;/a&gt;
organized by Humboldt Innovation on &quot;Big Data: Research meets
Startups,&quot; where several research projects were evaluated by a
panel of experts from the Berlin startup ecosystem. The award
includes a monetary prize of 10,000 euros.
&lt;/p&gt;
&lt;p&gt;We are extremely excited about this award, as it further
showcases the relevance of the Stratosphere platform and Big Data
technology in general for the technology startup world.
&lt;/p&gt;
</description>
<pubDate>Fri, 13 Dec 2013 15:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2013/12/13/humboldt-innovation-award.html</link>
<guid isPermaLink="true">/news/2013/12/13/humboldt-innovation-award.html</guid>
</item>
<item>
<title>Paper &quot;All Roads Lead to Rome: Optimistic Recovery for Distributed Iterative Data Processing&quot; accepted at CIKM 2013</title>
<description>&lt;p&gt;Our paper ““All Roads Lead to Rome:” Optimistic Recovery for Distributed
Iterative Data Processing” authored by Sebastian Schelter, Kostas
Tzoumas, Stephan Ewen and Volker Markl has been accepted at the
ACM International Conference on Information and Knowledge Management
(CIKM 2013) in San Francisco.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Executing data-parallel iterative algorithms on large datasets is
crucial for many advanced analytical applications in the fields of data
mining and machine learning. Current systems for executing iterative
tasks in large clusters typically achieve fault tolerance through
rollback recovery. The principle behind this pessimistic approach is to
periodically checkpoint the algorithm state. Upon failure, the system
restores a consistent state from a previously written checkpoint and
resumes execution from that point.&lt;/p&gt;
&lt;p&gt;We propose an optimistic recovery mechanism using algorithmic
compensations. Our method leverages the robust, self-correcting nature
of a large class of fixpoint algorithms used in data mining and machine
learning, which converge to the correct solution from various
intermediate consistent states. In the case of a failure, we apply a
user-defined compensate function that algorithmically creates such a
consistent state, instead of rolling back to a previous checkpointed
state. Our optimistic recovery does not checkpoint any state and hence
achieves optimal failure-free performance with respect to the overhead
necessary for guaranteeing fault tolerance. We illustrate the
applicability of this approach for three wide classes of problems.
Furthermore, we show how to implement the proposed optimistic recovery
mechanism in a data flow system. Similar to the Combine operator in
MapReduce, our proposed functionality is optional and can be applied to
increase performance without changing the semantics of programs. In an
experimental evaluation on large datasets, we show that our proposed
approach provides optimal failure-free performance. In the absence of
failures our optimistic scheme is able to outperform a pessimistic
approach by a factor of two to five. In the presence of failures, our
approach provides fast recovery and outperforms pessimistic approaches
in the majority of cases.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;/assets/papers/optimistic.pdf&quot;&gt;Download the paper [PDF]&lt;/a&gt;&lt;/p&gt;
</description>
<pubDate>Mon, 21 Oct 2013 11:57:18 +0200</pubDate>
<link>http://flink.apache.org/news/2013/10/21/cikm2013-paper.html</link>
<guid isPermaLink="true">/news/2013/10/21/cikm2013-paper.html</guid>
</item>
<item>
<title>Demo Paper &quot;Large-Scale Social-Media Analytics on Stratosphere&quot; Accepted at WWW 2013</title>
<description> &lt;p&gt;Our demo submission&lt;br /&gt;
&lt;strong&gt;&lt;cite&gt;&quot;Large-Scale Social-Media Analytics on Stratosphere&quot;&lt;/cite&gt;&lt;/strong&gt;&lt;br /&gt;
by Christoph Boden, Marcel Karnstedt, Miriam Fernandez and Volker Markl&lt;br /&gt;
has been accepted for WWW 2013 in Rio de Janeiro, Brazil.&lt;/p&gt;
&lt;p&gt;Visit our demo, and talk to us if you are attending WWW 2013.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br /&gt;
The importance of social-media platforms and online communities - in business as well as public context - is more and more acknowledged and appreciated by industry and researchers alike. Consequently, a wide range of analytics has been proposed to understand, steer, and exploit the mechanics and laws driving their functionality and creating the resulting benefits. However, analysts usually face significant problems in scaling existing and novel approaches to match the data volume and size of modern online communities. In this work, we propose and demonstrate the usage of the massively parallel data processing system Stratosphere, based on second-order functions as an extended notion of the MapReduce paradigm, to provide a new level of scalability to such social-media analytics. Based on the popular example of role analysis, we present and illustrate how this massively parallel approach can be leveraged to scale out complex data-mining tasks, while providing a programming approach that eases the formulation of complete analytical workflows.&lt;/p&gt;
</description>
<pubDate>Wed, 27 Mar 2013 15:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2013/03/27/www-demo-paper.html</link>
<guid isPermaLink="true">/news/2013/03/27/www-demo-paper.html</guid>
</item>
<item>
<title>ICDE 2013 Demo Preview</title>
<description> &lt;p&gt;This is a preview of our demo that will be presented at ICDE 2013 in Brisbane.&lt;br /&gt;
The demo shows how static code analysis can be leveraged to reorder UDF operators in data flow programs.&lt;/p&gt;
&lt;p&gt;Detailed information can be found in our papers which are available on the &lt;a href=&quot;/publications&quot;&gt;publication&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;&lt;iframe width=&quot;420&quot; height=&quot;315&quot; src=&quot;http://www.youtube.com/embed/ZYwCMgPXFVE&quot; frameborder=&quot;0&quot; allowfullscreen&gt;&lt;/iframe&gt;&lt;/p&gt;</description>
<pubDate>Wed, 21 Nov 2012 15:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2012/11/21/previewICDE2013.html</link>
<guid isPermaLink="true">/news/2012/11/21/previewICDE2013.html</guid>
</item>
<item>
<title>Stratosphere Demo Paper Accepted for BTW 2013</title>
<description> &lt;p&gt;Our demo submission&lt;br /&gt;
&lt;strong&gt;&lt;cite&gt;&quot;Applying Stratosphere for Big Data Analytics&quot;&lt;/cite&gt;&lt;/strong&gt;&lt;br /&gt;
has been accepted for BTW 2013 in Magdeburg, Germany.&lt;br /&gt;
The demo focuses on Stratosphere&#39;s query language Meteor, which has been presented in our paper &lt;cite&gt;&quot;Meteor/Sopremo: An Extensible Query Language and Operator Model&quot;&lt;/cite&gt; &lt;a href=&quot;/assets/papers/Sopremo_Meteor BigData.pdf&quot;&gt;[pdf]&lt;/a&gt; at the BigData workshop associated with VLDB 2012 in Istanbul.&lt;/p&gt;
&lt;p&gt;Visit our demo, and talk to us if you are going to attend BTW 2013.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br /&gt;
Analyzing big data sets as they occur in modern business and science applications requires query languages that allow for the specification of complex data processing tasks. Moreover, these ideally declarative query specifications have to be optimized, parallelized and scheduled for processing on massively parallel data processing platforms. This paper demonstrates the application of Stratosphere to different kinds of Big Data Analytics tasks. Using examples from different application domains, we show how to formulate analytical tasks as Meteor queries and execute them with Stratosphere. These examples include data cleansing and information extraction tasks, and a correlation analysis of microblogging and stock trade volume data that we describe in detail in this paper.&lt;/p&gt;
</description>
<pubDate>Mon, 12 Nov 2012 15:57:18 +0100</pubDate>
<link>http://flink.apache.org/news/2012/11/12/btw2013demo.html</link>
<guid isPermaLink="true">/news/2012/11/12/btw2013demo.html</guid>
</item>
<item>
<title>Stratosphere Demo Accepted for ICDE 2013</title>
<description> &lt;p&gt;Our demo submission&lt;br /&gt;
&lt;strong&gt;&lt;cite&gt;&quot;Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs&quot;&lt;/cite&gt;&lt;/strong&gt;&lt;br /&gt;
has been accepted for ICDE 2013 in Brisbane, Australia.&lt;br /&gt;
The demo illustrates the contributions of our VLDB 2012 paper &lt;cite&gt;&quot;Opening the Black Boxes in Data Flow Optimization&quot;&lt;/cite&gt; &lt;a href=&quot;/assets/papers/optimizationOfDataFlowsWithUDFs_13.pdf&quot;&gt;[PDF]&lt;/a&gt; and &lt;a href=&quot;/assets/papers/optimizationOfDataFlowsWithUDFs_poster_13.pdf&quot;&gt;[Poster PDF]&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Visit our poster, enjoy the demo, and talk to us if you are going to attend ICDE 2013.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt;&lt;br /&gt;
Data flows are a popular abstraction to define data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMS, MapReduce-style UDFs have less strict templates. These templates do not alone provide all the information needed to decide whether they can be reordered with relational operators and other UDFs. However, it is well-known that reordering operators such as filters, joins, and aggregations can yield runtime improvements by orders of magnitude.&lt;br /&gt;
We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs which is used to reason about the reorderbility of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers including filter and aggregation push-down, bushy join orders, and choice of physical execution strategies based on interesting properties.&lt;br /&gt;
We demonstrate our optimizer and a job submission client that allows users to peek step-by-step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.&lt;/p&gt;
</description>
<pubDate>Mon, 15 Oct 2012 16:57:18 +0200</pubDate>
<link>http://flink.apache.org/news/2012/10/15/icde2013.html</link>
<guid isPermaLink="true">/news/2012/10/15/icde2013.html</guid>
</item>
<item>
<title>Version 0.2 Released</title>
<description>&lt;p&gt;We are happy to announce that version 0.2 of the Stratosphere System has been released. It has a lot of performance improvements as well as a bunch of exciting new features like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The new Sopremo Algebra Layer and the Meteor Scripting Language&lt;/li&gt;
&lt;li&gt;The whole new tuple data model for the PACT API&lt;/li&gt;
&lt;li&gt;Fault tolerance through local checkpoints&lt;/li&gt;
&lt;li&gt;A ton of performance improvements on all layers&lt;/li&gt;
&lt;li&gt;Support for plug-ins on the data flow channel layer&lt;/li&gt;
&lt;li&gt;Many new library classes (for example new Input-/Output-Formats)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For a complete list of new features, check out the &lt;a href=&quot;https://stratosphere.eu/wiki/doku.php/wiki:changesrelease0.2&quot;&gt;change log&lt;/a&gt;.&lt;/p&gt;</description>
<pubDate>Tue, 21 Aug 2012 16:57:18 +0200</pubDate>
<link>http://flink.apache.org/news/2012/08/21/release02.html</link>
<guid isPermaLink="true">/news/2012/08/21/release02.html</guid>
</item>
</channel>
</rss>