| --- |
| date: "2015-04-13T10:00:00Z" |
| title: Announcing Flink 0.9.0-milestone1 preview release |
| aliases: |
| - /news/2015/04/13/release-0.9.0-milestone1.html |
| --- |
| |
The Apache Flink community is pleased to announce the availability of
the 0.9.0-milestone-1 release. This preview of the upcoming 0.9.0
release contains many of the new features that will ship with 0.9.
Interested users are encouraged to try it out and give feedback. As
the version number indicates, this is a preview release and contains
known issues.
| |
| You can download the release |
| [here](http://flink.apache.org/downloads.html#preview) and check out the |
| latest documentation |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/). Feedback |
| through the Flink [mailing |
| lists](http://flink.apache.org/community.html#mailing-lists) is, as |
| always, very welcome! |
| |
| ## New Features |
| |
| ### Table API |
| |
| Flink’s new Table API offers a higher-level abstraction for |
| interacting with structured data sources. The Table API allows users |
| to execute logical, SQL-like queries on distributed data sets while |
| allowing them to freely mix declarative queries with regular Flink |
| operators. Here is an example that groups and joins two tables: |
| |
| ```scala |
val clickCounts = clicks
  .groupBy('userId).select('userId, 'url.count as 'count)
| |
| val activeUsers = users.join(clickCounts) |
| .where('id === 'userId && 'count > 10).select('username, 'count, ...) |
| ``` |
| |
Tables consist of logical attributes that are selected by name rather
than by the fields of physical Java and Scala data types. This removes
a lot of boilerplate code from common ETL tasks and raises the level
of abstraction for Flink programs. Tables are available for both
static and streaming data sources (DataSet and DataStream APIs).
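
For example, a DataSet of case classes can be converted into a Table,
queried by attribute name, and converted back into a typed DataSet.
The following is a minimal sketch along the lines of the Table guide
linked below; the `WC` case class and its toy data are illustrative:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.api.scala.table._

case class WC(word: String, count: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements(WC("hello", 1), WC("hello", 1), WC("ciao", 1))

// The fields of the case class become the logical attributes of the Table.
val words = input.toTable

// Group by word, sum the counts, and convert back to a typed DataSet.
val result = words
  .groupBy('word)
  .select('word, 'count.sum as 'count)
  .toDataSet[WC]
```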
| |
| Check out the Table guide for Java and Scala |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/table.html). |
| |
| ### Gelly Graph Processing API |
| |
Gelly is a Java Graph API for Flink. It contains a set of utilities
for graph analysis, support for iterative graph processing, and a
library of graph algorithms. Gelly exposes a Graph data structure
that wraps DataSets for vertices and edges. It provides methods for
creating graphs from DataSets, graph transformations, utilities
(e.g., in- and out-degrees of vertices), neighborhood aggregations,
and iterative vertex-centric graph processing, as well as a library
of common graph algorithms, including PageRank, SSSP, label
propagation, and community detection.
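
As a small illustration of the Graph abstraction, here is a sketch in
Scala against the Java Gelly API that builds a graph from vertex and
edge DataSets and computes vertex degrees; the IDs and values are
placeholders:

```scala
import java.lang.{Double => JDouble, Long => JLong}
import org.apache.flink.api.java.ExecutionEnvironment
import org.apache.flink.graph.{Edge, Graph, Vertex}

val env = ExecutionEnvironment.getExecutionEnvironment

// Vertices carry an ID and a value; edges carry source, target, and a value.
val vertices = env.fromElements(
  new Vertex[JLong, JDouble](1L, 1.0),
  new Vertex[JLong, JDouble](2L, 1.0),
  new Vertex[JLong, JDouble](3L, 1.0))

val edges = env.fromElements(
  new Edge[JLong, JDouble](1L, 2L, 0.5),
  new Edge[JLong, JDouble](2L, 3L, 0.5))

// Wrap the two DataSets in a Graph and compute per-vertex degrees.
val graph = Graph.fromDataSet(vertices, edges, env)
val degrees = graph.getDegrees() // DataSet of (vertexId, degree) pairs
```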
| |
Gelly internally builds on top of Flink’s [delta
iterations]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/iterations.html). Iterative
graph algorithms are executed by leveraging mutable state, achieving
performance comparable to specialized graph processing systems.
| |
| Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check |
| out the Gelly guide |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/gelly.html). |
| |
| ### Flink Machine Learning Library |
| |
| This release includes the first version of Flink’s Machine Learning |
| library. The library’s pipeline approach, which has been strongly |
| inspired by scikit-learn’s abstraction of transformers and estimators, |
| makes it easy to quickly set up a data processing pipeline and to get |
| your job done. |
| |
Flink distinguishes between transformers and learners. Transformers
are components which transform your input data into a new format,
allowing you to extract features, cleanse your data, or sample from
it. Learners, on the other hand, are components which take your input
data and train a model on it. The model you obtain from the learner
can then be evaluated and used to make predictions on unseen data.
| |
Currently, the machine learning library contains transformers and
learners for several tasks. It supports multiple linear regression,
using a stochastic gradient descent implementation to scale to large
data sizes. Furthermore, it includes an alternating least squares
(ALS) implementation to factorize large matrices. The matrix
factorization can be used to do collaborative filtering. An
implementation of the communication-efficient distributed dual
coordinate ascent (CoCoA) algorithm is the latest addition to the
library. The CoCoA algorithm can be used to train distributed
soft-margin SVMs.
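
As a taste of the pipeline style, here is a minimal sketch that fits a
multiple linear regression model on a toy data set and uses it for
predictions. The toy data is illustrative, and the parameter and
method names (`setStepsize`, `setIterations`, `fit`, `predict`) and the
argument order of `LabeledVector` follow the 0.9 line of FlinkML and
may differ slightly in this preview:

```scala
import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

val env = ExecutionEnvironment.getExecutionEnvironment

// Toy training data: label = 2 * feature.
val training = env.fromElements(
  LabeledVector(2.0, DenseVector(1.0)),
  LabeledVector(4.0, DenseVector(2.0)),
  LabeledVector(6.0, DenseVector(3.0)))

// Configure the learner: step size and iteration count for the SGD solver.
val mlr = MultipleLinearRegression()
  .setStepsize(0.1)
  .setIterations(100)

// Train a model, then use it to predict labels for unseen feature vectors.
mlr.fit(training)
val predictions = mlr.predict(env.fromElements(DenseVector(4.0)))
```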
| |
| ### Flink on YARN leveraging Apache Tez |
| |
We are introducing a new execution mode that enables restricted Flink
programs to run on top of [Apache
Tez](http://tez.apache.org). This mode retains Flink’s APIs and
optimizer, as well as Flink’s runtime operators, but instead of
wrapping those in Flink tasks that are executed by Flink TaskManagers,
it wraps them in Tez runtime tasks and builds a Tez DAG that
represents the program.
| |
By using Flink on Tez, users have an additional choice of execution
platform for Flink programs. While Flink’s distributed runtime favors
low latency, streaming shuffles, and iterative algorithms, Tez focuses
on scalability and elastic resource usage in shared YARN clusters.
| |
| Get started with Flink on Tez |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/setup/flink_on_tez.html). |
| |
| ### Reworked Distributed Runtime on Akka |
| |
Flink’s RPC system has been replaced by the widely adopted
[Akka](http://akka.io) framework. Akka’s concurrency model offers the
right abstraction for developing a fast and robust distributed
system. By using Akka’s own failure detection mechanism, the stability
of Flink’s runtime is significantly improved, because the system can
now react properly to node outages. Furthermore, Akka improves
Flink’s scalability by introducing asynchronous messages to the
system. These asynchronous messages allow Flink to run on many more
nodes than before.
| |
| ### Exactly-once processing on Kafka Streaming Sources |
| |
This release introduces stream processing with exactly-once delivery
guarantees for Flink streaming programs that analyze streaming sources
persisted by [Apache Kafka](http://kafka.apache.org). The system
internally tracks the Kafka offsets to ensure that Flink can pick up
data from Kafka where it left off in case of a failure.
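
For illustration, here is a sketch of consuming a topic through the
persistent Kafka source from the Scala streaming API. The class names
(`PersistentKafkaSource`, `SimpleStringSchema`) and the constructor
shape follow the 0.9 line and may differ slightly in this preview; the
topic, group id, and ZooKeeper address are placeholders:

```scala
import java.util.Properties
import kafka.consumer.ConsumerConfig
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.api.persistent.PersistentKafkaSource
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Standard Kafka consumer properties; the source tracks the consumed
// offsets so that processing resumes where it left off after a failure.
val props = new Properties()
props.setProperty("zookeeper.connect", "localhost:2181") // placeholder
props.setProperty("group.id", "flink-demo-group")        // placeholder

val lines: DataStream[String] = env.addSource(
  new PersistentKafkaSource[String](
    "my-topic", // placeholder topic
    new SimpleStringSchema,
    new ConsumerConfig(props)))

lines.print()
env.execute("Persistent Kafka source example")
```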
| |
| Read |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/streaming_guide.html#apache-kafka) |
| on how to use the persistent Kafka source. |
| |
| ### Improved YARN support |
| |
Flink’s YARN client contains several improvements, such as a detached
mode for starting a YARN session in the background and the ability to
submit a single Flink job to a YARN cluster without starting a
session, including a “fire and forget” mode. Flink is now also able to
reallocate failed YARN containers to maintain the size of the
requested cluster. This feature allows users to implement
fault-tolerant setups on top of YARN. There is also an internal Java
API to deploy and control a running YARN cluster. This is being used
by system integrators to easily control Flink on YARN within their
Hadoop 2 clusters.
| |
| See the YARN docs |
| [here]({{< param DocsBaseUrl >}}flink-docs-master/setup/yarn_setup.html). |
| |
| ## More Improvements and Fixes |
| |
* [FLINK-1605](https://issues.apache.org/jira/browse/FLINK-1605):
Flink no longer exposes its Guava and ASM dependencies to Maven
projects depending on Flink. The maven-shade-plugin is used to
relocate these dependencies into Flink’s own namespace. This allows
users to use any Guava or ASM version.
| |
* [FLINK-1417](https://issues.apache.org/jira/browse/FLINK-1417):
Automatic recognition and registration of Java types at Kryo and the
internal serializers: Flink has its own type handling and
serialization framework, falling back to Kryo for types that it cannot
handle. To get the best performance, Flink automatically registers
with Kryo all types that a user uses in their program. Flink also
registers serializers for Protocol Buffers, Thrift, Avro and Joda-Time
automatically. Users can also manually register serializers with Kryo
([FLINK-1399](https://issues.apache.org/jira/browse/FLINK-1399)); see
the sketch at the end of this list.
| |
| * [FLINK-1296](https://issues.apache.org/jira/browse/FLINK-1296): Add |
| support for sorting very large records |
| |
| * [FLINK-1679](https://issues.apache.org/jira/browse/FLINK-1679): |
| "degreeOfParallelism" methods renamed to "parallelism" |
| |
| * [FLINK-1501](https://issues.apache.org/jira/browse/FLINK-1501): Add |
| metrics library for monitoring TaskManagers |
| |
| * [FLINK-1760](https://issues.apache.org/jira/browse/FLINK-1760): Add |
| support for building Flink with Scala 2.11 |
| |
| * [FLINK-1648](https://issues.apache.org/jira/browse/FLINK-1648): Add |
| a mode where the system automatically sets the parallelism to the |
| available task slots |
| |
| * [FLINK-1622](https://issues.apache.org/jira/browse/FLINK-1622): Add |
| groupCombine operator |
| |
| * [FLINK-1589](https://issues.apache.org/jira/browse/FLINK-1589): Add |
| option to pass Configuration to LocalExecutor |
| |
| * [FLINK-1504](https://issues.apache.org/jira/browse/FLINK-1504): Add |
| support for accessing secured HDFS clusters in standalone mode |
| |
| * [FLINK-1478](https://issues.apache.org/jira/browse/FLINK-1478): Add |
| strictly local input split assignment |
| |
* [FLINK-1512](https://issues.apache.org/jira/browse/FLINK-1512): Add
CsvReader for reading into POJOs
| |
| * [FLINK-1461](https://issues.apache.org/jira/browse/FLINK-1461): Add |
| sortPartition operator |
| |
* [FLINK-1450](https://issues.apache.org/jira/browse/FLINK-1450): Add
Fold operator to the Streaming API
| |
| * [FLINK-1389](https://issues.apache.org/jira/browse/FLINK-1389): |
| Allow setting custom file extensions for files created by the |
| FileOutputFormat |
| |
| * [FLINK-1236](https://issues.apache.org/jira/browse/FLINK-1236): Add |
| support for localization of Hadoop Input Splits |
| |
| * [FLINK-1179](https://issues.apache.org/jira/browse/FLINK-1179): Add |
| button to JobManager web interface to request stack trace of a |
| TaskManager |
| |
| * [FLINK-1105](https://issues.apache.org/jira/browse/FLINK-1105): Add |
| support for locally sorted output |
| |
| * [FLINK-1688](https://issues.apache.org/jira/browse/FLINK-1688): Add |
| socket sink |
| |
| * [FLINK-1436](https://issues.apache.org/jira/browse/FLINK-1436): |
| Improve usability of command line interface |
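
As a small illustration of the manual Kryo registration mentioned in
the FLINK-1417 item above, a sketch follows; the custom type is a
hypothetical placeholder:

```scala
import org.apache.flink.api.scala._

// Hypothetical user-defined type that Flink's own serializers cannot handle.
class MyCustomType(val id: Int, val payload: String)

val env = ExecutionEnvironment.getExecutionEnvironment

// Eagerly register the type with Kryo, so that serialized records carry a
// compact registration ID instead of the full class name.
env.getConfig.registerKryoType(classOf[MyCustomType])

// A custom Kryo serializer class can be attached to a type the same way:
// env.getConfig.registerTypeWithKryoSerializer(classOf[MyCustomType], classOf[MySerializer])
```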