docs/content/posts/2015-04-13-release-0.9.0-milestone1.md - flink-web - Git at Google

 ---
 date: "2015-04-13T10:00:00Z"
 title: Announcing Flink 0.9.0-milestone1 preview release
 aliases:
 - /news/2015/04/13/release-0.9.0-milestone1.html
 ---

 The Apache Flink community is pleased to announce the availability of
 the 0.9.0-milestone-1 release. The release is a preview of the
 upcoming 0.9.0 release. It contains many new features which will be
 available in the upcoming 0.9 release. Interested users are encouraged
 to try it out and give feedback. As the version number indicates, this
 release is a preview release that contains known issues.

 You can download the release
 [here](http://flink.apache.org/downloads.html#preview) and check out the
 latest documentation
 [here]({{< param DocsBaseUrl >}}flink-docs-master/). Feedback
 through the Flink [mailing
 lists](http://flink.apache.org/community.html#mailing-lists) is, as
 always, very welcome!

 ## New Features

 ### Table API

 Flink’s new Table API offers a higher-level abstraction for
 interacting with structured data sources. The Table API allows users
 to execute logical, SQL-like queries on distributed data sets while
 allowing them to freely mix declarative queries with regular Flink
 operators. Here is an example that groups and joins two tables:

 ```scala
 val clickCounts = clicks
   .groupBy('user).select('userId, 'url.count as 'count)

 val activeUsers = users.join(clickCounts)
   .where('id === 'userId && 'count > 10).select('username, 'count, ...)
 ```

 Tables consist of logical attributes that can be selected by name
 rather than physical Java and Scala data types. This alleviates a lot
 of boilerplate code for common ETL tasks and raises the abstraction
 for Flink programs. Tables are available for both static and streaming
 data sources (DataSet and DataStream APIs).

 Check out the Table guide for Java and Scala
 [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/table.html).

 ### Gelly Graph Processing API

 Gelly is a Java Graph API for Flink. It contains a set of utilities
 for graph analysis, support for iterative graph processing and a
 library of graph algorithms. Gelly exposes a Graph data structure that
 wraps DataSets for vertices and edges, as well as methods for creating
 graphs from DataSets, graph transformations and utilities (e.g., in-
 and out- degrees of vertices), neighborhood aggregations, iterative
 vertex-centric graph processing, as well as a library of common graph
 algorithms, including PageRank, SSSP, label propagation, and community
 detection.

 Gelly internally builds on top of Flink’s [delta
 iterations]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/iterations.html). Iterative
 graph algorithms are executed leveraging mutable state, achieving
 similar performance with specialized graph processing systems.

 Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check
 out the Gelly guide
 [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/gelly.html).

 ### Flink Machine Learning Library

 This release includes the first version of Flink’s Machine Learning
 library. The library’s pipeline approach, which has been strongly
 inspired by scikit-learn’s abstraction of transformers and estimators,
 makes it easy to quickly set up a data processing pipeline and to get
 your job done.

 Flink distinguishes between transformers and learners. Transformers
 are components which transform your input data into a new format
 allowing you to extract features, cleanse your data or to sample from
 it. Learners on the other hand constitute the components which take
 your input data and train a model on it. The model you obtain from the
 learner can then be evaluated and used to make predictions on unseen
 data.

 Currently, the machine learning library contains transformers and
 learners to do multiple tasks. The library supports multiple linear
 regression using a stochastic gradient implementation to scale to
 large data sizes. Furthermore, it includes an alternating least
 squares (ALS) implementation to factorizes large matrices. The matrix
 factorization can be used to do collaborative filtering. An
 implementation of the communication efficient distributed dual
 coordinate ascent (CoCoA) algorithm is the latest addition to the
 library. The CoCoA algorithm can be used to train distributed
 soft-margin SVMs.

 ### Flink on YARN leveraging Apache Tez

 We are introducing a new execution mode for Flink to be able to run
 restricted Flink programs on top of [Apache
 Tez](http://tez.apache.org). This mode retains Flink’s APIs,
 optimizer, as well as Flink’s runtime operators, but instead of
 wrapping those in Flink tasks that are executed by Flink TaskManagers,
 it wraps them in Tez runtime tasks and builds a Tez DAG that
 represents the program.

 By using Flink on Tez, users have an additional choice for an
 execution platform for Flink programs. While Flink’s distributed
 runtime favors low latency, streaming shuffles, and iterative
 algorithms, Tez focuses on scalability and elastic resource usage in
 shared YARN clusters.

 Get started with Flink on Tez
 [here]({{< param DocsBaseUrl >}}flink-docs-master/setup/flink_on_tez.html).

 ### Reworked Distributed Runtime on Akka

 Flink’s RPC system has been replaced by the widely adopted
 [Akka](http://akka.io) framework. Akka’s concurrency model offers the
 right abstraction to develop a fast as well as robust distributed
 system. By using Akka’s own failure detection mechanism the stability
 of Flink’s runtime is significantly improved, because the system can
 now react in proper form to node outages. Furthermore, Akka improves
 Flink’s scalability by introducing asynchronous messages to the
 system. These asynchronous messages allow Flink to be run on many more
 nodes than before.

 ### Exactly-once processing on Kafka Streaming Sources

 This release introduces stream processing with exacly-once delivery
 guarantees for Flink streaming programs that analyze streaming sources
 that are persisted by [Apache Kafka](http://kafka.apache.org). The
 system is internally tracking the Kafka offsets to ensure that Flink
 can pick up data from Kafka where it left off in case of an failure.

 Read
 [here]({{< param DocsBaseUrl >}}flink-docs-master/apis/streaming_guide.html#apache-kafka)
 on how to use the persistent Kafka source.

 ### Improved YARN support

 Flink’s YARN client contains several improvements, such as a detached
 mode for starting a YARN session in the background, the ability to
 submit a single Flink job to a YARN cluster without starting a
 session, including a “fire and forget” mode. Flink is now also able to
 reallocate failed YARN containers to maintain the size of the
 requested cluster. This feature allows to implement fault-tolerant
 setups on top of YARN. There is also an internal Java API to deploy
 and control a running YARN cluster. This is being used by system
 integrators to easily control Flink on YARN within their Hadoop 2
 cluster.

 See the YARN docs
 [here]({{< param DocsBaseUrl >}}flink-docs-master/setup/yarn_setup.html).

 ## More Improvements and Fixes

 * [FLINK-1605](https://issues.apache.org/jira/browse/FLINK-1605):
   Flink is not exposing its Guava and ASM dependencies to Maven
   projects depending on Flink. We use the maven-shade-plugin to
   relocate these dependencies into our own namespace. This allows
   users to use any Guava or ASM version.

 * [FLINK-1417](https://issues.apache.org/jira/browse/FLINK-1605):
 Automatic recognition and registration of Java Types at Kryo and the
 internal serializers: Flink has its own type handling and
 serialization framework falling back to Kryo for types that it cannot
 handle. To get the best performance Flink is automatically registering
 all types a user is using in their program with Kryo.Flink also
 registers serializers for Protocol Buffers, Thrift, Avro and YodaTime
 automatically.  Users can also manually register serializers to Kryo
 (https://issues.apache.org/jira/browse/FLINK-1399)

 * [FLINK-1296](https://issues.apache.org/jira/browse/FLINK-1296): Add
   support for sorting very large records

 * [FLINK-1679](https://issues.apache.org/jira/browse/FLINK-1679):
   "degreeOfParallelism" methods renamed to "parallelism"

 * [FLINK-1501](https://issues.apache.org/jira/browse/FLINK-1501): Add
   metrics library for monitoring TaskManagers

 * [FLINK-1760](https://issues.apache.org/jira/browse/FLINK-1760): Add
   support for building Flink with Scala 2.11

 * [FLINK-1648](https://issues.apache.org/jira/browse/FLINK-1648): Add
   a mode where the system automatically sets the parallelism to the
   available task slots

 * [FLINK-1622](https://issues.apache.org/jira/browse/FLINK-1622): Add
   groupCombine operator

 * [FLINK-1589](https://issues.apache.org/jira/browse/FLINK-1589): Add
   option to pass Configuration to LocalExecutor

 * [FLINK-1504](https://issues.apache.org/jira/browse/FLINK-1504): Add
   support for accessing secured HDFS clusters in standalone mode

 * [FLINK-1478](https://issues.apache.org/jira/browse/FLINK-1478): Add
   strictly local input split assignment

 * [FLINK-1512](https://issues.apache.org/jira/browse/FLINK-1512): Add
   CsvReader for reading into POJOs.

 * [FLINK-1461](https://issues.apache.org/jira/browse/FLINK-1461): Add
   sortPartition operator

 * [FLINK-1450](https://issues.apache.org/jira/browse/FLINK-1450): Add
   Fold operator to the Streaming api

 * [FLINK-1389](https://issues.apache.org/jira/browse/FLINK-1389):
   Allow setting custom file extensions for files created by the
   FileOutputFormat

 * [FLINK-1236](https://issues.apache.org/jira/browse/FLINK-1236): Add
   support for localization of Hadoop Input Splits

 * [FLINK-1179](https://issues.apache.org/jira/browse/FLINK-1179): Add
   button to JobManager web interface to request stack trace of a
   TaskManager

 * [FLINK-1105](https://issues.apache.org/jira/browse/FLINK-1105): Add
   support for locally sorted output

 * [FLINK-1688](https://issues.apache.org/jira/browse/FLINK-1688): Add
   socket sink

 * [FLINK-1436](https://issues.apache.org/jira/browse/FLINK-1436):
   Improve usability of command line interface
	---
	date: "2015-04-13T10:00:00Z"
	title: Announcing Flink 0.9.0-milestone1 preview release
	aliases:
	- /news/2015/04/13/release-0.9.0-milestone1.html
	---

	The Apache Flink community is pleased to announce the availability of
	the 0.9.0-milestone-1 release. The release is a preview of the
	upcoming 0.9.0 release. It contains many new features which will be
	available in the upcoming 0.9 release. Interested users are encouraged
	to try it out and give feedback. As the version number indicates, this
	release is a preview release that contains known issues.

	You can download the release
	[here](http://flink.apache.org/downloads.html#preview) and check out the
	latest documentation
	[here]({{< param DocsBaseUrl >}}flink-docs-master/). Feedback
	through the Flink [mailing
	lists](http://flink.apache.org/community.html#mailing-lists) is, as
	always, very welcome!

	## New Features

	### Table API

	Flink’s new Table API offers a higher-level abstraction for
	interacting with structured data sources. The Table API allows users
	to execute logical, SQL-like queries on distributed data sets while
	allowing them to freely mix declarative queries with regular Flink
	operators. Here is an example that groups and joins two tables:

	```scala
	val clickCounts = clicks
	.groupBy('user).select('userId, 'url.count as 'count)

	val activeUsers = users.join(clickCounts)
	.where('id === 'userId && 'count > 10).select('username, 'count, ...)
	```

	Tables consist of logical attributes that can be selected by name
	rather than physical Java and Scala data types. This alleviates a lot
	of boilerplate code for common ETL tasks and raises the abstraction
	for Flink programs. Tables are available for both static and streaming
	data sources (DataSet and DataStream APIs).

	Check out the Table guide for Java and Scala
	[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/table.html).

	### Gelly Graph Processing API

	Gelly is a Java Graph API for Flink. It contains a set of utilities
	for graph analysis, support for iterative graph processing and a
	library of graph algorithms. Gelly exposes a Graph data structure that
	wraps DataSets for vertices and edges, as well as methods for creating
	graphs from DataSets, graph transformations and utilities (e.g., in-
	and out- degrees of vertices), neighborhood aggregations, iterative
	vertex-centric graph processing, as well as a library of common graph
	algorithms, including PageRank, SSSP, label propagation, and community
	detection.

	Gelly internally builds on top of Flink’s [delta
	iterations]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/iterations.html). Iterative
	graph algorithms are executed leveraging mutable state, achieving
	similar performance with specialized graph processing systems.

	Gelly will eventually subsume Spargel, Flink’s Pregel-like API. Check
	out the Gelly guide
	[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/gelly.html).

	### Flink Machine Learning Library

	This release includes the first version of Flink’s Machine Learning
	library. The library’s pipeline approach, which has been strongly
	inspired by scikit-learn’s abstraction of transformers and estimators,
	makes it easy to quickly set up a data processing pipeline and to get
	your job done.

	Flink distinguishes between transformers and learners. Transformers
	are components which transform your input data into a new format
	allowing you to extract features, cleanse your data or to sample from
	it. Learners on the other hand constitute the components which take
	your input data and train a model on it. The model you obtain from the
	learner can then be evaluated and used to make predictions on unseen
	data.

	Currently, the machine learning library contains transformers and
	learners to do multiple tasks. The library supports multiple linear
	regression using a stochastic gradient implementation to scale to
	large data sizes. Furthermore, it includes an alternating least
	squares (ALS) implementation to factorizes large matrices. The matrix
	factorization can be used to do collaborative filtering. An
	implementation of the communication efficient distributed dual
	coordinate ascent (CoCoA) algorithm is the latest addition to the
	library. The CoCoA algorithm can be used to train distributed
	soft-margin SVMs.

	### Flink on YARN leveraging Apache Tez

	We are introducing a new execution mode for Flink to be able to run
	restricted Flink programs on top of [Apache
	Tez](http://tez.apache.org). This mode retains Flink’s APIs,
	optimizer, as well as Flink’s runtime operators, but instead of
	wrapping those in Flink tasks that are executed by Flink TaskManagers,
	it wraps them in Tez runtime tasks and builds a Tez DAG that
	represents the program.

	By using Flink on Tez, users have an additional choice for an
	execution platform for Flink programs. While Flink’s distributed
	runtime favors low latency, streaming shuffles, and iterative
	algorithms, Tez focuses on scalability and elastic resource usage in
	shared YARN clusters.

	Get started with Flink on Tez
	[here]({{< param DocsBaseUrl >}}flink-docs-master/setup/flink_on_tez.html).

	### Reworked Distributed Runtime on Akka

	Flink’s RPC system has been replaced by the widely adopted
	[Akka](http://akka.io) framework. Akka’s concurrency model offers the
	right abstraction to develop a fast as well as robust distributed
	system. By using Akka’s own failure detection mechanism the stability
	of Flink’s runtime is significantly improved, because the system can
	now react in proper form to node outages. Furthermore, Akka improves
	Flink’s scalability by introducing asynchronous messages to the
	system. These asynchronous messages allow Flink to be run on many more
	nodes than before.

	### Exactly-once processing on Kafka Streaming Sources

	This release introduces stream processing with exacly-once delivery
	guarantees for Flink streaming programs that analyze streaming sources
	that are persisted by [Apache Kafka](http://kafka.apache.org). The
	system is internally tracking the Kafka offsets to ensure that Flink
	can pick up data from Kafka where it left off in case of an failure.

	Read
	[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/streaming_guide.html#apache-kafka)
	on how to use the persistent Kafka source.

	### Improved YARN support

	Flink’s YARN client contains several improvements, such as a detached
	mode for starting a YARN session in the background, the ability to
	submit a single Flink job to a YARN cluster without starting a
	session, including a “fire and forget” mode. Flink is now also able to
	reallocate failed YARN containers to maintain the size of the
	requested cluster. This feature allows to implement fault-tolerant
	setups on top of YARN. There is also an internal Java API to deploy
	and control a running YARN cluster. This is being used by system
	integrators to easily control Flink on YARN within their Hadoop 2
	cluster.

	See the YARN docs
	[here]({{< param DocsBaseUrl >}}flink-docs-master/setup/yarn_setup.html).

	## More Improvements and Fixes

	* [FLINK-1605](https://issues.apache.org/jira/browse/FLINK-1605):
	Flink is not exposing its Guava and ASM dependencies to Maven
	projects depending on Flink. We use the maven-shade-plugin to
	relocate these dependencies into our own namespace. This allows
	users to use any Guava or ASM version.

	* [FLINK-1417](https://issues.apache.org/jira/browse/FLINK-1605):
	Automatic recognition and registration of Java Types at Kryo and the
	internal serializers: Flink has its own type handling and
	serialization framework falling back to Kryo for types that it cannot
	handle. To get the best performance Flink is automatically registering
	all types a user is using in their program with Kryo.Flink also
	registers serializers for Protocol Buffers, Thrift, Avro and YodaTime
	automatically. Users can also manually register serializers to Kryo
	(https://issues.apache.org/jira/browse/FLINK-1399)

	* [FLINK-1296](https://issues.apache.org/jira/browse/FLINK-1296): Add
	support for sorting very large records

	* [FLINK-1679](https://issues.apache.org/jira/browse/FLINK-1679):
	"degreeOfParallelism" methods renamed to "parallelism"

	* [FLINK-1501](https://issues.apache.org/jira/browse/FLINK-1501): Add
	metrics library for monitoring TaskManagers

	* [FLINK-1760](https://issues.apache.org/jira/browse/FLINK-1760): Add
	support for building Flink with Scala 2.11

	* [FLINK-1648](https://issues.apache.org/jira/browse/FLINK-1648): Add
	a mode where the system automatically sets the parallelism to the
	available task slots

	* [FLINK-1622](https://issues.apache.org/jira/browse/FLINK-1622): Add
	groupCombine operator

	* [FLINK-1589](https://issues.apache.org/jira/browse/FLINK-1589): Add
	option to pass Configuration to LocalExecutor

	* [FLINK-1504](https://issues.apache.org/jira/browse/FLINK-1504): Add
	support for accessing secured HDFS clusters in standalone mode

	* [FLINK-1478](https://issues.apache.org/jira/browse/FLINK-1478): Add
	strictly local input split assignment

	* [FLINK-1512](https://issues.apache.org/jira/browse/FLINK-1512): Add
	CsvReader for reading into POJOs.

	* [FLINK-1461](https://issues.apache.org/jira/browse/FLINK-1461): Add
	sortPartition operator

	* [FLINK-1450](https://issues.apache.org/jira/browse/FLINK-1450): Add
	Fold operator to the Streaming api

	* [FLINK-1389](https://issues.apache.org/jira/browse/FLINK-1389):
	Allow setting custom file extensions for files created by the
	FileOutputFormat

	* [FLINK-1236](https://issues.apache.org/jira/browse/FLINK-1236): Add
	support for localization of Hadoop Input Splits

	* [FLINK-1179](https://issues.apache.org/jira/browse/FLINK-1179): Add
	button to JobManager web interface to request stack trace of a
	TaskManager

	* [FLINK-1105](https://issues.apache.org/jira/browse/FLINK-1105): Add
	support for locally sorted output

	* [FLINK-1688](https://issues.apache.org/jira/browse/FLINK-1688): Add
	socket sink

	* [FLINK-1436](https://issues.apache.org/jira/browse/FLINK-1436):
	Improve usability of command line interface