blob: cff0d2d9630f1568ed1b93486595ea5feac6e129 [file] [log] [blame] [view]
---
date: "2015-04-13T10:00:00Z"
title: Announcing Flink 0.9.0-milestone1 preview release
aliases:
- /news/2015/04/13/release-0.9.0-milestone1.html
---
The Apache Flink community is pleased to announce the availability of
the 0.9.0-milestone-1 release. The release is a preview of the
upcoming 0.9.0 release. It contains many new features which will be
available in the upcoming 0.9 release. Interested users are encouraged
to try it out and give feedback. As the version number indicates, this
release is a preview release that contains known issues.
You can download the release
[here](http://flink.apache.org/downloads.html#preview) and check out the
latest documentation
[here]({{< param DocsBaseUrl >}}flink-docs-master/). Feedback
through the Flink [mailing
lists](http://flink.apache.org/community.html#mailing-lists) is, as
always, very welcome!
## New Features
### Table API
Flinks new Table API offers a higher-level abstraction for
interacting with structured data sources. The Table API allows users
to execute logical, SQL-like queries on distributed data sets while
allowing them to freely mix declarative queries with regular Flink
operators. Here is an example that groups and joins two tables:
```scala
val clickCounts = clicks
.groupBy('user).select('userId, 'url.count as 'count)
val activeUsers = users.join(clickCounts)
.where('id === 'userId && 'count > 10).select('username, 'count, ...)
```
Tables consist of logical attributes that can be selected by name
rather than physical Java and Scala data types. This alleviates a lot
of boilerplate code for common ETL tasks and raises the abstraction
for Flink programs. Tables are available for both static and streaming
data sources (DataSet and DataStream APIs).
Check out the Table guide for Java and Scala
[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/table.html).
### Gelly Graph Processing API
Gelly is a Java Graph API for Flink. It contains a set of utilities
for graph analysis, support for iterative graph processing and a
library of graph algorithms. Gelly exposes a Graph data structure that
wraps DataSets for vertices and edges, as well as methods for creating
graphs from DataSets, graph transformations and utilities (e.g., in-
and out- degrees of vertices), neighborhood aggregations, iterative
vertex-centric graph processing, as well as a library of common graph
algorithms, including PageRank, SSSP, label propagation, and community
detection.
Gelly internally builds on top of Flinks [delta
iterations]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/iterations.html). Iterative
graph algorithms are executed leveraging mutable state, achieving
similar performance with specialized graph processing systems.
Gelly will eventually subsume Spargel, Flinks Pregel-like API. Check
out the Gelly guide
[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/batch/libs/gelly.html).
### Flink Machine Learning Library
This release includes the first version of Flinks Machine Learning
library. The librarys pipeline approach, which has been strongly
inspired by scikit-learns abstraction of transformers and estimators,
makes it easy to quickly set up a data processing pipeline and to get
your job done.
Flink distinguishes between transformers and learners. Transformers
are components which transform your input data into a new format
allowing you to extract features, cleanse your data or to sample from
it. Learners on the other hand constitute the components which take
your input data and train a model on it. The model you obtain from the
learner can then be evaluated and used to make predictions on unseen
data.
Currently, the machine learning library contains transformers and
learners to do multiple tasks. The library supports multiple linear
regression using a stochastic gradient implementation to scale to
large data sizes. Furthermore, it includes an alternating least
squares (ALS) implementation to factorizes large matrices. The matrix
factorization can be used to do collaborative filtering. An
implementation of the communication efficient distributed dual
coordinate ascent (CoCoA) algorithm is the latest addition to the
library. The CoCoA algorithm can be used to train distributed
soft-margin SVMs.
### Flink on YARN leveraging Apache Tez
We are introducing a new execution mode for Flink to be able to run
restricted Flink programs on top of [Apache
Tez](http://tez.apache.org). This mode retains Flink’s APIs,
optimizer, as well as Flinks runtime operators, but instead of
wrapping those in Flink tasks that are executed by Flink TaskManagers,
it wraps them in Tez runtime tasks and builds a Tez DAG that
represents the program.
By using Flink on Tez, users have an additional choice for an
execution platform for Flink programs. While Flinks distributed
runtime favors low latency, streaming shuffles, and iterative
algorithms, Tez focuses on scalability and elastic resource usage in
shared YARN clusters.
Get started with Flink on Tez
[here]({{< param DocsBaseUrl >}}flink-docs-master/setup/flink_on_tez.html).
### Reworked Distributed Runtime on Akka
Flinks RPC system has been replaced by the widely adopted
[Akka](http://akka.io) framework. Akka’s concurrency model offers the
right abstraction to develop a fast as well as robust distributed
system. By using Akkas own failure detection mechanism the stability
of Flinks runtime is significantly improved, because the system can
now react in proper form to node outages. Furthermore, Akka improves
Flinks scalability by introducing asynchronous messages to the
system. These asynchronous messages allow Flink to be run on many more
nodes than before.
### Exactly-once processing on Kafka Streaming Sources
This release introduces stream processing with exacly-once delivery
guarantees for Flink streaming programs that analyze streaming sources
that are persisted by [Apache Kafka](http://kafka.apache.org). The
system is internally tracking the Kafka offsets to ensure that Flink
can pick up data from Kafka where it left off in case of an failure.
Read
[here]({{< param DocsBaseUrl >}}flink-docs-master/apis/streaming_guide.html#apache-kafka)
on how to use the persistent Kafka source.
### Improved YARN support
Flinks YARN client contains several improvements, such as a detached
mode for starting a YARN session in the background, the ability to
submit a single Flink job to a YARN cluster without starting a
session, including a fire and forget mode. Flink is now also able to
reallocate failed YARN containers to maintain the size of the
requested cluster. This feature allows to implement fault-tolerant
setups on top of YARN. There is also an internal Java API to deploy
and control a running YARN cluster. This is being used by system
integrators to easily control Flink on YARN within their Hadoop 2
cluster.
See the YARN docs
[here]({{< param DocsBaseUrl >}}flink-docs-master/setup/yarn_setup.html).
## More Improvements and Fixes
* [FLINK-1605](https://issues.apache.org/jira/browse/FLINK-1605):
Flink is not exposing its Guava and ASM dependencies to Maven
projects depending on Flink. We use the maven-shade-plugin to
relocate these dependencies into our own namespace. This allows
users to use any Guava or ASM version.
* [FLINK-1417](https://issues.apache.org/jira/browse/FLINK-1605):
Automatic recognition and registration of Java Types at Kryo and the
internal serializers: Flink has its own type handling and
serialization framework falling back to Kryo for types that it cannot
handle. To get the best performance Flink is automatically registering
all types a user is using in their program with Kryo.Flink also
registers serializers for Protocol Buffers, Thrift, Avro and YodaTime
automatically. Users can also manually register serializers to Kryo
(https://issues.apache.org/jira/browse/FLINK-1399)
* [FLINK-1296](https://issues.apache.org/jira/browse/FLINK-1296): Add
support for sorting very large records
* [FLINK-1679](https://issues.apache.org/jira/browse/FLINK-1679):
"degreeOfParallelism" methods renamed to "parallelism"
* [FLINK-1501](https://issues.apache.org/jira/browse/FLINK-1501): Add
metrics library for monitoring TaskManagers
* [FLINK-1760](https://issues.apache.org/jira/browse/FLINK-1760): Add
support for building Flink with Scala 2.11
* [FLINK-1648](https://issues.apache.org/jira/browse/FLINK-1648): Add
a mode where the system automatically sets the parallelism to the
available task slots
* [FLINK-1622](https://issues.apache.org/jira/browse/FLINK-1622): Add
groupCombine operator
* [FLINK-1589](https://issues.apache.org/jira/browse/FLINK-1589): Add
option to pass Configuration to LocalExecutor
* [FLINK-1504](https://issues.apache.org/jira/browse/FLINK-1504): Add
support for accessing secured HDFS clusters in standalone mode
* [FLINK-1478](https://issues.apache.org/jira/browse/FLINK-1478): Add
strictly local input split assignment
* [FLINK-1512](https://issues.apache.org/jira/browse/FLINK-1512): Add
CsvReader for reading into POJOs.
* [FLINK-1461](https://issues.apache.org/jira/browse/FLINK-1461): Add
sortPartition operator
* [FLINK-1450](https://issues.apache.org/jira/browse/FLINK-1450): Add
Fold operator to the Streaming api
* [FLINK-1389](https://issues.apache.org/jira/browse/FLINK-1389):
Allow setting custom file extensions for files created by the
FileOutputFormat
* [FLINK-1236](https://issues.apache.org/jira/browse/FLINK-1236): Add
support for localization of Hadoop Input Splits
* [FLINK-1179](https://issues.apache.org/jira/browse/FLINK-1179): Add
button to JobManager web interface to request stack trace of a
TaskManager
* [FLINK-1105](https://issues.apache.org/jira/browse/FLINK-1105): Add
support for locally sorted output
* [FLINK-1688](https://issues.apache.org/jira/browse/FLINK-1688): Add
socket sink
* [FLINK-1436](https://issues.apache.org/jira/browse/FLINK-1436):
Improve usability of command line interface