| --- |
| authors: |
| - name: Zhipeng Zhang |
| zhangzhipeng: null |
| - lindong: null |
| name: Dong Lin |
| date: "2022-07-12T08:00:00Z" |
| excerpt: The Apache Flink community is excited to announce the release of Flink ML |
| 2.1.0! This release focuses on improving Flink ML's infrastructure, such as Python |
| SDK, memory management, and benchmark framework, to facilitate the development of |
| performant, memory-safe, and easy-to-use algorithm libraries. We validated the enhanced |
| infrastructure via benchmarks and confirmed that Flink ML can meet or exceed the |
| performance of selected algorithms from alternative popular ML libraries. In addition, |
| this release added example Python and Java programs to help users learn and use |
| Flink ML. |
| title: Apache Flink ML 2.1.0 Release Announcement |
| aliases: |
| - /news/2022/07/12/release-ml-2.1.0.html |
| --- |
| |
| The Apache Flink community is excited to announce the release of Flink ML 2.1.0! |
| This release focuses on improving Flink ML's infrastructure, such as Python SDK, |
| memory management, and benchmark framework, to facilitate the development of |
| performant, memory-safe, and easy-to-use algorithm libraries. We validated the |
| enhanced infrastructure by implementing, benchmarking, and optimizing 10 new |
| algorithms in Flink ML, and confirmed that Flink ML can meet or exceed the |
| performance of selected algorithms from alternative popular ML libraries. |
| In addition, this release added example Python and Java programs for each |
| algorithm in the library to help users learn and use Flink ML. |
| |
| With the improvements and performance benchmarks made in this release, we |
| believe Flink ML's infrastructure is ready for use by the interested developers |
| in the community to build performant pythonic machine learning libraries. |
| |
| We encourage you to [download the release](https://flink.apache.org/downloads.html) |
| and share your feedback with the community through the Flink |
| [mailing lists](https://flink.apache.org/community.html#mailing-lists) or |
| [JIRA](https://issues.apache.org/jira/browse/flink)! We hope you like the new |
| release and we’d be eager to learn about your experience with it. |
| |
| # Notable Features |
| |
| ## API and Infrastructure |
| |
| ### Supporting fine-grained per-operator memory management |
| |
| Before this release, algorithm operators with internal states (e.g. the training |
| data to be replayed for each round of iteration) store state data using the |
| state-backend API (e.g. ListState). Such an operator either needs to store all |
| data in memory, which risks OOM, or it needs to always store data on disk. |
| In the latter case, it needs to read and de-serialize all data from disks |
| repeatedly in each round of iteration even if the data can fit in RAM, leading |
| to sub-optimal performance when the training data size is small. This makes it |
| hard for developers to write performant and memory-safe operators. |
| |
| This release enhances the Flink ML infrastructure with the mechanism to specify |
| the amount of managed memory that an operator can consume. This allows algorithm |
| operators to write and read data from managed memory when the data size is below |
| the quota, and automatically spill those data that exceeds the memory quota to |
| disks to avoid OOM. Algorithm developers can use this mechanism to achieve |
| optimal algorithm performance as input data size varies. Please feel free to |
| check out the implementation of the KMeans operator for example. |
| |
| ### Improved infrastructure for developing online learning algorithms |
| |
| A key objective of Flink ML is to facilitate the development of online learning |
| applications. In the last release, we enhanced the Flink ML API with |
| setModelData() and getModelData(), which allows users of online learning |
| algorithms to transmit and persist model data as unbounded data streams. |
| This release continues the effort by improving and validating the infrastructure |
| needed to develop online learning algorithms. |
| |
| Specifically, this release added two online learning algorithm prototypes (i.e. |
| OnlineKMeans and OnlineLogisticRegression) with tests covering the entire |
| lifecycle of using these algorithms. These two algorithms introduce concepts |
| such as global batch size and model version, together with metrics and APIs to |
| set and get those values. While the online algorithm prototypes have not been |
| optimized for prediction accuracy yet, this line of work is an important step |
| toward setting up best practices for building online learning algorithms in |
| Flink ML. We hope more contributors from the community can join this effort. |
| |
| ### Algorithm benchmark framework |
| |
| An easy-to-use benchmark framework is critical to developing and maintaining |
| performant algorithm libraries in Flink ML. This release added a benchmark |
| framework that provides APIs to write pluggable and reusable data generators, |
| takes benchmark configuration in JSON format, and outputs benchmark results in |
| JSON format to enable custom analysis. An off-the-shelf script is provided to |
| visualize benchmark results using Matplotlib. Feel free to check out this |
| [README](https://github.com/apache/flink-ml/blob/release-2.1/flink-ml-benchmark/README.md) |
| for instructions on how to use this benchmark framework. |
| |
| The benchmark framework currently supports evaluating algorithm throughput. |
| In the future release, we plan to support evaluating algorithm latency and |
| accuracy. |
| |
| ## Python SDK |
| |
| This release enhances the Python SDK so that operators in the Flink ML Python |
| library can invoke the corresponding operators in the Java library. The Python |
| operator is a thin-wrapper around the Java operator and delivers the same |
| performance as the Java operator during execution. This capability significantly |
| improves developer velocity by allowing algorithm developers to maintain both |
| the Python and the Java libraries of algorithms without having to implement |
| those algorithms twice. |
| |
| ## Algorithm Library |
| This release continues to extend the algorithm library in Flink ML, with the |
| focus on validating the functionalities and the performance of Flink ML |
| infrastructure using representative algorithms in different categories. |
| |
| Below are the lists of algorithms newly supported in this release, grouped by |
| their categories: |
| |
| - Feature engineering (MinMaxScaler, StringIndexer, VectorAssembler, |
| StandardScaler, Bucketizer) |
| - Online learning (OnlineKmeans, OnlineLogisiticRegression) |
| - Regression (LinearRegression) |
| - Classification (LinearSVC) |
| - Evaluation (BinaryClassificationEvaluator) |
| |
| Example Python and Java programs for these algorithms are provided on the Apache |
| Flink ML [website](https://nightlies.apache.org/flink/flink-ml-docs-release-2.1/) to |
| help users learn and try out Flink ML. And we also provided example benchmark |
| [configuration files](https://github.com/apache/flink-ml/tree/release-2.1/flink-ml-benchmark/src/main/resources) |
| in the repo for users to validate Flink ML performance. Feel free to check out |
| this [README](https://github.com/apache/flink-ml/blob/release-2.1/flink-ml-benchmark/README.md) |
| for instructions on how to run those benchmarks. |
| |
| # Upgrade Notes |
| |
| Please review this note for a list of adjustments to make and issues to check |
| if you plan to upgrade to Flink ML 2.1.0. |
| |
| This note discusses any critical information about incompatibilities and |
| breaking changes, performance changes, and any other changes that might impact |
| your production deployment of Flink ML. |
| |
| |
| * **Flink dependency is changed from 1.14 to 1.15**. |
| |
| This change introduces all the breaking changes listed in the Flink 1.15 |
| [release notes](https://nightlies.apache.org/flink/flink-docs-release-1.15/release-notes/flink-1.15/). |
| |
| # Release Notes and Resources |
| |
| Please take a look at the [release notes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351141) |
| for a detailed list of changes and new features. |
| |
| The source artifacts is now available on the updated |
| [Downloads page](https://flink.apache.org/downloads.html) of the Flink website, |
| and the most recent distribution of Flink ML Python package is available on |
| [PyPI](https://pypi.org/project/apache-flink-ml). |
| |
| # List of Contributors |
| |
| The Apache Flink community would like to thank each one of the contributors |
| that have made this release possible: |
| |
| Yunfeng Zhou, Zhipeng Zhang, huangxingbo, weibo, Dong Lin, Yun Gao, Jingsong Li |
| and mumuhhh. |
| |