---
title: "Apache Hudi 2022 - A year in Review"
excerpt: "2022 was the best year for Apache Hudi yet! Huge thank you to everyone who contributed!"
author: Sivabalan Narayanan
category: blog
image: /assets/images/blog/Apache-Hudi-2022-Review.png
tags:
---
<img src="/assets/images/blog/Apache-Hudi-2022-Review.png" alt="drawing" style={{width:'80%', display:'block', marginLeft:'auto', marginRight:'auto'}} />
As we wrap up 2022, I want to take the opportunity to reflect on and highlight the incredible progress of the Apache Hudi project and, most importantly, the community. First and foremost, I want to thank all of the contributors who have made 2022 the best year for the project ever. There were over 2,200 PRs created (+38% YoY) and more than 600 users engaged on GitHub. The Apache Hudi community Slack channel has grown to more than 2,600 users (+100% YoY), averaging nearly 200 messages per month! The most impressive stat is that, even with this growth in volume, the median response time to questions is ~3h. Come join the community where people are sharing and helping each other!
<img src="/assets/images/blog/Apache-Hudi-Pull-Request-History.png" alt="drawing" style={{width:'80%', display:'block', marginLeft:'auto', marginRight:'auto'}} />
2022 has been a year jam-packed with exciting new features for Apache Hudi across the 0.11.0 and 0.12.0 releases. In addition to new features, vendor and ecosystem partnerships have been strengthened across the community. AWS continues to double down on Apache Hudi, upgrading versions in EMR, Athena, and Redshift, and announcing a new native connector inside Glue. Presto and Trino merged native Hudi connectors for interactive analytics. DBT, Confluent, Datahub, and several others have added support for Hudi tables. While Google has supported Hudi for a while in BigQuery and Dataproc, it also announced plans to add Hudi in BigLake. The first tutorial for Hudi on Azure Synapse Analytics was published.
While there are too many features added in 2022 to list them all, take a look at some of the exciting highlights:
Apache Hudi is a global community, and thankfully we live in a world today that empowers virtual collaboration and productivity. In addition to connecting virtually, this year we have seen the Apache Hudi community gather in person at many events, including Re:Invent, Data+AI Summit, Flink Forward, Alluxio Day, Data Council, PrestoCon, Confluent Current, DBT Coalesce, Cinco de Trino, Data Platform Summit, and many more.
<img src="/assets/images/blog/Apache-Hudi-Conferences.png" alt="drawing" style={{width:'80%', display:'block', marginLeft:'auto', marginRight:'auto'}} />
You don’t have to travel far to meet and collaborate with the Hudi community. We hold monthly virtual meetups, weekly office hours, and there are plenty of friendly faces on Hudi Slack who like to talk shop. Join us via Zoom for the next Hudi meetup!
A wide diversity of organizations around the globe use Apache Hudi as the foundation of their production data platforms. More than 800 organizations have engaged with Hudi (up 60% YoY). Here are a few highlights of content written by the community sharing their experiences, designs, and best practices:
Thanks to the strength of the community, Apache Hudi has a bright future for 2023. Check out this recording from our Re:Invent meetup where Vinoth Chandar talks about exciting new features to expect in 2023.
0.13.0 will be the next major release, with a package of exciting new features. Here are a few highlights:
The long-term vision of Apache Hudi is to make streaming data lakes mainstream, achieving sub-minute commit SLAs with stellar query performance and incremental ETLs. We plan to harden the indexing subsystem with Table APIs for easy integration with query engines and access to Hudi metadata and indexes; Indexing Functions and a Federated Storage Layer to eliminate the notion of partitions and reduce I/O; and new secondary indexes. To realize fast queries, we will provide the option of a standalone MetaServer that serves Hudi metadata to plan queries in milliseconds, and a Hudi-aware lake cache that speeds up the read performance of MOR tables along with fast writes for updates. Incremental and streaming SQL will be enhanced in Spark and Flink. For Hudi on Flink, we plan to make multi-modal indexing production-ready, bring read and write compatibility between the Flink and Spark engines, and harden the streaming capabilities, including CDC, streaming ETL semantics, pre-aggregation models, and materialized views.
Check out the Hudi Roadmap for more to come in 2023!
If you haven't tried Apache Hudi yet, 2023 is your year! Here are a few useful links to help you get started:
If you enjoyed Hudi in 2022, don't forget to give it a little star on GitHub ⭐