The Rust implementation of Apache Arrow has just released version 9.0.2.
While a major version number this high may shock some in the Rust community, to whom it implies a slow-moving, 20-year-old piece of software, nothing could be further from the truth!
With regular and predictable bi-weekly releases, the library continues to evolve rapidly, and 9.0.2 is no exception. Some recent highlights:
parquet: async, performance, safety and nested types
The parquet 9.0.2 release includes an async reader, a long-requested feature. Using the async reader, it is now possible to read only the relevant parts of a parquet file from a networked source such as object storage. Previously, the entire file had to be buffered locally. We are hoping to add an async writer in a future release and would love some help.
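To make the idea of reading only the relevant parts concrete, here is a minimal sketch that uses only the Rust standard library, not the parquet crate's API. It relies on the Parquet file layout, in which a file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`, so a reader can locate the file metadata by fetching just the last 8 bytes. The names `metadata_range` and `fake_file` are hypothetical, and an in-memory `Cursor` stands in for a remote object.

```rust
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// A Parquet file ends with a 4-byte little-endian metadata length
/// followed by the 4 magic bytes "PAR1".
const FOOTER_LEN: u64 = 8;
const MAGIC: &[u8] = b"PAR1";

/// Hypothetical helper: returns the (start, end) byte range of the file
/// metadata while reading only the final 8 bytes of the source.
fn metadata_range<R: Read + Seek>(src: &mut R) -> io::Result<(u64, u64)> {
    let file_len = src.seek(SeekFrom::End(0))?;
    if file_len < FOOTER_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "file too small"));
    }

    // Fetch just the footer instead of buffering the whole file.
    src.seek(SeekFrom::Start(file_len - FOOTER_LEN))?;
    let mut footer = [0u8; 8];
    src.read_exact(&mut footer)?;
    if &footer[4..] != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic bytes"));
    }

    let meta_len = u64::from(u32::from_le_bytes([
        footer[0], footer[1], footer[2], footer[3],
    ]));
    let end = file_len - FOOTER_LEN;
    if meta_len > end {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad metadata length"));
    }
    Ok((end - meta_len, end))
}

/// Hypothetical test fixture: magic, `data_len` bytes of "column data",
/// `meta_len` bytes of "metadata", then the footer.
fn fake_file(data_len: usize, meta_len: u32) -> Vec<u8> {
    let mut bytes = MAGIC.to_vec();
    bytes.extend(std::iter::repeat(0u8).take(data_len));
    bytes.extend(std::iter::repeat(0xABu8).take(meta_len as usize));
    bytes.extend_from_slice(&meta_len.to_le_bytes());
    bytes.extend_from_slice(MAGIC);
    bytes
}

fn main() -> io::Result<()> {
    let mut src = Cursor::new(fake_file(100, 16));
    let (start, end) = metadata_range(&mut src)?;
    println!("metadata occupies bytes {start}..{end}");
    Ok(())
}
```

With the metadata in hand, a reader knows where each column chunk lives and can request only those byte ranges. Over object storage, each such seek-and-read would presumably become a ranged request.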
It is also significantly faster to read parquet data (up to 60x in some cases) than with previous versions of the parquet crate. Kudos to tustvold and yordan-pavlov for their contributions in these areas.
With 8.0.0 and later, the code that reads and writes RecordBatches to and from Parquet now supports all types, including deeply nested structs and lists. Thanks helgikrs for cleaning up the last corner cases!
Another notable recent addition to parquet is UTF-8 validation on string data, for improved security against malicious inputs.
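As a rough illustration of why this matters, the sketch below uses only the standard library (it is not the parquet crate's internal code). Bytes read from an untrusted file are checked before being exposed as `&str`. The helper name `checked_str` is hypothetical.

```rust
use std::str;

/// Hypothetical helper: validate raw bytes that a file claims are a
/// string value before handing them out as `&str`.
fn checked_str(raw: &[u8]) -> Result<&str, str::Utf8Error> {
    // `from_utf8` scans the bytes and rejects invalid sequences.
    // Skipping this check via `from_utf8_unchecked` on untrusted input
    // would be undefined behavior.
    str::from_utf8(raw)
}

fn main() {
    // Well-formed UTF-8 passes through unchanged.
    assert_eq!(checked_str("café".as_bytes()), Ok("café"));

    // 0xFF can never appear in valid UTF-8, so this input is rejected,
    // and the error reports how many leading bytes were valid.
    let err = checked_str(&[b'o', b'k', 0xFF]).unwrap_err();
    assert_eq!(err.valid_up_to(), 2);
}
```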
Planned upcoming work includes pushing more filtering directly into the parquet scan as well as an async writer.
arrow: performance, dyn kernels, and DecimalArray
The compute kernels have been improved significantly in arrow 9.0.2. Some filter benchmarks are twice as fast and the SIMD kernels are also significantly faster. Many thanks to tustvold and jhorstmann. Additional substantial improvements are likely to land in arrow 10.0.0.
We are working on a new set of “dynamic” dyn_ kernels (for example, eq_dyn) that make it easier to invoke the heavily optimized kernels provided by the arrow crate. Work is underway to expand the breadth of types supported by these new kernels to make them even more useful. Thanks to matthewmturner and viirya for their help in this effort.
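The general shape of a “dynamic” kernel can be sketched with the standard library alone. This is a toy illustration of the idea: one entry point inspects the runtime type of its inputs and forwards to a statically typed kernel. It is not the arrow crate's implementation, and the `Array` trait, the two array structs, and `eq_dyn_sketch` are all hypothetical stand-ins.

```rust
use std::any::Any;

/// Hypothetical stand-in for a type-erased array.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

struct Int32Array(Vec<i32>);
struct StringArray(Vec<String>);

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for StringArray {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Statically typed kernel: element-wise equality over two slices.
fn eq_typed<T: PartialEq>(left: &[T], right: &[T]) -> Result<Vec<bool>, String> {
    if left.len() != right.len() {
        return Err("arrays must have the same length".to_string());
    }
    Ok(left.iter().zip(right).map(|(a, b)| a == b).collect())
}

/// Dynamic entry point: callers pass type-erased arrays, and the
/// dispatch to the matching typed kernel happens at runtime.
fn eq_dyn_sketch(left: &dyn Array, right: &dyn Array) -> Result<Vec<bool>, String> {
    let (l, r) = (left.as_any(), right.as_any());
    if let (Some(a), Some(b)) = (l.downcast_ref::<Int32Array>(), r.downcast_ref::<Int32Array>()) {
        return eq_typed(&a.0, &b.0);
    }
    if let (Some(a), Some(b)) = (l.downcast_ref::<StringArray>(), r.downcast_ref::<StringArray>()) {
        return eq_typed(&a.0, &b.0);
    }
    Err("unsupported or mismatched array types".to_string())
}

fn main() {
    let a = Int32Array(vec![1, 2, 3]);
    let b = Int32Array(vec![1, 0, 3]);
    assert_eq!(eq_dyn_sketch(&a, &b), Ok(vec![true, false, true]));

    // Mismatched types are reported as an error rather than a panic.
    let s = StringArray(vec!["x".to_string()]);
    assert!(eq_dyn_sketch(&a, &s).is_err());
}
```

The appeal of this design is that callers holding type-erased arrays no longer need to write the downcasting boilerplate themselves. Each newly supported type adds one more dispatch arm inside the kernel.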
While arrow has had basic support for DecimalArray since version 3.0.0, support for the Decimal type has been expanded in calculation kernels such as sort, take and filter, thanks to some great contributions from liukun4515. There is ongoing work to improve the API ergonomics and performance of DecimalArray as well.
The 6.4.0 release resolved the last outstanding RUSTSEC advisory on the arrow crate, and the 8.0.0 release resolved the last outstanding known security issues. While these security issues were mostly limited to misuse of the low-level “power user” APIs that most users do not (and should not) use, it was good to tighten up that area.
Now that arrow-rs is releasing major versions every other week, we are also able to update dependencies at the same pace, helping to ensure that security fixes upstream can flow more quickly to downstream projects.
It takes a community to build great software, and we would like to thank everyone who has contributed to the arrow-rs repository since the 7.0.0 release:
git shortlog -sn 7.0.0..9.0.0
    22  Raphael Taylor-Davies
    18  Andrew Lamb
     6  Helgi Kristvin Sigurbjarnarson
     6  Remzi Yang
     5  Jörn Horstmann
     4  Liang-Chi Hsieh
     3  Jiayu Liu
     2  dependabot[bot]
     2  Yijie Shen
     1  Matthew Turner
     1  Kun Liu
     1  Yang
     1  Edd Robinson
     1  Patrick More
If you are interested in contributing to the Rust subproject in Apache Arrow, you can find a list of open issues suitable for beginners here and the full list here.
Other ways to get involved include trying out Arrow on some of your data and filing bug reports, and helping to improve the documentation.