The Apache Arrow team is pleased to announce the 0.8.0 release. It is the product of 10 weeks of development and includes 286 resolved JIRAs with many new features and bug fixes to the various language implementations. This is the largest release since 0.3.0 earlier this year.
As part of work toward stabilizing the Arrow format and making a 1.0.0 release sometime in 2018, we made a series of backwards-incompatible changes to the serialized Arrow metadata that require Arrow readers and writers (0.7.1 and earlier) to upgrade in order to be compatible with 0.8.0 and higher. We expect backwards-incompatible changes to be rare going forward.
See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.
We discuss some highlights from the release and other project news in this post.
A growing ecosystem of projects is using Arrow to solve in-memory analytics and data interchange problems. We have added a new Powered By page to the Arrow website where we acknowledge open source projects and companies that are using Arrow. If you would like to add your project to the list as an Arrow user, please let us know.
Since the last release, we have added 5 new Apache committers:
Welcome to the Arrow team, and thank you for your contributions!
Siddharth Teotia led efforts to revamp the Java vector API to make things simpler and faster. As part of this, we removed the dichotomy between nullable and non-nullable vectors.
See Sidd's blog post for more about these changes.
Phillip Cloud led efforts this release to harden details about exact decimal values in the Arrow specification and ensure a consistent implementation across Java, C++, and Python.
Arrow now supports decimals represented internally as a 128-bit little-endian integer, with a set precision and scale (as defined in many SQL-based systems). As part of this work, we needed to change Java's internal representation from big- to little-endian.
We are now integration testing decimals between Java, C++, and Python, which will facilitate Arrow adoption in Apache Spark and other systems that use both Java and Python.
Decimal data can now be read and written by the Apache Parquet C++ library, including via pyarrow.
In the future, we may implement support for smaller-precision decimals represented by 32- or 64-bit integers.
In C++, we have continued developing the new arrow::compute submodule, consisting of native computation functions for Arrow data. New contributor Licht Takeuchi helped expand the supported types for type casting in compute::Cast. We have also implemented new kernels, Unique and DictionaryEncode, for computing the distinct elements of an array and dictionary encoding (conversion to categorical), respectively.
We expect the C++ computation “kernel” library to be a major expansion area for the project over the next year and beyond. Here, we can also implement SIMD- and GPU-accelerated versions of basic in-memory analytics functionality.
As a minor breaking API change in C++, we have made the RecordBatch and Table APIs "virtual" or abstract interfaces, to enable different implementations of a record batch or table that conform to the standard interface. This will help enable features like lazy IO or column loading.
There was significant work improving the C++ library generally and supporting work happening in Python and C. See the change log for full details.
Development of the GLib-based C bindings has generally tracked work happening in the C++ library. These bindings are being used to develop data science tools for Ruby users, among others.
The C bindings now support the Meson build system in addition to autotools, which enables them to be built on Windows.
The Arrow GPU extension library is now also supported in the C bindings.
Brian Hulette and Paul Taylor have been continuing to drive efforts on the TypeScript-based JavaScript implementation.
Since the last release, we made a first JavaScript-only Apache release, version 0.2.0, which is now available on NPM. We decided to make separate JavaScript releases to enable the JS library to release more frequently than the rest of the project.
In addition to some of the new features mentioned above, we have made a variety of usability and performance improvements for integrations with pandas, NumPy, Dask, and other Python projects which may make use of pyarrow, the Arrow Python library.
Some of these improvements include:
* Faster serialization and deserialization of pandas objects with pyarrow.serialize and pyarrow.deserialize. This includes a special pyarrow.pandas_serialization_context which further accelerates certain internal details of pandas serialization
* Support zero-copy reads for pandas.DataFrame using pyarrow.deserialize for objects without Python objects
* Multithreaded conversions from pandas.DataFrame to pyarrow.Table (we already supported multithreaded conversions from Arrow back to pandas)
* New pyarrow.compress and pyarrow.decompress functions for buffer-level data compression
The 0.8.0 release includes some API and format changes, but upcoming releases will focus on completing and stabilizing critical functionality to move the project closer to a 1.0.0 release.
With the ecosystem of projects using Arrow expanding rapidly, we will be working to improve and expand the libraries in support of downstream use cases.
We continue to look for more JavaScript, Julia, R, Rust, and other programming language developers to join the project and expand the available implementations and bindings to more languages.