The Apache Arrow team is pleased to announce the 0.15.0 release. This covers about 3 months of development work and includes 687 resolved issues from 80 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The complete changelog is also available.
About a third of issues closed (240) were classified as bug fixes, so this release brings many stability, memory use, and performance improvements over 0.14.x. We will discuss some of the language-specific improvements and new features below.
Since the 0.14.0 release, we've added four new committers:
In addition, Sebastien Binet and Micah Kornfield have joined the PMC.
Thank you for all your contributions!
The format gets new datatypes : LargeList(ARROW-4810), LargeBinary and LargeString (ARROW-750). LargeList is similar to List but with 64-bit offsets instead of 32-bit. The same relationship holds for LargeBinary and LargeString with respect to Binary and String.
Since the last major release, we have also made a significant overhaul of the columnar format documentation to be clearer and easier to follow for implementation creators.
The Arrow community has decided to make a 1.0.0 release of the project marking formal stability of the columnar format and binary protocol, including explicit forward and backward compatibility guarantees. You can read about these guarantees in the new documentation page about versioning.
Starting with 1.0.0, we will give the columnar format and libraries separate version numbers. This will allow the library versions to evolve without creating confusion or uncertainty about whether the Arrow columnar format remains stable or not.
Since 0.14.0 we have modified the IPC “encapsulated message” format to insert 4 bytes of additional data in the message preamble to ensure that the Flatbuffers metadata starts on an aligned offset. By default, IPC streams generated by 0.15.0 and later will not be readable by library versions 0.14.1 and prior. Implementations have offered options to write messages using the now “legacy” message format.
For users who cannot upgrade to version 0.15.0 in all parts of their system, such as Apache Spark users, we recommend one of the two routes:
ARROW_PRE_0_15_IPC_FORMAT=1
when using 0.15.0 and sending data to an old libraryWe do not anticipate making this kind of change again in the near future and would not have made such a non-forward-compatible change unless we deemed it very important.
A GetFlightSchema method is added to the Flight RPC protocol (ARROW-6094). As the name suggests, it returns the schema for a given Flight descriptor on the server. This is useful for cases where the Flight locations are not immediately available, depending on the server implementation.
Flight implementations for C++ and Java now implement half-closed semantics for DoPut (ARROW-6063). The client can close the writing end of the stream to signal that it has finished sending the Flight data, but still receive the batch-specific response and its associated metadata.
C++ now supports the LargeList, LargeBinary and LargeString datatypes.
The Status class gains the ability to carry additional subsystem-specific data with it, under the form of an opaque StatusDetail interface (ARROW-4036). This allows, for example, to store not only an exception message coming from Python but the actual Python exception object, such as to raise it again if the Status is propagated back to Python. It can also enable the consumer of Status to inspect the subsystem-specific error, such as a finer-grained Flight error code.
DataType and Schema equality are significantly faster (ARROW-6038).
The Column class is completely removed, as it did not have a strong enough motivation for existing between ChunkedArray and RecordBatch / Table (ARROW-5893).
The 0.15 release includes many improvements to the Apache Parquet C++ internals, resulting in greatly improved read and write performance. We described the work and published some benchmarks in a recent blog post.
The CSV reader is now more flexible in terms of how column names are chosen (ARROW-6231) and column selection (ARROW-5977).
Arrow now has the option to allocate memory using the mimalloc memory allocator. jemalloc is still preferred for best performance, but mimalloc is a reasonable alternative to the system allocator on Windows where jemalloc is not currently supported.
Also, we now expose explicit global functions to get a MemoryPool for each of the jemalloc allocator, mimalloc allocator and system allocator (ARROW-6292).
The vendored jemalloc version is bumped from 4.5.x to 5.2.x (ARROW-6549). Performance characteristics may differ on memory allocation-heavy workloads, though we did not notice any significant regression on our suite of micro-benchmarks (and a multi-threaded benchmark of reading a CSV file showed a 25% speedup).
A FileSystem implementation to access Amazon S3-compatible filesystems is now available. It depends on the AWS SDK for C++.
Significant improvements were made to the Arrow I/O stack.
There are three improvements of Tensor and SparseTensor in this release.
We have fixed some bugs causing incompatibilities between C# and other Arrow implementations.
The FileSystem API, implemented in C++, is now available in Python (ARROW-5494).
The API for extension types has been straightened and definition of custom extension types in Python is now more powerful (ARROW-5610).
Sparse tensors are now available in Python (ARROW-4453).
A potential crash when handling Python dates and datetimes was fixed (ARROW-6597).
Based on a mailing list discussion, we are looking for help with maintaining our Python wheels. Community members have found that the wheels take up a great deal of maintenance time, so if you or your organization depend on pip install pyarrow
working, we would appreciate your assistance.
Ruby and C GLib continues to follow the features in the C++ project. Ruby includes the following backward incompatible changes.
Ruby improves the performance of Arrow#values.
A number of core Arrow improvements were made to the Rust library.
Improvements related to Rust Parquet and DataFusion are detailed next.
A major development since the 0.14 release was the arrival of the arrow
R package on CRAN. We wrote about this in August on the Arrow blog. In addition to the package availability on CRAN, we also published package documentation on the Arrow website.
The 0.15 R package includes many of the enhancements in the C++ library release, such as the Parquet performance improvements and the FileSystem API. In addition, there are a number of upgrades that make it easier to read and write data, specify types and schema, and interact with Arrow tables and record batches in R.
For more on what's in the 0.15 R package, see the changelog.
There are a number of active discussions ongoing on the developer dev@arrow.apache.org mailing list. We look forward to hearing from the community there.