The Apache Arrow team is pleased to announce the 9.0.0 release. This covers over 3 months of development work and includes 509 resolved issues from [114 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bug fixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 8.0.0 release, Dewey Dunnington, Alenka Frim and Rok Mihevc have been invited to be committers. Thanks for your contributions and participation in the project!
Arrow Flight is now available in macOS M1 Python wheels (ARROW-16779). Arrow Flight SQL is now buildable on Windows (ARROW-16902). Ruby now exposes more of the Flight and Flight SQL APIs (various JIRAs).
AlmaLinux 9 is now supported. (ARROW-16745)
AmazonLinux 2 aarch64 is now supported. (ARROW-16477)
STL-like iteration is now provided over chunked arrays (ARROW-602).
The C++ compute and execution engine is now officially named “Acero”, though its C++ namespaces have not changed.
New light-weight data holder abstractions have been introduced in order to reduce the overhead of invoking compute functions and kernels, especially at the small data sizes desirable for efficient parallelization (typically L1- or L2-sized). Specifically, the non-owning `ArraySpan` and `ExecSpan` structures have internally superseded the much heavier `ExecBatch`, which is still supported for compatibility at the API level (ARROW-16756, ARROW-16824, ARROW-16852).
In a similar vein, the `ValueDescr` class was removed, and `ScalarKernel` implementations now always receive at least one non-scalar input, removing the special case where a `ScalarKernel` needed to output a scalar rather than an array. The higher-level compute APIs still allow executing a scalar function over all-scalar inputs; those scalars are internally broadcast to 1-element arrays so as to simplify kernel implementations (ARROW-16757).
The hash join exec node has been improved to use the CPU cache more efficiently and to make better use of available vectorization hardware (ARROW-14182). These performance improvements require no additional configuration.
Some plans containing a sequence of hash join operators will now use bloom filters to eliminate rows earlier in the plan, reducing the overall CPU cost of the plan (ARROW-15498).
Timestamp comparison is now supported (ARROW-16425).
A cumulative sum function is implemented over numeric inputs (ARROW-13530). Note that this is a vector function, so it cannot be used in an Acero ExecPlan.
A rank vector kernel has been added (ARROW-16234).
Temporal rounding functions received additional options to control how rounding is done (ARROW-14821).
Improper computation of the “mode” function on boolean input was fixed (ARROW-17096).
Function registries can now be nested (ARROW-16677).
The `autogenerate_column_names` option for CSV reading is now handled correctly (ARROW-16436).
Fix `InMemoryDataset::ReplaceSchema` to actually replace the schema (ARROW-16085).
Fix `FilenamePartitioning` to properly support null values (ARROW-16302).
A number of bug fixes and improvements were made to the Google Cloud Storage filesystem implementation (ARROW-14892).
By default, the S3 filesystem implementation does not create or drop buckets anymore (ARROW-15906). This is a compatibility-breaking change intended to prevent user errors from having potentially catastrophic consequences. Options have been added to restore the previous behavior if necessary.
The default Parquet version is now 2.4 for writing, enabling use of more recent logical types by default (ARROW-12203).
Non-nullable fields are now handled correctly by the Parquet reader (ARROW-16116).
Reading encrypted files should now be thread-safe (ARROW-14114).
Statistics equality now handles min/max values correctly (ARROW-16487).
The minimum Thrift version required for building is now 0.13 (ARROW-16721).
The Thrift deserialization limits can now be configured to accommodate data files with very large metadata (ARROW-16546).
The Substrait spec has been updated to 0.6.0 (ARROW-16816). In addition, a larger subset of the Substrait specification is now supported (ARROW-15587, ARROW-15590, ARROW-15901, ARROW-16657, ARROW-15591).
`tableFromIPC` now accepts an async `RecordBatchReader` as input (ARROW-16704).

Compatibility notes:
PyArrow now requires Python >= 3.7 (ARROW-16474).
The default behaviour regarding memory mapping has changed in several APIs (reading of Feather or Parquet files, IPC RecordBatchFileReader and RecordBatchStreamReader) to disable memory mapping by default (ARROW-16382).
The default Parquet version is now 2.4 for writing, enabling use of more recent logical types, such as unsigned integers, by default (ARROW-12203). One can specify `version="2.6"` to also enable support for nanosecond timestamps, or `version="1.0"` to restore the old behaviour and maximize file compatibility.
Some deprecated APIs (deprecated at least since pyarrow 1.0.0) have been removed: the IPC methods in the top-level namespace, the `Value` scalar classes, and the `pyarrow.compat` module (ARROW-17010).
New features:
Google Cloud Storage (GCS) File System support is now available in the Python bindings (ARROW-14892).
The `Table.filter()` method now supports passing an expression in addition to a boolean array (ARROW-16469).
When implementing extension types in Python, it is now possible to also customize which Python scalar gets returned (in `Array.to_pylist()` or `Scalar.as_py()`) by subclassing `ExtensionScalar` (ARROW-13612, ARROW-17065).
It is now possible to register user-defined functions (UDFs) for scalar functions using `register_scalar_function` (ARROW-15639).
Basic support for consuming a Substrait plan has been exposed in Python as `pyarrow.substrait.run_query` (ARROW-15779).
The `cast` method and compute kernel now expose the fine-grained options in addition to safe/unsafe casting (ARROW-15365).
In addition, this release includes several bug fixes and documentation improvements (such as expanded examples in docstrings (ARROW-16091)).
Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.
Highlights include several new `dplyr` verbs, including `glimpse()` and `union_all()`, as well as many more datetime functions from `lubridate`. There is also experimental support for user-defined scalar functions in the query engine, and most packages include native support for datasets in Google Cloud Storage (opt-in in the Linux full source build).
For more on what’s in the 9.0.0 R package, see the [R changelog][4].
Flight SQL is now supported, though only a minimal set of features is available for now.
More Flight features are now supported.
`Enumerable`-compatible methods such as `#min` and `#max` on `Arrow::Array`, `Arrow::ChunkedArray` and `Arrow::Column` are now implemented with the C++ [compute functions]({{ site.baseurl }}/docs/cpp/compute.html), improving performance. (ARROW-15222)
This release fixed some memory leaks. (ARROW-14790)
This release improved support for interval type arrays such as `Arrow::MonthIntervalArray`. (ARROW-16206)
This release improved auto data type conversion. (ARROW-16874)
Vala is now supported (ARROW-15671). See `c_glib/example/vala/` for examples.
`GArrowQuantileOptions` is added. (ARROW-16623)
The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 19.0.0 release of the Rust implementation, see the [Arrow Rust changelog][5].
[2]: {{ site.baseurl }}/release/9.0.0.html#contributors
[3]: {{ site.baseurl }}/release/9.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/blob/19.0.0/CHANGELOG.md