The Apache Arrow team is pleased to announce the 13.0.0 release. This covers over 3 months of development work and includes 456 resolved issues from [108 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 12.0.0 release, Marco Neumann, Gang Wu, Mehmet Ozan Kabak and Kevin Gurney have been invited to be committers. Matt Topol, Jie Wen, Ben Baumgold and Dewey Dunnington have joined the Project Management Committee (PMC).
Thanks for your contributions and participation in the project!
The run-end encoded layout has been added. This layout can allow data with long runs of duplicate values to be encoded and processed efficiently. Initial support has been added for C++ and Go.
An experimental new specification, the C Device Data Interface, has been accepted for inclusion GH-34971. It builds on the existing C Data Interface to provide a runtime-agnostic zero-copy sharing mechanism for Arrow data residing on non-CPU devices.
Reference implementations of the C Device Data Interface will progressively be added to the standard Arrow libraries after the 13.0.0 release.
Support for flagging ordered result sets to clients is now added. (GH-34852)
gRPC 1.30 is now the minimum supported version in C++/Python/R/etc. (GH-34679)
In C++, various methods now receive a full ServerCallContext
(GH-35442, GH-35377) and the context now exposes headers sent by the client (GH-35375).
CMake 3.16 or later is now required for building Arrow C++ GH-34921.
Optimizations are not disabled anymore when the RelWithDebInfo
build type is selected GH-35850. Furthermore, compiler flags can now properly be customized per-build type using ARROW_C_FLAGS_DEBUG
, ARROW_CXX_FLAGS_DEBUG
and related variables GH-35870.
Handling of unaligned buffers is input nodes can be configured programmatically or by setting the environment variable ACERO_ALIGNMENT_HANDLING
. The default behavior is to warn when an unaligned buffer is detected GH-35498.
Several new functions have been added:
Sorting now works on dictionary arrays, with a much better performance than the naive approach of sorting the decoded dictionary GH-29887. Sorting also works on struct arrays, and nested sort keys are supported using FieldRed
GH-33206.
The check_overflow
option has been removed from CumulativeSumOptions
as it was redundant with the availability of two different functions: “cumulative_sum” and “cumulative_sum_checked” GH-35789.
Run-end encoded filters are efficiently supported GH-35749.
Duration types are supported with the “is_in” and “index_in” functions GH-36047. They can be multiplied with all integer types GH-36128.
“is_in” and “index_in” now cast their inputs more flexibly: they first attempt to cast the value set to the input type, then in the other direction if the former fails GH-36203.
Multiple bugs have been fixed in “utf8_slice_codeunits” when the stop
option is omitted GH-36311.
A custom schema can now be passed when writing a dataset GH-35730. The custom schema can alter nullability or metadata information, but is not allowed to change the datatypes written.
The S3 filesystem now writes files in equal-sized chunks, for compatibility with Cloudflare's “R2” Storage GH-34363.
A long-standing issue where S3 support could crash at shutdown because of resources still being alive after S3 finalization has been fixed GH-36346. Now, attempts to use S3 resources (such as making filesystem calls) after S3 finalization should result in a clean error.
The GCS filesystem accepts a new option to set the project id GH-36227.
Nullability and metadata information for sub-fields of map types is now preserved when deserializing Arrow IPC GH-35297.
The Orc adapter now maps Arrow field metadata to Orc type attributes when writing, and vice-versa when reading GH-35304.
It is now possible to write additional metadata while a ParquetFileWriter
is open GH-34888.
Writing a page index can be enabled selectively per-column GH-34949. In addition, page header statistics are not written anymore if the page index is enabled for the given column GH-34375, as the information would be redundant and less efficiently accessed.
Parquet writer properties allow specifying the sorting columns GH-35331. The user is responsible for ensuring that the data written to the file actually complies with the given sorting.
CRC computation has been implemented for v2 data pages GH-35171. It was already implemented for v1 data pages.
Writing compliant nested types is now enabled by default GH-29781. This should not have any negative implication.
Attempting to load a subset of an Arrow extension type is now forbidden GH-20385. Previously, if an extension type's storage is nested (for example a “Point” extension type backed by a struct<x: float64, y: float64>
), it was possible to load selectively some of the columns of the storage type.
Support for various functions has been added: “stddev”, “variance”, “first”, “last” (GH-35247, GH-35506).
Deserializing sorts is now supported GH-32763. However, some features, such as clustered sort direction or custom sort functions, are not implemented.
FieldRef
sports additional methods to get a flattened version of nested fields GH-14946. Compared to their non-flattened counterparts, the methods GetFlattened
, GetAllFlattened
, GetOneFlattened
and GetOneOrNoneFlattened
combine a child‘s null bitmap with its ancestors’ null bitmaps such as to compute the field's overall logical validity bitmap.
In other words, given the struct array [null, {'x': null}, {'x': 5}]
, FieldRef("x")::Get
might return [0, null, 5]
while FieldRef("y")::GetFlattened
will always return [null, null, 5]
.
Scalar::hash()
has been fixed for sliced nested arrays GH-35360.
A new floating-point to decimal conversion algorithm exhibits much better precision GH-35576.
It is now possible to cast between scalars of different list-like types GH-36309.
The C Data Interface is now supported in the .NET Apache.Arrow library. The main entry points are CArrowArrayImporter.ImportArray
, CArrowArrayExporter.ExportArray
, CArrowArrayStreamImporter.ImportArrayStream
, and CArrowArrayStreamExporter.ExportArrayStream
in the Apache.Arrow.C
namespace. (GH-33856, GH-33857, GH-36120, and GH-35809).
ArrowBuffer.BitmapBuilder
adds Append(ReadOnlySpan<byte> source, int validBits)
and AppendRange(bool value, int length)
to improve performance of array concatenation (GH-32605)
AppendValueFromString
for extension types and properly reads empty values as null (GH-35188 and GH-35190)MapType.ValueField
and MapType.ValueType
are now deprecated in favor of MapType.Elem().(*StructType)
(GH-35909)array.ArraySliceEqual
in favor of array.SliceEqual
) (GH-36198)ValueStr
method on Timestamp arrays now includes the zone in the output (GH-36568)FixedSizeListBuilder.AppendNull
no longer requires manually appending nulls to the underlying list (GH-35482)Fields
method for Schema and StructType now returns a copy of the slice to ensure immutability (GH-35306 and GH-35866)array.ApproxEqual
for Maps now allows entries for a given element to be presented in any order (GH-35828)ListOf(types)
if null (GH-35684)The JNI bindings for Arrow Dataset now support execute Substrait plans via the Acero query engine. (GH-34223)
Arrow packages that depend on Netty (most notably, arrow-memory-netty
, but also Arrow Flight) now require either Netty < 4.1.94.Final or Netty >= 4.1.96.Final. In Netty versions 4.1.94.Final and 4.1.95.Final, there was a breaking change in an internal API that affected Arrow; this was reverted in 4.1.96.Final (GH-36209, GH-36928)
VectorSchemaRoot#slice
now always makes a copy, including when the slice covers all rows (previously it did not make a copy in this case). This is a potentially-breaking change if your application depended on the old behavior. (GH-35275)
Debug info for allocations is no longer automatically enabled when assertions are enabled (e.g. when running unit tests). Instead, support must be explicitly enabled. This is not quite a breaking change, but may be surprising if you are used to using this information while debugging tests. However, performance should be greatly improved while running tests. (GH-34338)
Support for the upcoming Java 21 was added, though we do not yet test this in CI (GH-5053). The JNI bindings for Arrow Dataset now expose JSON support (GH-36421). Dictionary replacement is now supported when writing the IPC stream format (GH-18547).
Compatibility notes:
New features:
keys_sorted
property of MapType is now exposed GH-35112Other improvements:
Table
and RecordBatch
classes has been consolidated ( GH-36129, GH-35415, GH-35390, GH-34979, GH-34868, GH-31868)FixedShapeTensorType
has been improved (__reduce__
GH-36038, picklability GH-35599)array
constructor GH-21761pc.map_lookup
/ MapLookupOptions
is improved GH-36045zero_copy_only
keyword can now also be accepted in ChunkedArray.to_numpy()
GH-34787Relevant bug fixes:
__array__
numpy conversion for Table and RecordBatch is now corrected so that np.asarray(pa.Table)
doesn’t return a transposed result GH-34886parquet.write_to_dataset
doesn't create empty files for non-observed dictionary (category) values anymore GH-23870pyarrow.dataset.Partitioning
with other type is now correctly handled GH-36659pa.FixedShapeTensorArray.to_numpy_ndarray
is not failing on sliced arrays GH-35573__from_arrow__
is now also implemented for Array.to_pandas
for pandas extension data types GH-36096New features:
open_dataset()
now works with ND-JSON files GH-35055schema()
on multiple Arrow objects now returns the object's schema GH-35543.by
/by
argument now supported in arrow implementation of dplyr verbs GH-35667Other improvements:
arrow_array()
can be used to create Arrow Arrays GH-36381scalar()
can be used to create Arrow Scalars GH-36265RecordBatchReader::ReadNext()
from DuckDB from the main R thread GH-36307set_io_thread_count()
with num_threads
< 2 GH-36304strptime()
in arrow will return a timezone-aware timestamp if %z
is part of the format string GH-35671group_by()
and across()
now matches dplyr GH-35473For more on what’s in the 13.0.0 R package, see the [R changelog][4].
CallExpression.new
GH-35915#select_columns
GH-35681Expression.try_convert
GH-35915Parquet::ArrowFileReader#each_row_group
GH-36008GAFlightServerCallContext
object in several methods of GAFlightServerCustomAuthHandler
GH-35377GArrowRunEndEncodedDataType
GH-35417GArrowRunEndEncodedArray
GH-35418The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest [Arrow Rust changelog][5].
[2]: {{ site.baseurl }}/release/13.0.0.html#contributors [3]: {{ site.baseurl }}/release/13.0.0.html#changelog [4]: {{ site.baseurl }}/docs/r/news/ [5]: https://github.com/apache/arrow-rs/tags