The Apache Arrow team is pleased to announce the 14.0.0 release. This covers over 3 months of development work and includes 483 resolved issues from [116 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 13.0.0 release, Metehan Yildirim and Oleks V. have been invited to be committers.
Thanks for your contributions and participation in the project!
Motivated by recent innovations in DuckDB and Meta's Velox engine, new “view” data types were added to the Arrow columnar format spec: `Utf8View` and `BinaryView`, as well as `ListView` and `LargeListView`.
A `VariableShapeTensorType` was added to the Arrow specification as a canonical extension type (GH-24868).
Integration testing has been added for the C Data Interface across Arrow implementations, ensuring mutual compatibility (GH-37537). The C++, C# and Go implementations are covered, with Arrow Java soon to come.
A new RPC method was added to allow polling for completion of long-running queries, as an alternative to the blocking `GetFlightInfo` call (GH-36155). Also, an `app_metadata` field was added to `FlightInfo` and `FlightEndpoint` (GH-37635).
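The polling flow can be sketched in a few lines of Python. The `poll_flight_info` method and attribute names below are hypothetical stand-ins rather than the exact client API of any implementation; the key protocol idea is that the server hands back a refreshed descriptor to poll with until the query completes.

```python
import time

def poll_until_complete(client, descriptor, interval=0.5):
    """Poll a long-running query until the server reports completion.

    `client.poll_flight_info` is a hypothetical stand-in for the new
    polling RPC: the returned poll result carries a refreshed descriptor
    while the query is still running, and no descriptor once the final
    FlightInfo is ready.
    """
    poll = client.poll_flight_info(descriptor)
    while poll.flight_descriptor is not None:
        time.sleep(interval)  # back off between polls
        poll = client.poll_flight_info(poll.flight_descriptor)
    return poll.info  # the completed FlightInfo
```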
In C++ and Python, an experimental asynchronous `GetFlightInfo` call was added to the client-side API (GH-36512). `ServerCallContext` now exposes conveniences to send headers and trailers without having to use middleware (GH-36952). The implementation was fixed to no longer reject unknown field tags, enabling interoperability with future versions of Flight that may add new fields (GH-36975). The CMake configuration was fixed to correctly require linking to Arrow Flight RPC when using Arrow Flight SQL (GH-37406).
In Go, the underlying generated Protobuf code is now exposed for easier low-level integrations with Flight (GH-36893).
In Java, the stateful “login” authentication APIs using the Handshake RPC are deprecated; they will not be removed, but they should not be used unless you specifically want the old behavior (GH-37722). Utilities were added to help implement basic Flight SQL services for unit testing (GH-37795).
Experimental APIs for exporting and importing non-CPU arrays using the C Device Data Interface have been added (GH-36488), together with an experimental API for device synchronization (GH-36103).
Initial compatibility with Emscripten without threading support has been added (GH-35176).
New compute functions:

- a `cumulative_mean` function on numeric data (GH-36931).

Improved compute functions:

- the `divide` function now supports duration inputs (GH-36789);
- `take` and `filter` now support sparse unions in addition to dense unions (GH-36905);
- `if_else`, `coalesce`, `choose` and `case_when` now support duration inputs (GH-37028);
- `mean` on integer inputs now uses a floating-point representation for its intermediate sum, avoiding integer overflow on large inputs (GH-34909).

Support for writing encrypted Parquet datasets has been added (GH-29238).
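The new `cumulative_mean` function listed above computes a running mean; its semantics can be sketched in plain Python (an illustration only, not the Arrow kernel):

```python
def cumulative_mean(values):
    """Running mean over a numeric sequence: element i of the output is
    the mean of values[0..i]. Plain-Python sketch of the semantics of
    the new compute function."""
    total = 0.0
    out = []
    for i, v in enumerate(values, start=1):
        total += v
        out.append(total / i)
    return out

cumulative_mean([2, 4, 6])  # -> [2.0, 3.0, 4.0]
```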
Gandiva now supports linking dynamically to LLVM on non-Windows platforms (GH-37410). Previously, Gandiva would always link LLVM statically into `libgandiva`.
RLE is used by default when encoding boolean values if v2 data pages are enabled (GH-36882).
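The intuition behind run-length encoding boolean values can be shown with a short sketch. This is a simplification for illustration; Parquet's actual encoding is a hybrid RLE/bit-packed scheme:

```python
def run_length_encode(bits):
    """Collapse a boolean sequence into (value, run_length) pairs.
    Long runs of identical values -- common in boolean columns --
    compress down to a handful of pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([b, 1])  # start a new run
    return [tuple(r) for r in runs]

run_length_encode([True] * 1000 + [False] * 24)  # -> [(True, 1000), (False, 24)]
```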
Page indexes can now be encrypted as per the specification (GH-34950).
A bug in the DELTA_BINARY_PACKED encoder leading to suboptimal column sizes was fixed (GH-37939).
It is now possible to serialize and deserialize individual expressions using Substrait, not only full query plans (GH-33985).
A new `CodecOptions` class allows customizing compression parameters per codec (GH-35287).
The environment variable `AWS_ENDPOINT_URL` is now respected when resolving S3 URIs (GH-36770).
Recursively listing S3 filesystem trees should now issue fewer requests, improving performance (GH-34213).
Comparing a `ChunkedArray` to itself now behaves correctly in the presence of NaN values (GH-37515).
The use of BMI2 instructions on x86 was incorrectly guarded: those instructions could be executed on platforms without BMI2 support, leading to crashes (GH-37017).
The C# implementation received a number of minor features and fixes.

In Go:

- the minimum supported Go version is now go1.19 instead of go1.17 (GH-37636);
- the default unit for `TimestampType` is seconds (GH-35770);
- the `Concatenate` function was fixed to behave correctly if there is a panic that is recovered (GH-36850);
- fixed `MarshalJSON` on some timestamps (GH-36935);
- `writer.Close()` now propagates `writer.sink.Close()` errors when writing a Parquet file (GH-36645);
- fixes to the `pqarrow` column chunk reader (GH-37845);
- added a `String()` method to `arrow.Table` (GH-35296);
- added `array.Null` type support handling for arrow/csv writing (GH-36623);
- added a `GetOrInsert` function for memo table handling of dictionary builders (GH-36671);
- improvements to the `compute` package (GH-36936);
- added a `ValueLen` function to string arrays (GH-37584);
- added `SetNull(i int)` to array builders (GH-37694);
- improvements to `pqarrow.FileWriter` (GH-35775);
- the `MapOf` and `ListOf` helper functions have been improved to provide clearer error messages and have better documentation (GH-36696);
- the `parquet:"-"` struct tag can now be used to skip fields when converting a struct to a Parquet schema (GH-36793).

Java 21 is enabled and validated in CI (GH-37914).
The Gandiva module implemented a breaking change by moving `Types.proto` into a subfolder (GH-37893).
`DefaultVectorComparators` added support for `LargeVarCharVector` and `LargeVarBinaryVector` (GH-25659), and for `BitVector`, `DateDayVector`, `DateMilliVector`, `Decimal256Vector`, `DecimalVector`, `DurationVector`, `IntervalDayVector`, `TimeMicroVector`, `TimeMilliVector`, `TimeNanoVector`, `TimeSecVector` and `TimeStampVector` (GH-37701).
A bug was fixed in `VectorAppender` to prevent resizing the data buffer twice when appending variable-length vectors (GH-37829).
`VarCharWriter` added support for writing from `Text` and `String` values (GH-37706), and `VarBinaryWriter` added support for writing from `byte[]` and `ByteBuffer` values (GH-37705).
The JDBC driver will now ignore username and password authentication if a token is provided (GH-37073).
A bug was fixed in the Java C-Data interface when importing a vector with an empty array (GH-37056).
A bug was fixed in the S3 file system implementation when closing the connection (GH-36069).
Arrow datasets now support Substrait `ExtendedExpression`s as inputs to filter and project operations (GH-34252).
Compatibility notes:

- `pyarrow.compute.CumulativeSumOptions` has been deprecated; use `pyarrow.compute.CumulativeOptions` instead (GH-36240).

New features:

- `pyarrow.concat_tables` (GH-36845).

Other improvements:

- `pre_buffer` is now set to `True` for reading Parquet when using `pyarrow.dataset` directly. This can give a significant speed-up on filesystems like S3 and is now aligned with the `pyarrow.parquet.read_table` interface (GH-36765);
- `pyarrow.MapScalar.as_py` can now be called with custom field names (GH-36809);
- the `FixedShapeTensorType` string representation now prints the type parameters (GH-35623).

Relevant bug fixes:

- `pyarrow.Table.filter` (GH-37650);
- a `use_threads` keyword was added to the `group_by` method on `pyarrow.Table`, which gets passed through to the `pyarrow.acero.Declaration.to_table` call. Specifying `use_threads=False` makes it possible to get stable ordering of the output (GH-36709);
- `pyarrow.TimestampScalar` when values are outside the datetime range (GH-36323);
- `from_dataframe` of the DataFrame Interchange Protocol (GH-37145).

Further, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details.
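The stable-ordering behavior of a single-threaded group-by can be illustrated with a plain-Python sketch. Insertion-ordered dicts give deterministic, first-seen group order; this shows the semantics only, not Acero's implementation:

```python
def stable_group_by_mean(rows):
    """Group (key, value) pairs and average each group, emitting groups
    in first-seen order -- the deterministic ordering a single-threaded
    group-by guarantees."""
    groups = {}  # Python dicts preserve insertion order
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return [(key, sum(vs) / len(vs)) for key, vs in groups.items()]

stable_group_by_mean([("a", 1), ("b", 3), ("a", 3)])  # -> [("a", 2.0), ("b", 3.0)]
```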
The Arrow documentation is now built with an updated PyData Sphinx Theme, which includes a light/dark theme, new colors from the Accessible Pygments themes, a version switcher dropdown, a search button, and more (GH-36590, GH-32451).
This release of the R package features a substantial refactor of the package configuration, build, and installation. This change should be transparent to most users; however, package contributors can take advantage of a substantially simplified development setup: in most cases, package contributors should be able to use a pre-built nightly version of Arrow C++ in place of a local Arrow development setup. Special thanks to Jacob Wujciak-Jens for taking on this incredible refactor!
In addition to a number of bugfixes and improvements, this release includes several new features related to CSV input/output:

- CSV files can be read and written using `,` or other characters as a decimal point;
- `write_csv_dataset()` was added to better document CSV-specific dataset writing options;
- a `schema` argument can be specified when reading a CSV dataset with partitions.

For more on what’s in the 14.0.0 R package, see the [R changelog][4].
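The decimal-mark option mentioned above boils down to interpreting a different character as the decimal point. A minimal sketch in Python (illustration only, not the Arrow CSV reader; thousands separators are out of scope):

```python
def parse_number(text, decimal_point=","):
    """Parse a numeric string whose decimal mark is `decimal_point`,
    e.g. the European-style "3,14"."""
    return float(text.replace(decimal_point, "."))

parse_number("3,14")  # -> 3.14
```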
In Ruby, support for `ArrowFlight::ClientOptions` was added (GH-37141).

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest [Arrow Rust changelog][5].
[2]: {{ site.baseurl }}/release/14.0.0.html#contributors
[3]: {{ site.baseurl }}/release/14.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/tags