The Apache Arrow team is pleased to announce the 2.0.0 release. This covers over 3 months of development work and includes 511 resolved issues from [81 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 1.0.0 release, [Jorge Leitão][5] has been added as a committer. Thank you for your contributions!
As this is the first major release since 1.0.0, we remind everyone that we have moved to a “split” versioning system where the Library version (which is now 2.0.0) will evolve separately from the Format version (which is still 1.0.0). Major releases of the libraries may contain non-backward-compatible API changes, but they will not contain any incompatible format changes. See the [Versioning and Stability][6] page in the documentation for more.
The columnar format metadata has been updated to permit 256-bit decimal values in addition to 128-bit decimals. This change is backward and forward compatible.
For Arrow Flight, 2.0.0 mostly brings bugfixes. In Java, some memory leaks in `FlightStream` and `DoPut` have been addressed. In C++ and Python, a deadlock has been fixed in an edge case. Additionally, when supported by gRPC, TLS verification can be disabled.
Parquet reading now fully supports round-tripping arbitrarily nested data, including extension types with a nested storage type. In the process, several bugs in writing nested data and `FixedSizeList` were fixed. If you write data with these types, we recommend upgrading to this release and validating previously written data, as there is a potential for data loss.
Datasets can now be written with partitions; see the [complete changelog][3] for other notable features in the release.
The .NET package has added a number of new features this release:

* Full support for `Struct` types.
* Synchronous write APIs for `ArrowStreamWriter` and `ArrowFileWriter`. These are complementary to the existing async write APIs and can be used in situations where the async APIs can't be used.
* The ability to use `DateTime` instances with `Date32Array` and `Date64Array`.
The Java package has gained a number of new features. Users can validate vectors across a wider range of aspects, at the cost of extra time. In dictionary encoding, dictionary indices can now be expressed as unsigned integers. A framework for data compression has been set up for IPC.

The calculation of vector capacity has been simplified, so users should see notable performance improvements in the various `setSafe` methods.

Bugs in the JDBC adapter, sort algorithms, and `ComplexCopier` have been resolved to make them more usable.
The JavaScript package upgrades Arrow's build to use TypeScript 3.9, fixing the generated `.d.ts` typings.
Parquet reading in Python now supports round-tripping arbitrarily nested data, and extension types with a nested storage type also round trip through Parquet. Several bugs in writing nested data and `FixedSizeList` were fixed. If you wrote data with these types using a previous release, we recommend upgrading to 2.0.0 and validating the old data, as there is a potential for data loss.
The `pyarrow.filesystem` submodule is deprecated in favor of the new filesystem implementations in `pyarrow.fs`.
The custom serialization functionality (`pyarrow.serialize()`, `pyarrow.deserialize()`, etc.) is deprecated. Those functions provided a Python-specific (not cross-language) serialization format which was not compatible with the standardized Arrow IPC serialization format. For arbitrary objects, you can use the standard library `pickle` functionality instead. For pyarrow objects, you can use the IPC serialization format through the `pyarrow.ipc` module.
The `pyarrow.compute` module now has complete coverage of the available C++ compute kernels in the Python API, and several new kernels have been added.
The `pyarrow.dataset` module was further improved. In addition to reading, it is now also possible to write partitioned datasets (with `write_dataset()`).
The Arrow <-> Python conversion code was refactored, fixing several bugs and corner cases.
Conversion of an array of `pyarrow.MapType` to pandas has been added.
Timezone-aware datetimes now round trip to and from pyarrow arrays (including conversions through pandas) while preserving the timezone. To get the old behavior (e.g. for Spark), set the environment variable `PYARROW_IGNORE_TIMEZONE` to a truthy value (e.g. `PYARROW_IGNORE_TIMEZONE=1`).
The R package benefits from the various improvements in the C++ library listed above, including the ability to read and write Parquet files with nested struct and list types. For more on what’s in the 2.0.0 R package, including its other highlights, see the [R changelog][4].
In the Ruby bindings, `Arrow::Table#save` now uses the number of rows as the default `chunk_size` parameter when the table is saved to a Parquet file.
The GLib bindings newly support `GArrowStringDictionaryArrayBuilder` and `GArrowBinaryDictionaryArrayBuilder`.

Moreover, the GLib bindings support new accessors on `GArrowListArray` and `GArrowLargeListArray`: `get_values`, `get_value_offset`, `get_value_length`, and `get_value_offsets`.
Due to the high volume of activity in the Rust subproject in this release, we're writing a separate blog post dedicated to those changes.
[2]: {{ site.baseurl }}/release/2.0.0.html#contributors
[3]: {{ site.baseurl }}/release/2.0.0.html
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/jorgecarleitao
[6]: http://arrow.apache.org/docs/format/Versioning.html