The Apache Arrow team is pleased to announce the 8.0.0 release. This covers over 3 months of development work and includes 586 resolved issues from [127 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 7.0.0 release, Kun Liu, Raphael Taylor-Davies, Xudong Wang, Yijie Shen and Liang-Chi Hsieh have been invited to be committers. Thanks for your contributions and participation in the project!
Flight SQL has been extended with a method to get type metadata (ARROW-15313) and column metadata in returned schemas (ARROW-15314, ARROW-16064). New documentation is available describing Flight and Flight SQL, along with several Cookbook recipes (ARROW-14698, ARROW-16065).
The C++ libraries now support UCX as a network transport (ARROW-15706), and the APIs have been refactored to allow other transports to be implemented (ARROW-15282). UCX support is experimental and still subject to change. Many of the APIs have been refactored to use the `arrow::Result` type, and the original variants have been deprecated (ARROW-16032). Support for gRPC >= 1.43 has been added (ARROW-15551).
Arrow C++ can now optionally build with support for the experimental Substrait query representation format (ARROW-15238).
A number of compute kernels operating on temporal data have been added:
It is possible to enable a timezone database on Windows at runtime by calling the `arrow::Initialize()` function (ARROW-13168).
New hash aggregations are available: “hash_one” to return one value from each group (ARROW-13993), and “hash_list” to return all values from each group (ARROW-15152). Null columns are now supported on the sum, mean and product hash aggregates (ARROW-15506). Also, it is now possible to execute hash “aggregations” with only key columns (ARROW-15609).
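The difference between the two new aggregations can be sketched in plain Python. This is not the Arrow API, only an illustration of the grouping semantics described above; the function names mirror the kernel names for readability, and returning the first value seen for "hash_one" is an assumption of this sketch, since the notes only say that one value per group is returned.

```python
from collections import defaultdict


def hash_list(keys, values):
    """Collect every value of each group, in input order within the group."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


def hash_one(keys, values):
    """Keep a single value per group (assumption: the first one encountered)."""
    result = {}
    for key, value in zip(keys, values):
        result.setdefault(key, value)
    return result


keys = ["a", "b", "a", "b", "c"]
values = [1, 2, 3, 4, 5]

listed = hash_list(keys, values)  # one list of values per key
picked = hash_one(keys, values)   # one scalar value per key
```

Calling either function with only keys of interest and ignoring the values corresponds loosely to the key-only aggregation case mentioned above.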
A new compute function “map_lookup” allows looking up a given key in a map array (ARROW-15089).
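As a rough illustration of what a map lookup does, the sketch below models each row of a map array as a list of key/value pairs in plain Python. It is not the Arrow function itself: the helper name `map_lookup_first` is hypothetical, and returning the first match (or `None` when the key is absent) is an assumption made for this example.

```python
def map_lookup_first(map_rows, query_key):
    """For each row of (key, value) pairs, return the first value stored
    under query_key, or None when the row is null or lacks the key."""
    results = []
    for row in map_rows:
        found = None
        if row is not None:
            for key, value in row:
                if key == query_key:
                    found = value
                    break
        results.append(found)
    return results


rows = [
    [("x", 1), ("y", 2)],
    [("y", 3), ("y", 4)],
    [],
    None,
]
looked_up = map_lookup_first(rows, "y")
```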
New compute functions “sqrt” and “sqrt_checked” allow extracting the square root of their input (ARROW-15614).
Casting between two struct types is now possible, assuming the destination field names all exist in the source struct type (ARROW-1888, ARROW-15643).
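The field-name rule can be illustrated with a small plain-Python sketch, where a struct value is modelled as a dict and the destination type as a mapping from field name to a conversion callable. The helper `cast_struct` is hypothetical and only mirrors the rule stated above: every destination field name must exist in the source.

```python
def cast_struct(source_row, destination_fields):
    """Cast a struct value (dict) to a destination struct type.

    destination_fields maps each destination field name to a callable
    that converts the field value. Fails if a name is not in the source.
    """
    missing = [name for name in destination_fields if name not in source_row]
    if missing:
        raise ValueError(f"fields missing from source struct: {missing}")
    return {
        name: convert(source_row[name])
        for name, convert in destination_fields.items()
    }


source = {"id": 7, "name": "x", "extra": True}
# Destination uses a subset of the source fields, with "id" cast to float.
converted = cast_struct(source, {"name": str, "id": float})
```

Source fields that the destination does not name (here `extra`) are simply not carried over in this sketch.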
Optional OpenTelemetry tracing has been added to kernel functions and execution plan nodes (ARROW-15061).
The CMake build option `ARROW_ENGINE` has been renamed to `ARROW_SUBSTRAIT`, to better reflect its actual effect (ARROW-16158).
It is now possible to change the field delimiter when writing a CSV file (ARROW-15672).
The ORC dataset scanner now observes the batch size parameter (ARROW-14153).
The dataset layer now supports filename-based partitioning, where the data files are all laid out in the dataset's base directory, their names prefixed with the partition values separated by underscore characters (ARROW-14612).
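To make the layout concrete, here is a minimal plain-Python sketch of how partition values could be recovered from such a file name. It is not the Arrow implementation; the helper name, the example path and the field names `year` and `month` are all hypothetical, and values are left as strings rather than cast to a schema.

```python
import posixpath


def parse_filename_partition(path, field_names, separator="_"):
    """Split the leading underscore-separated segments of a file name
    into partition values, one per field name."""
    basename = posixpath.basename(path)
    segments = basename.split(separator)
    # There must be one segment per field plus the rest of the file name.
    if len(segments) <= len(field_names):
        raise ValueError(f"{basename!r} has too few partition segments")
    return dict(zip(field_names, segments[: len(field_names)]))


partition = parse_filename_partition(
    "dataset/2009_11_part-0.parquet", ["year", "month"]
)
```

All files stay directly in `dataset/`, in contrast to directory-based schemes where each partition value adds a directory level.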
Optional OpenTelemetry tracing has been added to the dataset scanner (ARROW-15067).
It is possible to instantiate a Google Cloud Storage (GCS) filesystem from a URI, making GCS implicitly usable in the datasets layer (ARROW-14893). Recognized URI schemes are `gs` and `gcs`.
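As an illustration of the two recognized schemes, the following standard-library sketch splits such a URI into a bucket and an object path. It does not call Arrow; the helper name and the bucket name `example-bucket` are hypothetical, and treating the URI authority as the bucket is an assumption of this example.

```python
from urllib.parse import urlsplit

GCS_SCHEMES = {"gs", "gcs"}


def split_gcs_uri(uri):
    """Return (bucket, object_path) for a gs:// or gcs:// URI."""
    parts = urlsplit(uri)
    if parts.scheme not in GCS_SCHEMES:
        raise ValueError(f"not a GCS URI scheme: {parts.scheme!r}")
    return parts.netloc, parts.path.lstrip("/")


bucket, key = split_gcs_uri("gs://example-bucket/data/part-0.parquet")
```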
`FileSystem::DeleteDirContents` can now optionally succeed when the directory doesn't exist (ARROW-16159).
It is possible to override the number of IO threads using the environment variable `ARROW_IO_THREADS` (ARROW-15941).
The IPC file reader and writer now allow accessing the custom metadata associated with record batches (ARROW-16131).
It is possible to enable lightweight memory checks on the standard memory pools using a dedicated environment variable, `ARROW_DEBUG_MEMORY_POOL` (ARROW-15550). These checks are not a replacement for sophisticated checkers such as Address Sanitizer or Valgrind, but can come in handy if those tools are not available.
Temporal data is now validated when doing full array validation (ARROW-10924). The validation catches values not matching the specification (for example, a time value being outside of the span of one day).
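The kind of check involved can be sketched in plain Python for time-of-day values stored as seconds. This is only an illustration of the rule in the example above, not Arrow's validation code, and the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def invalid_time_indices(values_in_seconds):
    """Return the indices of non-null time-of-day values that fall
    outside the half-open range [0, SECONDS_PER_DAY)."""
    return [
        index
        for index, value in enumerate(values_in_seconds)
        if value is not None and not 0 <= value < SECONDS_PER_DAY
    ]


bad = invalid_time_indices([0, 86399, 86400, None, -1])
```

Nulls are skipped because a null slot carries no value to check.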
The GDB plugin now attempts to print the data of an array, in addition to its metadata (ARROW-15389). This only works for primitive datatypes.
Pretty-printing is now shorter and more customizable for nested datatypes (ARROW-14798).
With .NET Core 2.1 reaching end-of-life in August 2021, the Apache.Arrow library has been updated to target `netcoreapp3.1` and higher. It still supports `netstandard1.3`, so the library works on .NET Framework. But to get the best performance, using .NET Core 3.1, .NET 5, or later is recommended.
* `ArrowReader` is now returned, which makes it easier to create a `VectorSchemaRoot` from it.
* `tableFromJSON` and struct vectors in `vectorFromArray` (ARROW-16210).

In general, the Python bindings benefit from improvements in the C++ library (e.g. new compute functions); see the C++ notes above for additional details. In addition:
* `join` operation to perform `left`, `right`, `full` joins of `inner` or `outer` types. The result of the join operation will be a new table (ARROW-14293). See https://arrow.apache.org/docs/dev/python/compute.html#table-and-dataset-joins for examples.
* `ParquetDataset` class have been deprecated and will issue a warning, in favor of functionality based on `pyarrow.dataset` (ARROW-16119).
* `py.field("a", "b")` (ARROW-11259).
* `Schema`, `ChunkedArray`, `Tensor`, `RecordBatch`, `parquet` and `Table` now include examples on how to use the methods and classes (ARROW-15367).
* `zoneinfo` (Python 3.9+) and `dateutil` timezones in conversion to Arrow data structures (ARROW-5248).

This release includes:
* `lubridate` and `base` date and time functions in Arrow dplyr queries,
* `c()`, `rbind()` and `cbind()`.

For more on what’s in the 8.0.0 R package, see the [R changelog][4].
* `#values` of `MonthInterval` type (ARROW-15749)
* `#raw_records` of `MonthInterval` type (ARROW-15750)
* `#values` of `DayTimeInterval` type (ARROW-15885)
* `DayTimeIntervalArrayBuilder` to support to make `DayTimeIntervalArray` by a Hash with `:day` and `:millisecond` keys (ARROW-15918)
* `#raw_records` of `DayTimeInterval` type (ARROW-15886)
* `#values` of `MonthDayNanoInterval` type (ARROW-15924)
* `MonthDayNanoIntervalArrayBuilder` to support to make `MonthDayNanoIntervalArray` by a Hash with `:month`, `:day`, and `:nanosecond` keys
* `#raw_records` of `MonthDayNanoInterval` type (ARROW-15925)
* `Parquet::BooleanStatistics` (ARROW-16251)
* `gaflight_client_close` (ARROW-15487)
* `GParquetFileMetadata` and `gparquet_arrow_file_reader_get_metadata` (ARROW-16214)
* `GArrowGIOInputStream` so that all the data is completely read (ARROW-15626)
* `garrow_string_array_builder_append_string_len` and `garrow_large_string_array_builder_append_string_len` (ARROW-15629)
* `GParquetRowGroupMetadata` (ARROW-16245)
* `GParquetColumnChunkMetadata` (ARROW-16250)
* `GArrowGCSFileSystem` (ARROW-16247)
* `GParquetStatistics` and its family (ARROW-16251)
  * `GParquetBooleanStatistics`
  * `GParquetInt32Statistics`
  * `GParquetInt64Statistics`
  * `GParquetFloatStatistics`
  * `GParquetDoubleStatistics`
  * `GParquetByteArrayStatistics`
  * `GParquetFixedLengthByteArrayStatistics`
* `GArrowRoundMode` (ARROW-16296)

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 13.0.0 release of the Rust implementation, see the [Arrow Rust changelog][5].
[2]: {{ site.baseurl }}/release/8.0.0.html#contributors
[3]: {{ site.baseurl }}/release/8.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/blob/13.0.0/CHANGELOG.md