The Apache Arrow team is pleased to announce the 7.0.0 release. This covers over 3 months of development work and includes 617 resolved issues from [105 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and cover only selected highlights of the release. Many other bugfixes and improvements have been made; see the [complete changelog][3].
Since the 6.0.1 release, Rémi Dattai and Alessandro Molina have been invited to be committers. Daniël Heres and Yibo Cai have joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
The Flight specification has been clarified to note that schemas are expected to be IPC-encapsulated on the wire.
Documentation has been generally improved; see the Arrow Cookbook for recipes on how to use Flight in Python and R, and a new example on how to use Flight and gRPC services on the same port.
This release includes Arrow Flight SQL, a protocol for using Arrow Flight to execute queries against and fetch metadata from SQL databases. Support is included for C++ and Java (but not languages that bind to C++, like Python or R). A more detailed blog post is forthcoming (EDIT 2022/02/16: see the [Flight SQL announcement]({% link _posts/2022-02-16-introducing-arrow-flight-sql.md %})). Note that development is ongoing and the specification is currently experimental.
A set of CMake presets has been added to ease building Arrow in a number of cases (ARROW-14678, ARROW-14714).
The `arrow::BitUtil` namespace has been renamed to `arrow::bit_util` (ARROW-13494).
Concatenation of union arrays is now supported (ARROW-4975).
`StructType` gained three convenience methods to add, change and remove a given field (ARROW-11424).
The `Datum` kind `COLLECTION` has been removed as it was entirely unused in the codebase (ARROW-13598).
A number of compute functions have been added:
Decimal data is now supported as input of the arithmetic kernels (ARROW-13130).
Dictionary data is now supported as input of the hash join execution node (ARROW-14181).
Residual predicates have been implemented in the hash join node (ARROW-13643).
The “list_parent_indices” function now always returns int64 data regardless of the input type (ARROW-14592).
Month-day-nano interval data is now supported as input of the same functions as other interval types (ARROW-13989).
The CSV writer got additional configuration options:
[Skyhook]({% link _posts/2022-01-31-skyhook-bringing-computation-to-storage-with-apache-arrow.md %}), a dataset addition that offloads fragment scan operations to a Ceph distributed storage cluster, was contributed (ARROW-13607).
The dataset writer now exposes options `min_rows_per_group` and `max_rows_per_group` to control the size of row groups created (ARROW-14426).
A critical bug in the AWS SDK for C++ that risks losing data in S3 multipart uploads has been circumvented (ARROW-14523).
The Google Cloud Storage filesystem is now featureful enough to pass all generic filesystem tests (ARROW-14924).
The OpenAppendStream method of filesystems has been un-deprecated; however, it still cannot be implemented for all filesystem backends (ARROW-14969).
A new function `arrow::fs::ResolveS3BucketRegion` allows resolving the region where a particular S3 bucket resides (ARROW-15165).
The S3 filesystem now sets the Content-Type of output files to “application/octet-stream” (instead of “application/xml” previously) if not explicitly specified by the caller (ARROW-15306).
Fine-grained I/O (coalescing) is now enabled in the synchronous (ARROW-12683) and asynchronous (ARROW-14577) IPC reader.
It is now possible to set the compression level when using LZ4 compression (ARROW-9648).
The ORC adapters have been significantly improved. Many more properties of the ORC reader, as well as ORC writer options, are now available. Moreover, API docs for both the ORC reader and the ORC writer have been generated (ARROW-11297).
DELTA_BYTE_ARRAY-encoded data can now be read from (but not written to) bytearray columns in Parquet files (PARQUET-492).
`MessageReader.Message` get properly surfaced by `Reader.Read` ARROW-14769
`ipc.Reader` properly uses the allocator it is initialized with instead of making native byte slices ARROW-14717
`Release` and `Retain` to maintain proper management of reference counting.
`ValueOffsets` function added to `array.String` to return the entire slice of offsets ARROW-14645
`array.Interface` has been lifted to `arrow.Array`, and `array.{Record,Column,Chunked,Table}` have been lifted to `arrow.{Record,Column,Chunked,Table}`. The interface `arrow.ArrayData` has been created to be used instead of `array.Data`. Aliases have been provided for the `array` package so existing code that doesn't directly use `array.Data` shouldn't be affected. The aliases will be removed in v8. ARROW-5599. The `Chunked.NewSlice` method has been removed and is replaced by the `array.NewChunkedSlice` function.
`json.Marshaller` interface. Builders support adding values to them by unmarshalling from JSON via the `json.Unmarshaller` interface. `array.FromJSON` function added to create Arrays from JSON directly. ARROW-9630
`compute` package in preparation for adding compute interfaces. Does not yet allow executing expressions. ARROW-14430
`file` module added; the Go Parquet library now supports full file reading and writing. ARROW-13984 ARROW-13986. Does not yet provide direct Parquet <--> Arrow conversions.
`GeneralOutOfPlaceVectorSorter` is now available for sorting any kind of vector. In general, if dedicated sorters can be used (like `FixedWidthInPlaceVectorSorter`), they should be preferred as they will generally perform better.
`log4j2` dependency was removed as it was unused and a possible vector for attacks.
`VectorSchemaRootAppender` now works with `BitVector`.
`vectorFromArray` are automatically cached for better performance.
`random` and `indices_nonzero` compute functions are now supported in Python.
`pyarrow.orc.read_table` is now provided to easily read the content of ORC files into a Table.
`pyarrow.orc.ORCFile` now has many more properties exposed.
`pyarrow.orc.ORCWriter` and `pyarrow.orc.write_table` now have the writer options available.
`pyarrow.orc` now has much better API documentation.
`Table` now has a `group_by` method that allows performing aggregations on table data. The compute functions documentation has also been improved to better distinguish between standard compute functions and `HASH_AGGREGATE` compute functions that can only be used for aggregations.
This release adds additional improvements to the `dplyr` interface, to CSV support, and to the C-Data interface to exchange data with other languages. For more details, see the [complete R changelog][4].
There are two new contributors: @okadakk and @simpl1g.
The updates to Red Arrow consist of the following improvements:
`Arrow::Function#execute` now accepts an instance of an `Arrow::Column` as its argument (ARROW-14551)
`Arrow::Table.load` now supports loading `.arrows` files (ARROW-15356)
`Arrow::Table` by a `URI` in `Arrow::Table.load` (ARROW-14562)
`Arrow::Table` now supports joining two tables (ARROW-14531)
`Arrow::Function#execute` is now easier to use than before (ARROW-15274)
`Arrow::SortKey#name` has been renamed to `Arrow::SortKey#target` (ARROW-14784)
`Arrow.s3_initialize` method (ARROW-14637)
The updates to Arrow GLib etc. consist of the following improvements:
`garrow_execute_plan_build_hash_join_node` function, `GArrowHashJoinNodeOption`, and `GArrowJoinType` (ARROW-15288)
`garrow_function_get_options_type` function (ARROW-15273)
`garrow_function_get_default_options` function (ARROW-15267)
`GArrowRoundToMultipleOptions` to customize the `round_to_multiple` function (ARROW-15216)
`garrow_function_all` function to list all the functions (ARROW-15205)
`garrow_function_get_name`, `garrow_function_equal`, and `garrow_function_to_string` functions for convenience
`GArrowRoundOptions` (ARROW-15204)
`garrow_struct_scalar_get_value` function for converting a C++ scalar value to a GLib value (ARROW-15203)
`GArrowMonthIntervalDataType` for the interval with the month component
`GArrowDayTimeIntervalDataType` for the interval with the days and the milliseconds components
`GArrowMonthDayNanoIntervalDataType` for the interval with the months, the days, and the nanoseconds components
`GArrowSortKey::name` to `::target` (ARROW-14784)
`garrow_s3_initialize` function (ARROW-14637)
`garrow_decimal128_new_string` and `garrow_decimal256_new_string` now return errors when they get an invalid decimal string (ARROW-14530)
`garrow_decimal128_data_type_new` and `garrow_decimal256_data_type_new` functions now validate the given precision (ARROW-14529)
Rust releases minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 6.x releases.
Going forward, the Rust implementation version will start deviating from the rest of the Arrow implementations, incrementing a major version if the changes to the crate require it. We still plan a release every other week. Please see issue #1120 for more details.
Major changes in the 7.0.0 release include:
`Decimal`
`dyn Array`
`Union` type now follows the latest Arrow standard
Another highlight is that the community continues to improve the safety of the arrow crate. The 6.4.0 release included complete data validation and has resolved all outstanding RUSTSEC issues against the crate.
For additional details on the 7.0.0 Rust implementation, please see the [Arrow Rust CHANGELOG][5].
[2]: {{ site.baseurl }}/release/7.0.0.html#contributors
[3]: {{ site.baseurl }}/release/7.0.0.html#changelog
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/blob/7.0.0/CHANGELOG.md