The Apache Arrow team is pleased to announce the 5.0.0 release. This covers 3 months of development work and includes 684 commits from [99 distinct contributors][1] in 2 repositories. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelogs for the [apache/arrow][2] and [apache/arrow-rs][3] repositories.
Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace have been invited as committers to Arrow, and Benjamin Kietzman and David Li have joined the Project Management Committee (PMC). Thank you for all of your contributions!
Official IANA Media types (MIME types) have been registered for Apache Arrow IPC protocol data, both [stream]({{ site.baseurl }}/docs/format/Columnar.html#ipc-streaming-format) and [file]({{ site.baseurl }}/docs/format/Columnar.html#ipc-file-format) variants:
We recommend `.arrow` as the IPC file format file extension and `.arrows` for the IPC streaming format file extension.
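One way a tool might act on this convention is to choose an IPC reader from the file extension. The sketch below is plain Python with no Arrow dependency; the `ipc_format_for` helper and the `"file"`/`"stream"` labels are hypothetical names chosen for illustration, and only the two extensions come from the recommendation above.

```python
from pathlib import Path
from typing import Optional

# Extensions recommended in the release notes; the labels are illustrative.
IPC_FORMAT_BY_EXTENSION = {
    ".arrow": "file",     # IPC file format
    ".arrows": "stream",  # IPC streaming format
}


def ipc_format_for(path: str) -> Optional[str]:
    """Return "file" or "stream" for a recognized extension, else None."""
    return IPC_FORMAT_BY_EXTENSION.get(Path(path).suffix.lower())


print(ipc_format_for("trades.arrow"))    # file
print(ipc_format_for("trades.arrows"))   # stream
print(ipc_format_for("trades.parquet"))  # None
```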
The Go implementation now supports custom metadata and middleware, and has been added to integration testing.
In Python, some operations can now be interrupted via Control-C.
`MakeArrayFromScalar` now works for fixed-size binary types (ARROW-13321).
The following compute functions were added:

* aggregations: `index`
* scalar arithmetic and math functions: `abs`, `abs_checked`, `acos`, `acos_checked`, `asin`, `asin_checked`, `atan`, `atan2`, `ceil`, `cos`, `cos_checked`, `floor`, `ln`, `ln_checked`, `log10`, `log10_checked`, `log1p`, `log1p_checked`, `log2`, `log2_checked`, `negate`, `negate_checked`, `sign`, `sin`, `sin_checked`, `tan`, `tan_checked`, `trunc`
* scalar bitwise functions: `bit_wise_and`, `bit_wise_not`, `bit_wise_or`, `bit_wise_xor`, `shift_left`, `shift_left_checked`, `shift_right`, `shift_right_checked`
* scalar string functions: `ascii_center`, `ascii_lpad`, `ascii_reverse`, `ascii_rpad`, `binary_join`, `binary_join_element_wise`, `binary_replace_slice`, `count_substring`, `count_substring_regex`, `ends_with`, `find_substring`, `find_substring_regex`, `match_like`, `split_pattern_regex`, `starts_with`, `utf8_center`, `utf8_lpad`, `utf8_replace_slice`, `utf8_rpad`, `utf8_reverse`, `utf8_slice_codepoints`
* scalar temporal functions: `day`, `day_of_week`, `day_of_year`, `iso_calendar`, `iso_week`, `iso_year`, `hour`, `microsecond`, `millisecond`, `minute`, `month`, `nanosecond`, `quarter`, `second`, `subsecond`, `year`
* other scalar functions: `case_when`, `coalesce`, `if_else`, `is_finite`, `is_inf`, `is_nan`, `max_element_wise`, `min_element_wise`, `make_struct`
* vector functions: `replace_with_mask`

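Several of the arithmetic functions come in pairs such as `negate` and `negate_checked`. As we understand the Arrow compute documentation, the plain variant wraps around on integer overflow while the `_checked` variant reports an error. The sketch below models that difference for 8-bit signed integers in plain Python; it does not call Arrow, and the helper names are hypothetical stand-ins that only borrow the function names from the list.

```python
INT8_MIN, INT8_MAX = -128, 127


def wrap_int8(value: int) -> int:
    """Reduce an integer to the int8 range using two's-complement wraparound."""
    return (value - INT8_MIN) % 256 + INT8_MIN


def negate(value: int) -> int:
    """Unchecked negation: an out-of-range result silently wraps."""
    return wrap_int8(-value)


def negate_checked(value: int) -> int:
    """Checked negation: an out-of-range result raises instead of wrapping."""
    result = -value
    if not INT8_MIN <= result <= INT8_MAX:
        raise OverflowError(f"negating {value} overflows int8")
    return result


print(negate(5))     # -5
print(negate(-128))  # -128, because +128 does not fit in int8
try:
    negate_checked(-128)
except OverflowError as exc:
    print("error:", exc)
```

The only value whose negation overflows int8 is -128, which is why it is used as the test case here.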
Duplicates are now allowed in `SetLookupOptions::value_set` (ARROW-12554).
Decimal types are now supported by some basic arithmetic functions (ARROW-12074).
The `take` function now supports dense unions (ARROW-13005).
It is now possible to cast between dictionary types with different index types (ARROW-11673).
Sorting is now implemented for boolean input (ARROW-12016).
The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).
The CSV reader tries to make its errors more informative by adding the row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).
A new option `ReaderOptions::skip_rows_after_names` allows skipping a number of rows after reading the column names (as opposed to `ReaderOptions::skip_rows`, which skips rows before the column names are read).
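The difference between the two skip options is easiest to see on a small input. The following is a minimal sketch using only Python's standard `csv` module; `read_rows` is a hypothetical helper that borrows the option names from the text and is not the Arrow CSV reader.

```python
import csv
import io


def read_rows(text, skip_rows=0, skip_rows_after_names=0):
    """Parse CSV text, skipping rows before and/or after the header row."""
    rows = list(csv.reader(io.StringIO(text)))
    rows = rows[skip_rows:]              # dropped before the names are read
    names, data = rows[0], rows[1:]
    data = data[skip_rows_after_names:]  # dropped after the names are read
    return names, data


text = "# exported by a tool\na,b\n1,2\n3,4\n5,6\n"

# Skip the leading comment line so that "a,b" becomes the header.
print(read_rows(text, skip_rows=1))
# (['a', 'b'], [['1', '2'], ['3', '4'], ['5', '6']])

# Additionally drop the first two data rows that follow the header.
print(read_rows(text, skip_rows=1, skip_rows_after_names=2))
# (['a', 'b'], [['5', '6']])
```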
Quoted strings can now be treated as always non-null (ARROW-10115).
The asynchronous scanner introduced in 4.0.0 has been improved with truly asynchronous readers implemented for CSV, Parquet, and IPC file formats and file-level parallelism added. This mode is controlled by a flag `use_async` that can be passed into methods which scan a dataset. Setting this flag to true can significantly improve performance on filesystems with high latency or parallel reads (e.g. S3).
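To see why file-level parallelism matters on a high-latency filesystem, consider the toy model below. It is a sketch in plain Python `asyncio`, not the Arrow scanner: `read_file` is a hypothetical stand-in whose `sleep` models the round trip to remote storage, and the paths and latency value are made up for illustration.

```python
import asyncio
import time


async def read_file(path: str, latency: float = 0.05) -> int:
    """Pretend to fetch one file; the sleep models round-trip latency."""
    await asyncio.sleep(latency)
    return len(path)  # stand-in for the number of rows read


async def scan_sequential(paths):
    """Read files one after another, paying the latency once per file."""
    return [await read_file(p) for p in paths]


async def scan_concurrent(paths):
    """Issue all reads at once so that the waits overlap."""
    return list(await asyncio.gather(*(read_file(p) for p in paths)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


paths = [f"bucket/part-{i}.parquet" for i in range(8)]
sequential_result, sequential_seconds = timed(scan_sequential(paths))
concurrent_result, concurrent_seconds = timed(scan_concurrent(paths))

print(f"sequential: {sequential_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

Both scans return the same values in the same order; only the elapsed time differs, since the concurrent version waits roughly one latency period instead of eight.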
A CountRows method has been added to count rows matching a predicate; where possible, this will use metadata in files instead of reading the data itself.
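The idea of answering a count from metadata can be sketched as follows. This is a simplified model in plain Python, not the Arrow implementation: the `Fragment` class, its `num_rows_metadata` field, and the `count_rows` function are hypothetical, and the sketch only uses metadata when no predicate is given, whereas a real scanner may be able to do more.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Fragment:
    """One file of a dataset (hypothetical model)."""
    rows: List[dict]
    num_rows_metadata: Optional[int] = None  # row count stored in the file, if any
    reads: int = 0                           # how many times the data was scanned

    def scan(self) -> List[dict]:
        self.reads += 1
        return self.rows


def count_rows(fragments: List[Fragment],
               predicate: Optional[Callable[[dict], bool]] = None) -> int:
    """Count matching rows, avoiding a data scan when metadata suffices."""
    total = 0
    for fragment in fragments:
        if predicate is None and fragment.num_rows_metadata is not None:
            total += fragment.num_rows_metadata  # answered from metadata alone
        else:
            rows = fragment.scan()               # fall back to reading the data
            total += sum(1 for row in rows if predicate is None or predicate(row))
    return total


with_metadata = Fragment(rows=[{"x": 1}, {"x": 5}], num_rows_metadata=2)
without_metadata = Fragment(rows=[{"x": 7}])

print(count_rows([with_metadata, without_metadata]))  # 3
print(with_metadata.reads, without_metadata.reads)    # 0 1
print(count_rows([with_metadata, without_metadata], lambda row: row["x"] > 4))  # 2
```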
CSV datasets can now be written, and when reading a CSV dataset, explicit types can now be specified for a subset of columns while allowing the rest to still be inferred.
The I/O thread pool size can now be adjusted at runtime (ARROW-12760). The default size remains 8 threads.
Streams can now have auxiliary metadata, depending on the backend. This has been implemented for the S3 filesystem, where a couple of metadata keys are supported, such as `Content-Type` and `ACL` (ARROW-11161, ARROW-12719).
The HadoopFileSystem implementation now implements the FileSystem abstraction more faithfully (ARROW-12790).
The new LZ4_RAW compression scheme was implemented (PARQUET-1998). Unlike the legacy LZ4 compression scheme, it is defined unambiguously and should provide better portability once other Parquet implementations catch up.
* Added `flight.NewClientWithMiddleware` and `flight.NewServerWithMiddleware`. Functions `flight.NewFlightClient`, `flight.NewFlightServer`, `flight.CreateServerBearerTokenAuthInterceptors` have been deprecated in favor of using the new middleware. #10633
* `AuthHandler` no longer overwrites outgoing metadata; new metadata is now appended to the existing metadata. #10297
* Added `flight.Reader#LatestAppMetadata()` and `flight.Writer#WriteWithAppMetadata` functions. #10142

Highlighted improvements and fixes:
* `ExtensionTypeVector` base class.
* `AbstractContainerVector` to be consistent with other vectors.

API compatibility changes:
* `getObject(int)`. #9964

* The asynchronous scanner is used when the `use_async=True` option is provided to the `Dataset.scanner`, `Dataset.to_table`, or `Dataset.to_batches` methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
* `pyarrow.csv.write_csv`
* More `pyarrow.compute` functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
* `ORCFile` objects
* `StructArray` now accepts a `mask` like other arrays

In this release, we’ve more than doubled the number of functions you can call on Arrow Datasets inside `dplyr::filter()`, `mutate()`, and `arrange()`, including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We’ve also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.
For more on what’s in the 5.0.0 R package, see the [R changelog][4].
Initial Apache Arrow Flight support has been added. Only `ListFlights` is supported for now; more features will be implemented in the next major release.
You need the gobject-introspection gem 3.4.5 or later to implement your own Apache Arrow Flight server. If you only use the Apache Arrow Flight client, gobject-introspection gem 3.4.5 or later isn't required.
Here are highlighted improvements:
Compute functions accept raw Ruby objects such as `true`, `Integer`, `Array` and `String`:

```ruby
require "arrow"

add_function = Arrow::Function.find("add")

# Not shortcut version: wrap each argument in a datum explicitly
augend = Arrow::Int8Array.new([1, 2, 3])
addend = Arrow::Int8Scalar.new(5)
args = [
  Arrow::ArrayDatum.new(augend),
  Arrow::ScalarDatum.new(addend),
]
add_function.execute(args).value.to_a # => [6, 7, 8]

# Shortcut version: pass raw Ruby objects directly
add_function.execute([[1, 2, 3], 5]).value.to_a # => [6, 7, 8]
```

`Arrow::PrimaryArray` and `Arrow::Buffer` can be used as a MemoryView, which was added in Ruby 3.0.
There are some backward incompatible changes:
`Arrow::CountOptions` and `Arrow::CountMode` are removed. Use `Arrow::ScalarAggregateOptions` instead.

There are some backward incompatible changes:
* `GArrowCountOptions` and `GArrowCountMode` are removed. Use `GArrowScalarAggregateOptions` instead.
* `garrow_array_equal_range()` requires `GArrowEqualOptions`.
* Renamed to `gadataset_`/`GADATASET_` from `gad_`/`GAD_`.
* `GADScanOptions`, `GADScanTask` and `GADInMemoryScanTask` are removed. Use `gadataset_begin_scan()` or `gadataset_to_table()` instead.
* `GArrowCompareOptions`, `GArrowCompareOperator` and `garrow_*_array_compare()` are removed. Use the `equal`, `not_equal`, `less_than`, `less_than_equal`, `greater_than` and `greater_than_equal` compute functions directly instead.

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the 5.0.0 release of the Rust implementation, see the [Arrow Rust changelog][3] and the [Apache Arrow Rust 5.0.0 Release blog post]({% post_url 2021-07-20-5.0.0-rs-release %}).

[1]: {{ site.baseurl }}/release/5.0.0.html#contributors
[2]: {{ site.baseurl }}/release/5.0.0.html#changelog
[3]: https://github.com/apache/arrow-rs/blob/5.0.0/CHANGELOG.md
[4]: {{ site.baseurl }}/docs/r/news/