The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from [114 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions!
In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default.
Arrow Flight is now packaged for C#/.NET.
There are Linux packages for C++ and C GLib. They were provided by Bintray but [Bintray is no longer available as of 2021-05-01][5]. They are provided by Artifactory now. Users needs to change the install instructions because the URL has changed. See [the install page][6] for new instructions. Here is a summary of needed changes.
For Debian GNU Linux and Ubuntu users:
apache-arrow-archive-keyring
install instruction:apache-arrow-apt-source
.https://apache.jfrog.io/artifactory/arrow/...
from https://apache.bintray.com/arrow/...
.For CentOS and Red Hat Enterprise Linux users:
apache-arrow-release
install instruction:https://apache.jfrog.io/artifactory/arrow/...
from https://apache.bintray.com/arrow/...
.The Arrow C++ library now includes a vcpkg.json
manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG
to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the [Building Arrow C++]({{ site.baseurl }}/docs/developers/cpp/building.html) and [Developing on Windows]({{ site.baseurl }}/docs/developers/cpp/windows.html) docs pages for details.
The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316).
Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull
(ARROW-10580).
Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal.
Compute functions quantile
(ARROW-10831) and power
(ARROW-11070) have been added for numeric data.
Compute functions for string processing have been added for:
extract_regex
, ARROW-10195).utf8_length
, ARROW-11693).match_substring_regex
, ARROW-12134).replace_substring
and replace_substring_regex
, ARROW-10306).It is now possible to sort decimal and fixed-width binary data (ARROW-11662).
The precision of the sum
kernel was improved (ARROW-11758).
A CSV writer has been added (ARROW-2229).
The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).
Arrow Datasets received various performance improvements and new features. Some highlights:
Fixed some rare instances of GZip files could not be read properly (ARROW-12169).
Support for setting S3 proxy parameters has been added (ARROW-8900).
The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391).
The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838).
The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797).
The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406).
An interoperability issue with the C# implementation was fixed (ARROW-12100).
A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).
The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data.
The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).
Parquet DECIMAL statistics were previously calculated incorrectly, this has now been fixed (PARQUET-1655).
Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318).
Arrow Flight is now packaged for C#/.NET.
The go implementation now supports IPC buffer compression
Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).
Creating a dataset with pyarrow.dataset.write_dataset
is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field('a') / ds.field('b')
) (ARROW-12058). See the [Dataset docs][7] for more details.
See the C++ notes above for additional details.
The dplyr
interface to Arrow data gained many new features in this release, including support for mutate()
, relocate()
, and more. You can also call in filter()
or mutate()
over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl()
, gsub()
, etc.) and stringr
(str_detect()
, str_replace()
) spellings.
Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset()
to partition a very large file without having to read the whole file into memory.
For more on what’s in the 4.0.0 R package, see the [R changelog][4].
In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++.
GGandivaFilter
, GGandivaCondition
, and GGandivaSelectableProjector
input
property is added in GArrowCSVReader
and GArrowJSONReader
configure
script, support is droppedGADScanContext
is removed, and use_threads
property is moved to GADScanOptions
garrow_chunked_array_combine
function is addedgarrow_array_concatenate
function is addedGADFragment
and its subclass GADInMemoryFragment
are addedGADScanTask
now holds the corresponding GADFragment
gad_scan_options_replace_schema
function is removedDecimal128DataType
is changed to decimal128
In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib.
ArrowDataset::ScanContext
is removed, and use_threads
attribute is moved to ArrowDataset::ScanOptions
Arrow::Array#concatenate
is added; it can concatenate not only an Arrow::Array
but also a normal Array
Arrow::SortKey
and Arrow::SortOptions
are added for accepting Ruby objects as sort key and optionsArrowDataset::InMemoryFragment
is addedThis release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold.
limit
in sort kernelnew_null_array
for creating Arrays full of nulls.New Features
SQL Support
Extensibility API
Physical Plans
Hash Repartitioning
SQL Metrics
Additional Postgres compatible function library:
Proper identifier case identification (e.g. “Foo” vs Foo vs foo)
Upgraded to Tokio 1.x
Performance Improvements:
API Changes
Vec<Expr>
rather than &[Expr]
Arc<TableProvider>
rather than Box<TableProvider>
Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0
[2]: {{ site.baseurl }}/release/4.0.0.html#contributors [3]: {{ site.baseurl }}/release/4.0.0.html [4]: {{ site.baseurl }}/docs/r/news/ [5]: https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/ [6]: {{ site.baseurl }}/install/ [7]: https://arrow.apache.org/docs/python/dataset.html#projecting-columns