The Apache Arrow team is pleased to announce the 6.0.0 release. This covers over 3 months of development work and includes 572 resolved issues from [77 distinct contributors][2]. See the Install Page to learn how to get the libraries for your platform.
The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the [complete changelog][3].
Since the 5.0.0 release, Nic Crane, QP Hou, Jiayu Liu, and Matt Topol have been invited to be committers, and Neville Dipale has joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
A new calendar interval type consisting of Month, Day and Nanoseconds has been added to the specification. Reference implementations exist in Java, C++ and Python.
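The new type keeps months, days and nanoseconds as three separate fields because a month has no fixed length in days. A rough plain-Python sketch (illustration only, not the Arrow API) of how such an interval combines with a timestamp:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MonthDayNano:
    """Plain-Python stand-in for the month-day-nanosecond interval:
    three independent components, since a month has no fixed duration."""
    months: int
    days: int
    nanoseconds: int

def add_interval(ts: datetime, iv: MonthDayNano) -> datetime:
    # Advance whole months first (day-of-month clamping is ignored here for
    # simplicity), then apply days and sub-day nanoseconds as fixed-length
    # durations. datetime only resolves microseconds, so we convert.
    month0 = ts.month - 1 + iv.months
    ts = ts.replace(year=ts.year + month0 // 12, month=month0 % 12 + 1)
    return ts + timedelta(days=iv.days, microseconds=iv.nanoseconds / 1000)

print(add_interval(datetime(2021, 1, 15), MonthDayNano(1, 1, 0)))
```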
GLib and Ruby have added bindings for Arrow Flight.
While not part of the release, work is ongoing on Arrow Flight SQL, which defines a protocol for clients to communicate with SQL databases using Arrow Flight. For those interested in the project, please reach out on the mailing list.
The month-day-nano interval type has been added (ARROW-13628).
Various APIs, including extension types and scalars, are no longer experimental (ARROW-5244).
Support for Visual Studio 2015 was dropped (ARROW-14070).
A basic in-memory query engine has been implemented and is accessible from the R bindings. Operations including filter, project, sort, equality joins, and various aggregations are supported.
The following compute functions have been added:

- Aggregations: `approximate_median`, `count_distinct`, `max`, `min`, `product`
- Grouped ("hash") aggregations: `hash_all`, `hash_any`, `hash_approximate_median`, `hash_count_distinct`, `hash_distinct`, `hash_max`, `hash_mean`, `hash_min`, `hash_product`, `hash_stdev`, `hash_variance`
- Arithmetic: `logb`, `round`, `round_to_multiple`
- String: `ascii_capitalize`, `ascii_swapcase`, `ascii_title`, `utf8_capitalize`, `utf8_swapcase`, `utf8_title`
- Temporal: `assume_timezone`, `day_time_interval_between`, `days_between`, `hours_between`, `microseconds_between`, `milliseconds_between`, `minutes_between`, `month_day_nano_interval_between`, `month_interval_between`, `nanoseconds_between`, `quarters_between`, `seconds_between`, `strftime`, `us_week`, `week`, `weeks_between`, `years_between`
- Structural: `choose`, `list_element`
- Selection and sorting: `drop_null`, `select_k_unstable`
In general, type support has been improved for most of the compute functions, but work here is ongoing, particularly around decimal support.
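To give a flavor of the new kernels' semantics, here is a plain-Python sketch (illustration only, not the pyarrow API) of what `round_to_multiple` and `logb` compute, including null propagation:

```python
import math

def round_to_multiple(values, multiple):
    # Round each element to the nearest integer multiple of `multiple`,
    # propagating None the way Arrow kernels propagate nulls.
    return [None if v is None else round(v / multiple) * multiple for v in values]

def logb(values, base):
    # Logarithm to an arbitrary base, null-propagating.
    return [None if v is None else math.log(v, base) for v in values]

print(round_to_multiple([1.2, 3.9, None], 0.5))  # [1.0, 4.0, None]
```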
Crashes have been fixed in particular cases for `take`, `filter`, `unique`, and `value_counts` (ARROW-13474, ARROW-13509, ARROW-14129).
Hash aggregations (i.e. group by) support scalar and array values (ARROW-13737, ARROW-14027).
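Conceptually, a hash aggregation buckets rows by their group key and reduces each bucket; a minimal plain-Python sketch of a grouped sum (the real kernels are vectorized, this is for illustration only):

```python
from collections import defaultdict

def hash_sum(keys, values):
    # Group `values` by the corresponding entry in `keys` and sum each
    # group, skipping nulls as aggregation kernels do by default.
    groups = defaultdict(float)
    for k, v in zip(keys, values):
        if v is not None:
            groups[k] += v
    return dict(groups)

print(hash_sum(["a", "b", "a", "b"], [1.0, 2.0, 3.0, None]))  # {'a': 4.0, 'b': 2.0}
```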
Temporal functions are now timezone-aware (e.g. when extracting the hour of a timestamp) (ARROW-12980).
`count` can optionally count all values, not just null or non-null values (ARROW-13574).
`fill_null` has been replaced by the more general `coalesce` (ARROW-7179).
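The semantics of `coalesce` can be sketched in plain Python (illustration only, not the pyarrow API): for each row, it returns the first non-null value among its arguments, which generalizes the single-fallback behavior of `fill_null`:

```python
def coalesce(*columns):
    # For each row position, take the first non-null value across the
    # input columns; None only if every column is null at that position.
    return [next((v for v in row if v is not None), None) for row in zip(*columns)]

print(coalesce([None, 2, None], [10, 20, None], [0, 0, 0]))  # [10, 2, 0]
```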
`is_null` can optionally consider NaN as null (ARROW-12959).
Sorting has been optimized (ARROW-10898, ARROW-14165). Also, null values can now be sorted at either the beginning or the end (ARROW-12063).
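The new null-placement behavior can be sketched as follows (plain Python, illustration only; the option name mirrors the C++ `null_placement` setting):

```python
def sort_nulls(values, null_placement="at_end", ascending=True):
    # Sort the non-null values, then place the nulls as a block at the
    # requested end of the output.
    non_null = sorted(v for v in values if v is not None)
    if not ascending:
        non_null.reverse()
    nulls = [None] * (len(values) - len(non_null))
    return nulls + non_null if null_placement == "at_start" else non_null + nulls

print(sort_nulls([3, None, 1, 2], null_placement="at_start"))  # [None, 1, 2, 3]
```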
The CSV reader can read time32 and time64 types, and will infer time32 values for columns in the format “hh:mm” and “hh:mm:ss” (ARROW-11243).
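The inference rule can be sketched with a simple pattern check (a hypothetical sketch, not the actual parser, which also validates field ranges):

```python
import re

# Shapes the reader recognizes as time32 values: "hh:mm" and "hh:mm:ss".
TIME_PATTERN = re.compile(r"^\d{2}:\d{2}(:\d{2})?$")

def infer_is_time32(column_values):
    # A column is inferred as time32 only if every non-empty cell matches.
    return all(TIME_PATTERN.match(v) for v in column_values if v)

print(infer_is_time32(["12:34", "23:59:59"]))   # True
print(infer_is_time32(["12:34", "not a time"]))  # False
```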
The decimal point can be customized when reading (ARROW-13421).
The streaming reader will not unintentionally infer null-typed columns when using the various skip options (ARROW-13441).
If a row has an incorrect number of columns, the row can now be skipped instead of raising an error (ARROW-12673).
The option `quoted_strings_can_be_null` now applies to all column types, not just strings (ARROW-13580). When quoting is disabled entirely, the reader now takes advantage of this to improve performance (ARROW-14150).
A CSVWriter object is now exposed, allowing for incremental writing (ARROW-11828). Dates can now be written (ARROW-12540).
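Incremental writing means record batches are appended as they arrive instead of materializing the whole table first; in spirit (a stdlib sketch, not the actual pyarrow.csv API):

```python
import csv
import io

def write_batches(sink, schema, batches):
    # Write the header once, then append each record batch as it arrives,
    # so the full table never has to be held in memory at once.
    writer = csv.writer(sink)
    writer.writerow(schema)
    for batch in batches:
        writer.writerows(batch)

buf = io.StringIO()
write_batches(buf, ["id", "date"], [[(1, "2021-10-26")], [(2, "2021-10-27")]])
print(buf.getvalue())
```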
The dataset writer was refactored, and now supports more options, including a limit on the number of files open at once, compatibility with the async scanner, a limit on the number of rows written per file, and control over what to do when files already exist in the target directory (ARROW-13650). Additionally, the query engine can feed into the dataset writer as a sink (ARROW-13542).
The asynchronous scanner now properly respects backpressure (ARROW-13611, ARROW-14192), as does the writer (ARROW-14191).
ORC datasets are supported (ARROW-13572) with support for column projection pushdown (ARROW-13797).
The Parquet/IPC format readers now respect the batch_size scanner option (ARROW-14024). Also, the Parquet reader now properly implements readahead for better performance (ARROW-14026).
The retry strategy of S3FileSystem can be customized (ARROW-13508). When writing to an existing bucket as a user with limited permissions, Arrow will no longer emit a spurious “Access Denied” error (ARROW-13685).
On macOS with NFS mounts, an “[errno 25] Inappropriate ioctl for device” error was fixed (ARROW-13983).
The basics of a Google Cloud Storage filesystem have been added; work is in progress for full support (ARROW-8147, ARROW-14222, ARROW-14223, ARROW-14232, ARROW-14236, ARROW-14345, ARROW-14157).
A crash was fixed when duplicate keys were present (ARROW-14109).
Written min/max and null_count statistics for dictionary types were corrected (ARROW-11634, ARROW-12513). null_count statistics for columns that contain repeated data were also corrected.
The `file_offset` for row groups was not being populated according to the specification; this has been corrected.
Column selection now works for repeated columns and structs of more than one level.
An error with large files when built with Thrift 0.14 was fixed (ARROW-13655).
The ParquetVersion enum was updated with more values to support finer-grained Parquet format version selection (ARROW-13794).
Writer performance was improved by avoiding repeated dynamic casts (ARROW-13965).
In Go, this release includes improved support for dictionary arrays, as well as integration testing with the other Arrow implementations for the primitive and decimal types.

- Fixes to `FromBigInt` (#10796)
- Tests are run with the `assert` build tag in CI from now on
- Several bugs were fixed, including a bug when writing slices of String, Binary or FixedWidthType arrays via `ipc.Writer` (#11270, #11276)
- Added the `MakeArrayFromScalar` function (#11252)

In JavaScript, this release fixes builds with the latest TypeScript versions and ESM tree shaking.
Deprecation notice: in Arrow 7, we will remove the compute code from Arrow JS.
New `pyarrow.compute` functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions. `copy_files` is now available in Python.

In R, this release adds grouped aggregation and joins in the `dplyr` interface, on top of the new Arrow C++ query engine. There is also support for using `duckdb` to query Arrow datasets. For more details, see the [complete R changelog][4].
The updates of Red Arrow etc. consist of the following improvements:

- Added `Arrow::RecordBatchReader`
- Added `Arrow::Table#[]` and `Arrow::RecordBatch#[]`
- Added `Arrow::TableConcatenateOptions` and conversion from a `Hash` for convenience
- Added `Arrow::Expression` and conversion from `Array` and `Hash` for convenience
- Added `ArrowFlight::Client#do_get` support

The updates of Arrow GLib etc. consist of the following improvements:

- Added `garrow_record_batch_reader_new`
- Added `garrow_record_batch_reader_read_all`
- Added `garrow_union_scalar_get_type_code`
- Added `type_code` support in union scalar types
- Added `GArrowCountOptions` and let count functions support it
- Added `GArrowSetLookupOptions` for options of the `is_in` and `index_in` kernels
- Added `GArrowVarianceOptions` to specify the calculation options for variance and standard deviation kernels
- Added `GArrowFunctionDoc`
- Added `GArrowTableConcatenateOptions` and let `garrow_table_concatenate` support it
- Added `gadataset_scanner_builder_set_filter`
- Made `use_async` of `GADatasetScannerBuilder` a property, and removed `gadataset_scanner_builder_use_async`
- Added `gaflight_client_do_get` and `gaflight_server_do_get`
- Added `GAFlightStreamReader`, `GAFlightStreamChunk`, `GAFlightRecordBatchReader`, `GAFlightDataStream`, `GAFlightRecordBatchStream`, and `GAFlightServerCallContext`
- Added the `gaflight_client_list_flights` function
- Added the `gparquet_arrow_file_reader_get_n_rows` function

Rust continues to release minor versions every 2 weeks in addition to a major version with the rest of the Arrow language implementations. Thus most enhancements have been incrementally released over the last 3 months as part of the 5.x releases.
The DataFusion and Ballista subprojects have begun releasing at their own cadence, which is expected to continue in the next few weeks.
Major changes in the 6.0.0 release include support for the `MapArray` array type, improved lower-level `ArrayData` APIs to better communicate safety, and a faster (but unstable) sorting kernel.

For additional details on the 6.0.0 Rust implementation, please see the [Arrow Rust CHANGELOG][5].
[2]: {{ site.baseurl }}/release/6.0.0.html#contributors
[3]: {{ site.baseurl }}/release/6.0.0.html
[4]: {{ site.baseurl }}/docs/r/news/
[5]: https://github.com/apache/arrow-rs/blob/6.0.0/CHANGELOG.md