<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above meta tags *must* come first in the head; any other head content must come *after* these tags -->
<title>Apache Arrow 4.0.0 Release | Apache Arrow</title>
<!-- Begin Jekyll SEO tag v2.8.0 -->
<meta name="generator" content="Jekyll v4.4.1" />
<meta property="og:title" content="Apache Arrow 4.0.0 Release" />
<meta name="author" content="pmc" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions! Arrow Flight RPC notes In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default. Arrow Flight is now packaged for C#/.NET. Linux packages notes There are Linux packages for C++ and C GLib. They were provided by Bintray, but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users need to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes. For Debian GNU/Linux and Ubuntu users: Users need to change the apache-arrow-archive-keyring install instruction: Package name is changed to apache-arrow-apt-source. Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... For CentOS and Red Hat Enterprise Linux users: Users need to change the apache-arrow-release install instruction: Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... 
C++ notes The Arrow C++ library now includes a vcpkg.json manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the Building Arrow C++ and Developing on Windows docs pages for details. The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316). Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull (ARROW-10580). Compute layer Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal. Compute functions quantile (ARROW-10831) and power (ARROW-11070) have been added for numeric data. Compute functions for string processing have been added for: Trimming characters (ARROW-9128). Extracting substrings captured by a regex pattern (extract_regex, ARROW-10195). Computing UTF8 string lengths (utf8_length, ARROW-11693). Matching strings against regex pattern (match_substring_regex, ARROW-12134). Replacing non-overlapping substrings that match a literal pattern or regular expression (replace_substring and replace_substring_regex, ARROW-10306). It is now possible to sort decimal and fixed-width binary data (ARROW-11662). The precision of the sum kernel was improved (ARROW-11758). CSV A CSV writer has been added (ARROW-2229). The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031). Dataset Arrow Datasets received various performance improvements and new features. 
Some highlights: New columns can be projected from arbitrary expressions at scan time (ARROW-11174) Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658) Null partition keys can be written (ARROW-10438) Compressed CSV files can be read (ARROW-10372) Filesystems support async operations (ARROW-10846) Usage and API documentation were added (ARROW-11677) Files and filesystems Fixed some rare instances where GZip files could not be read properly (ARROW-12169). Support for setting S3 proxy parameters has been added (ARROW-8900). The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391). IPC The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838). The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797). The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406). An interoperability issue with the C# implementation was fixed (ARROW-12100). JSON A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065). ORC The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data. Parquet The Parquet C++ library version is now synced with the Arrow version (ARROW-7830). Parquet DECIMAL statistics were previously calculated incorrectly; this has now been fixed (PARQUET-1655). Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318). C# notes Arrow Flight is now packaged for C#/.NET. 
Go notes The Go implementation now supports IPC buffer compression. Java notes Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow). JavaScript notes The Arrow JS module is now tree-shakeable. Iterating over Tables or Vectors is ~2X faster. Demo The default bundles use modern JS. Python notes Limited support for writing out CSV files (only for types that have a cast implementation to String) is now available. Writing Parquet list types now has the option of enabling the canonical group naming according to the Parquet specification. The ORC Writer is now available. Creating a dataset with pyarrow.dataset.write_dataset is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field(&#39;a&#39;) / ds.field(&#39;b&#39;)) (ARROW-12058). See the Dataset docs for more details. See the C++ notes above for additional details. R notes The dplyr interface to Arrow data gained many new features in this release, including support for mutate(), relocate(), and more. Inside filter() or mutate(), you can also call over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl(), gsub(), etc.) and stringr (str_detect(), str_replace()) spellings. Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset() to partition a very large file without having to read the whole file into memory. For more on what’s in the 4.0.0 R package, see the R changelog. C GLib and Ruby notes C GLib In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++. 
gandiva-glib supports filtering by using the newly introduced GGandivaFilter, GGandivaCondition, and GGandivaSelectableProjector The input property is added in GArrowCSVReader and GArrowJSONReader GNU Autotools, namely configure script, support is dropped GADScanContext is removed, and use_threads property is moved to GADScanOptions garrow_chunked_array_combine function is added garrow_array_concatenate function is added GADFragment and its subclass GADInMemoryFragment are added GADScanTask now holds the corresponding GADFragment gad_scan_options_replace_schema function is removed The name of Decimal128DataType is changed to decimal128 Ruby In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib. ArrowDataset::ScanContext is removed, and use_threads attribute is moved to ArrowDataset::ScanOptions Arrow::Array#concatenate is added; it can concatenate not only an Arrow::Array but also a normal Array Arrow::SortKey and Arrow::SortOptions are added for accepting Ruby objects as sort key and options ArrowDataset::InMemoryFragment is added Rust notes This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold. 
Arrow Improved LargeUtf8 support Improved null handling in AND/OR kernels Added JSON writer support (ARROW-11310) JSON reader improvements LargeUTF8 Improved schema inference for nested list and struct types Various performance improvements IPC writer no longer calls finish() implicitly on drop Compute kernels Support for optional limit in sort kernel Divide by a single scalar Support for casting to timestamps Cast: Improved support between casting List, LargeList, Int32, Int64, Date64 Kernel to combine two arrays based on boolean mask Pow kernel new_null_array for creating Arrays full of nulls. Parquet Added support for filtering row groups (used by DataFusion to implement filter push-down) Added support for Parquet v 2.0 logical types DataFusion New Features SQL Support CTEs UNION HAVING EXTRACT SHOW TABLES SHOW COLUMNS INTERVAL SQL Information schema Support GROUP BY for more data types, including dictionary columns, boolean, Date32 Extensibility API Catalogs and schemas support Table deregistration Better support for multiple optimizers User defined functions can now provide specialized implementations for scalar values Physical Plans Hash Repartitioning SQL Metrics Additional Postgres compatible function library: Length functions Pad/trim functions Concat functions Ascii/Unicode functions Regex Proper identifier case identification (e.g. 
“Foo” vs Foo vs foo) Upgraded to Tokio 1.x Performance Improvements: LIMIT pushdown Constant folding Partitioned hash join Create hashes vectorized in hash join Improve parallelism using repartitioning pass Improved hash aggregate performance with large number of grouping values Predicate pushdown support for table scans Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics API Changes DataFrame methods now take Vec&lt;Expr&gt; rather than &amp;[Expr] TableProvider now consistently uses Arc&lt;TableProvider&gt; rather than Box&lt;TableProvider&gt; Ballista Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0" />
<meta property="og:description" content="The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions! Arrow Flight RPC notes In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default. Arrow Flight is now packaged for C#/.NET. Linux packages notes There are Linux packages for C++ and C GLib. They were provided by Bintray, but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users need to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes. For Debian GNU/Linux and Ubuntu users: Users need to change the apache-arrow-archive-keyring install instruction: Package name is changed to apache-arrow-apt-source. Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... For CentOS and Red Hat Enterprise Linux users: Users need to change the apache-arrow-release install instruction: Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... 
C++ notes The Arrow C++ library now includes a vcpkg.json manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the Building Arrow C++ and Developing on Windows docs pages for details. The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316). Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull (ARROW-10580). Compute layer Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal. Compute functions quantile (ARROW-10831) and power (ARROW-11070) have been added for numeric data. Compute functions for string processing have been added for: Trimming characters (ARROW-9128). Extracting substrings captured by a regex pattern (extract_regex, ARROW-10195). Computing UTF8 string lengths (utf8_length, ARROW-11693). Matching strings against regex pattern (match_substring_regex, ARROW-12134). Replacing non-overlapping substrings that match a literal pattern or regular expression (replace_substring and replace_substring_regex, ARROW-10306). It is now possible to sort decimal and fixed-width binary data (ARROW-11662). The precision of the sum kernel was improved (ARROW-11758). CSV A CSV writer has been added (ARROW-2229). The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031). Dataset Arrow Datasets received various performance improvements and new features. 
Some highlights: New columns can be projected from arbitrary expressions at scan time (ARROW-11174) Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658) Null partition keys can be written (ARROW-10438) Compressed CSV files can be read (ARROW-10372) Filesystems support async operations (ARROW-10846) Usage and API documentation were added (ARROW-11677) Files and filesystems Fixed some rare instances where GZip files could not be read properly (ARROW-12169). Support for setting S3 proxy parameters has been added (ARROW-8900). The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391). IPC The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838). The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797). The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406). An interoperability issue with the C# implementation was fixed (ARROW-12100). JSON A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065). ORC The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data. Parquet The Parquet C++ library version is now synced with the Arrow version (ARROW-7830). Parquet DECIMAL statistics were previously calculated incorrectly; this has now been fixed (PARQUET-1655). Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318). C# notes Arrow Flight is now packaged for C#/.NET. 
Go notes The Go implementation now supports IPC buffer compression. Java notes Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow). JavaScript notes The Arrow JS module is now tree-shakeable. Iterating over Tables or Vectors is ~2X faster. Demo The default bundles use modern JS. Python notes Limited support for writing out CSV files (only for types that have a cast implementation to String) is now available. Writing Parquet list types now has the option of enabling the canonical group naming according to the Parquet specification. The ORC Writer is now available. Creating a dataset with pyarrow.dataset.write_dataset is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field(&#39;a&#39;) / ds.field(&#39;b&#39;)) (ARROW-12058). See the Dataset docs for more details. See the C++ notes above for additional details. R notes The dplyr interface to Arrow data gained many new features in this release, including support for mutate(), relocate(), and more. Inside filter() or mutate(), you can also call over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl(), gsub(), etc.) and stringr (str_detect(), str_replace()) spellings. Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset() to partition a very large file without having to read the whole file into memory. For more on what’s in the 4.0.0 R package, see the R changelog. C GLib and Ruby notes C GLib In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++. 
gandiva-glib supports filtering by using the newly introduced GGandivaFilter, GGandivaCondition, and GGandivaSelectableProjector The input property is added in GArrowCSVReader and GArrowJSONReader GNU Autotools, namely configure script, support is dropped GADScanContext is removed, and use_threads property is moved to GADScanOptions garrow_chunked_array_combine function is added garrow_array_concatenate function is added GADFragment and its subclass GADInMemoryFragment are added GADScanTask now holds the corresponding GADFragment gad_scan_options_replace_schema function is removed The name of Decimal128DataType is changed to decimal128 Ruby In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib. ArrowDataset::ScanContext is removed, and use_threads attribute is moved to ArrowDataset::ScanOptions Arrow::Array#concatenate is added; it can concatenate not only an Arrow::Array but also a normal Array Arrow::SortKey and Arrow::SortOptions are added for accepting Ruby objects as sort key and options ArrowDataset::InMemoryFragment is added Rust notes This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold. 
Arrow Improved LargeUtf8 support Improved null handling in AND/OR kernels Added JSON writer support (ARROW-11310) JSON reader improvements LargeUTF8 Improved schema inference for nested list and struct types Various performance improvements IPC writer no longer calls finish() implicitly on drop Compute kernels Support for optional limit in sort kernel Divide by a single scalar Support for casting to timestamps Cast: Improved support between casting List, LargeList, Int32, Int64, Date64 Kernel to combine two arrays based on boolean mask Pow kernel new_null_array for creating Arrays full of nulls. Parquet Added support for filtering row groups (used by DataFusion to implement filter push-down) Added support for Parquet v 2.0 logical types DataFusion New Features SQL Support CTEs UNION HAVING EXTRACT SHOW TABLES SHOW COLUMNS INTERVAL SQL Information schema Support GROUP BY for more data types, including dictionary columns, boolean, Date32 Extensibility API Catalogs and schemas support Table deregistration Better support for multiple optimizers User defined functions can now provide specialized implementations for scalar values Physical Plans Hash Repartitioning SQL Metrics Additional Postgres compatible function library: Length functions Pad/trim functions Concat functions Ascii/Unicode functions Regex Proper identifier case identification (e.g. 
“Foo” vs Foo vs foo) Upgraded to Tokio 1.x Performance Improvements: LIMIT pushdown Constant folding Partitioned hash join Create hashes vectorized in hash join Improve parallelism using repartitioning pass Improved hash aggregate performance with large number of grouping values Predicate pushdown support for table scans Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics API Changes DataFrame methods now take Vec&lt;Expr&gt; rather than &amp;[Expr] TableProvider now consistently uses Arc&lt;TableProvider&gt; rather than Box&lt;TableProvider&gt; Ballista Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0" />
<link rel="canonical" href="https://arrow.apache.org/blog/2021/05/03/4.0.0-release/" />
<meta property="og:url" content="https://arrow.apache.org/blog/2021/05/03/4.0.0-release/" />
<meta property="og:site_name" content="Apache Arrow" />
<meta property="og:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" />
<meta property="og:type" content="article" />
<meta property="article:published_time" content="2021-05-03T02:00:00-04:00" />
<meta name="twitter:card" content="summary_large_image" />
<meta property="twitter:image" content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" />
<meta property="twitter:title" content="Apache Arrow 4.0.0 Release" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"pmc"},"dateModified":"2021-05-03T02:00:00-04:00","datePublished":"2021-05-03T02:00:00-04:00","description":"The Apache Arrow team is pleased to announce the 4.0.0 release. This covers 3 months of development work and includes 711 resolved issues from 114 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane have been invited as committers to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions! Arrow Flight RPC notes In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default. Arrow Flight is now packaged for C#/.NET. Linux packages notes There are Linux packages for C++ and C GLib. They were provided by Bintray, but Bintray is no longer available as of 2021-05-01. They are provided by Artifactory now. Users need to change the install instructions because the URL has changed. See the install page for new instructions. Here is a summary of needed changes. For Debian GNU/Linux and Ubuntu users: Users need to change the apache-arrow-archive-keyring install instruction: Package name is changed to apache-arrow-apt-source. Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... For CentOS and Red Hat Enterprise Linux users: Users need to change the apache-arrow-release install instruction: Download URL is changed to https://apache.jfrog.io/artifactory/arrow/... from https://apache.bintray.com/arrow/.... 
C++ notes The Arrow C++ library now includes a vcpkg.json manifest file and a new CMake option -DARROW_DEPENDENCY_SOURCE=VCPKG to simplify installation of dependencies using the vcpkg package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the Building Arrow C++ and Developing on Windows docs pages for details. The default memory allocator on macOS has been changed from jemalloc to mimalloc, yielding performance benefits on a range of macro-benchmarks (ARROW-12316). Non-monotonic dense union offsets are now disallowed as per the Arrow format specification, and return an error in Array::ValidateFull (ARROW-10580). Compute layer Automatic implicit casting in compute kernels (ARROW-8919). For example, for the addition of two arrays, the arrays are first cast to their common numeric type instead of erroring when the types are not equal. Compute functions quantile (ARROW-10831) and power (ARROW-11070) have been added for numeric data. Compute functions for string processing have been added for: Trimming characters (ARROW-9128). Extracting substrings captured by a regex pattern (extract_regex, ARROW-10195). Computing UTF8 string lengths (utf8_length, ARROW-11693). Matching strings against regex pattern (match_substring_regex, ARROW-12134). Replacing non-overlapping substrings that match a literal pattern or regular expression (replace_substring and replace_substring_regex, ARROW-10306). It is now possible to sort decimal and fixed-width binary data (ARROW-11662). The precision of the sum kernel was improved (ARROW-11758). CSV A CSV writer has been added (ARROW-2229). The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031). Dataset Arrow Datasets received various performance improvements and new features. 
Some highlights: New columns can be projected from arbitrary expressions at scan time (ARROW-11174) Read performance was improved for Parquet on high-latency filesystems (ARROW-11601) and in general when there are thousands of files or more (ARROW-8658) Null partition keys can be written (ARROW-10438) Compressed CSV files can be read (ARROW-10372) Filesystems support async operations (ARROW-10846) Usage and API documentation were added (ARROW-11677) Files and filesystems Fixed some rare instances where GZip files could not be read properly (ARROW-12169). Support for setting S3 proxy parameters has been added (ARROW-8900). The HDFS filesystem is now able to write more than 2GB of data at a time (ARROW-11391). IPC The IPC reader now supports reading data with dictionaries shared between different schema fields (ARROW-11838). The IPC reader now supports optional endian conversion when receiving IPC data represented with a different endianness. It is therefore possible to exchange Arrow data between systems with different endiannesses (ARROW-8797). The IPC file writer now optionally unifies dictionaries when writing a file in a single shot, instead of erroring out if unequal dictionaries are encountered (ARROW-10406). An interoperability issue with the C# implementation was fixed (ARROW-12100). JSON A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065). ORC The Arrow C++ library now includes an ORC file writer. Hence it is possible to both read and write ORC files from/to Arrow data. Parquet The Parquet C++ library version is now synced with the Arrow version (ARROW-7830). Parquet DECIMAL statistics were previously calculated incorrectly; this has now been fixed (PARQUET-1655). Initial support for high-level Parquet encryption APIs similar to those in parquet-mr is available (ARROW-9318). C# notes Arrow Flight is now packaged for C#/.NET. 
Go notes The Go implementation now supports IPC buffer compression. Java notes Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow). JavaScript notes The Arrow JS module is now tree-shakeable. Iterating over Tables or Vectors is ~2X faster. Demo The default bundles use modern JS. Python notes Limited support for writing out CSV files (only for types that have a cast implementation to String) is now available. Writing Parquet list types now has the option of enabling the canonical group naming according to the Parquet specification. The ORC Writer is now available. Creating a dataset with pyarrow.dataset.write_dataset is now possible from a Python iterator of record batches (ARROW-10882). The Dataset interface can now use custom projections using expressions when scanning (ARROW-11750). The expressions gained basic support for arithmetic operations (e.g. ds.field(&#39;a&#39;) / ds.field(&#39;b&#39;)) (ARROW-12058). See the Dataset docs for more details. See the C++ notes above for additional details. R notes The dplyr interface to Arrow data gained many new features in this release, including support for mutate(), relocate(), and more. Inside filter() or mutate(), you can also call over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (grepl(), gsub(), etc.) and stringr (str_detect(), str_replace()) spellings. Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use write_dataset() to partition a very large file without having to read the whole file into memory. For more on what’s in the 4.0.0 R package, see the R changelog. C GLib and Ruby notes C GLib In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++. 
gandiva-glib supports filtering by using the newly introduced GGandivaFilter, GGandivaCondition, and GGandivaSelectableProjector The input property is added in GArrowCSVReader and GArrowJSONReader GNU Autotools, namely configure script, support is dropped GADScanContext is removed, and use_threads property is moved to GADScanOptions garrow_chunked_array_combine function is added garrow_array_concatenate function is added GADFragment and its subclass GADInMemoryFragment are added GADScanTask now holds the corresponding GADFragment gad_scan_options_replace_schema function is removed The name of Decimal128DataType is changed to decimal128 Ruby In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib. ArrowDataset::ScanContext is removed, and use_threads attribute is moved to ArrowDataset::ScanOptions Arrow::Array#concatenate is added; it can concatenate not only an Arrow::Array but also a normal Array Arrow::SortKey and Arrow::SortOptions are added for accepting Ruby objects as sort key and options ArrowDataset::InMemoryFragment is added Rust notes This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold. 
Arrow Improved LargeUtf8 support Improved null handling in AND/OR kernels Added JSON writer support (ARROW-11310) JSON reader improvements LargeUTF8 Improved schema inference for nested list and struct types Various performance improvements IPC writer no longer calls finish() implicitly on drop Compute kernels Support for optional limit in sort kernel Divide by a single scalar Support for casting to timestamps Cast: Improved support between casting List, LargeList, Int32, Int64, Date64 Kernel to combine two arrays based on boolean mask Pow kernel new_null_array for creating Arrays full of nulls. Parquet Added support for filtering row groups (used by DataFusion to implement filter push-down) Added support for Parquet v 2.0 logical types DataFusion New Features SQL Support CTEs UNION HAVING EXTRACT SHOW TABLES SHOW COLUMNS INTERVAL SQL Information schema Support GROUP BY for more data types, including dictionary columns, boolean, Date32 Extensibility API Catalogs and schemas support Table deregistration Better support for multiple optimizers User defined functions can now provide specialized implementations for scalar values Physical Plans Hash Repartitioning SQL Metrics Additional Postgres compatible function library: Length functions Pad/trim functions Concat functions Ascii/Unicode functions Regex Proper identifier case identification (e.g. 
“Foo” vs Foo vs foo) Upgraded to Tokio 1.x Performance Improvements: LIMIT pushdown Constant folding Partitioned hash join Create hashes vectorized in hash join Improve parallelism using repartitioning pass Improved hash aggregate performance with large number of grouping values Predicate pushdown support for table scans Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics API Changes DataFrame methods now take Vec&lt;Expr&gt; rather than &amp;[Expr] TableProvider now consistently uses Arc&lt;TableProvider&gt; rather than Box&lt;TableProvider&gt; Ballista Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0","headline":"Apache Arrow 4.0.0 Release","image":"https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png","mainEntityOfPage":{"@type":"WebPage","@id":"https://arrow.apache.org/blog/2021/05/03/4.0.0-release/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://arrow.apache.org/img/logo.png"},"name":"pmc"},"url":"https://arrow.apache.org/blog/2021/05/03/4.0.0-release/"}</script>
<!-- End Jekyll SEO tag -->
<!-- favicons -->
<link rel="icon" type="image/png" sizes="16x16" href="/img/favicon-16x16.png" id="light1">
<link rel="icon" type="image/png" sizes="32x32" href="/img/favicon-32x32.png" id="light2">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="/img/apple-touch-icon.png" id="light3">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="/img/apple-touch-icon-120x120.png" id="light4">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="/img/apple-touch-icon-76x76.png" id="light5">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="/img/apple-touch-icon-60x60.png" id="light6">
<!-- dark mode favicons -->
<link rel="icon" type="image/png" sizes="16x16" href="/img/favicon-16x16-dark.png" id="dark1">
<link rel="icon" type="image/png" sizes="32x32" href="/img/favicon-32x32-dark.png" id="dark2">
<link rel="apple-touch-icon" type="image/png" sizes="180x180" href="/img/apple-touch-icon-dark.png" id="dark3">
<link rel="apple-touch-icon" type="image/png" sizes="120x120" href="/img/apple-touch-icon-120x120-dark.png" id="dark4">
<link rel="apple-touch-icon" type="image/png" sizes="76x76" href="/img/apple-touch-icon-76x76-dark.png" id="dark5">
<link rel="apple-touch-icon" type="image/png" sizes="60x60" href="/img/apple-touch-icon-60x60-dark.png" id="dark6">
<script>
// Switch to the dark-mode favicons if prefers-color-scheme: dark
function onUpdate() {
light1 = document.querySelector('link#light1');
light2 = document.querySelector('link#light2');
light3 = document.querySelector('link#light3');
light4 = document.querySelector('link#light4');
light5 = document.querySelector('link#light5');
light6 = document.querySelector('link#light6');
dark1 = document.querySelector('link#dark1');
dark2 = document.querySelector('link#dark2');
dark3 = document.querySelector('link#dark3');
dark4 = document.querySelector('link#dark4');
dark5 = document.querySelector('link#dark5');
dark6 = document.querySelector('link#dark6');
if (matcher.matches) {
light1.remove();
light2.remove();
light3.remove();
light4.remove();
light5.remove();
light6.remove();
document.head.append(dark1);
document.head.append(dark2);
document.head.append(dark3);
document.head.append(dark4);
document.head.append(dark5);
document.head.append(dark6);
} else {
dark1.remove();
dark2.remove();
dark3.remove();
dark4.remove();
dark5.remove();
dark6.remove();
document.head.append(light1);
document.head.append(light2);
document.head.append(light3);
document.head.append(light4);
document.head.append(light5);
document.head.append(light6);
}
}
matcher = window.matchMedia('(prefers-color-scheme: dark)');
matcher.addListener(onUpdate);
onUpdate();
</script>
<link href="/css/main.css" rel="stylesheet">
<link href="/css/syntax.css" rel="stylesheet">
<script src="/javascript/main.js"></script>
<!-- Matomo -->
<script>
var _paq = window._paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
/* We explicitly disable cookie tracking to avoid privacy issues */
_paq.push(['disableCookies']);
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="https://analytics.apache.org/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '20']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
<link type="application/atom+xml" rel="alternate" href="https://arrow.apache.org/feed.xml" title="Apache Arrow" />
</head>
<body class="wrap">
<header>
<nav class="navbar navbar-expand-md navbar-dark bg-dark">
<a class="navbar-brand no-padding" href="/"><img src="/img/arrow-inverse-300px.png" height="40px"></a>
<button class="navbar-toggler ml-auto" type="button" data-toggle="collapse" data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse justify-content-end" id="arrow-navbar">
<ul class="nav navbar-nav">
<li class="nav-item"><a class="nav-link" href="/overview/" role="button" aria-haspopup="true" aria-expanded="false">Overview</a></li>
<li class="nav-item"><a class="nav-link" href="/faq/" role="button" aria-haspopup="true" aria-expanded="false">FAQ</a></li>
<li class="nav-item"><a class="nav-link" href="/blog" role="button" aria-haspopup="true" aria-expanded="false">Blog</a></li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownGetArrow" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Get Arrow
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdownGetArrow">
<a class="dropdown-item" href="/install/">Install</a>
<a class="dropdown-item" href="/release/">Releases</a>
</div>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownDocumentation" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Docs
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdownDocumentation">
<a class="dropdown-item" href="/docs">Project Docs</a>
<a class="dropdown-item" href="/docs/format/Columnar.html">Format</a>
<hr>
<a class="dropdown-item" href="/docs/c_glib">C GLib</a>
<a class="dropdown-item" href="/docs/cpp">C++</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/blob/main/csharp/README.md" target="_blank" rel="noopener">C#</a>
<a class="dropdown-item" href="https://godoc.org/github.com/apache/arrow/go/arrow" target="_blank" rel="noopener">Go</a>
<a class="dropdown-item" href="/docs/java">Java</a>
<a class="dropdown-item" href="/docs/js">JavaScript</a>
<a class="dropdown-item" href="/julia/">Julia</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/blob/main/matlab/README.md" target="_blank" rel="noopener">MATLAB</a>
<a class="dropdown-item" href="/docs/python">Python</a>
<a class="dropdown-item" href="/docs/r">R</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/blob/main/ruby/README.md" target="_blank" rel="noopener">Ruby</a>
<a class="dropdown-item" href="https://docs.rs/arrow/latest" target="_blank" rel="noopener">Rust</a>
<a class="dropdown-item" href="/swift">Swift</a>
</div>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownSource" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Source
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdownSource">
<a class="dropdown-item" href="https://github.com/apache/arrow" target="_blank" rel="noopener">Main Repo</a>
<hr>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/c_glib" target="_blank" rel="noopener">C GLib</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/cpp" target="_blank" rel="noopener">C++</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/csharp" target="_blank" rel="noopener">C#</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-go" target="_blank" rel="noopener">Go</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-java" target="_blank" rel="noopener">Java</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-js" target="_blank" rel="noopener">JavaScript</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-julia" target="_blank" rel="noopener">Julia</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/matlab" target="_blank" rel="noopener">MATLAB</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/python" target="_blank" rel="noopener">Python</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/r" target="_blank" rel="noopener">R</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/tree/main/ruby" target="_blank" rel="noopener">Ruby</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-rs" target="_blank" rel="noopener">Rust</a>
<a class="dropdown-item" href="https://github.com/apache/arrow-swift" target="_blank" rel="noopener">Swift</a>
</div>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownSubprojects" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Subprojects
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdownSubprojects">
<a class="dropdown-item" href="/adbc">ADBC</a>
<a class="dropdown-item" href="/docs/format/Flight.html">Arrow Flight</a>
<a class="dropdown-item" href="/docs/format/FlightSql.html">Arrow Flight SQL</a>
<a class="dropdown-item" href="https://datafusion.apache.org" target="_blank" rel="noopener">DataFusion</a>
<a class="dropdown-item" href="/nanoarrow">nanoarrow</a>
</div>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownCommunity" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
Community
</a>
<div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
<a class="dropdown-item" href="/community/">Communication</a>
<a class="dropdown-item" href="/docs/developers/index.html">Contributing</a>
<a class="dropdown-item" href="https://github.com/apache/arrow/issues" target="_blank" rel="noopener">Issue Tracker</a>
<a class="dropdown-item" href="/committers/">Governance</a>
<a class="dropdown-item" href="/use_cases/">Use Cases</a>
<a class="dropdown-item" href="/powered_by/">Powered By</a>
<a class="dropdown-item" href="/visual_identity/">Visual Identity</a>
<a class="dropdown-item" href="/security/">Security</a>
<a class="dropdown-item" href="https://www.apache.org/foundation/policies/conduct.html" target="_blank" rel="noopener">Code of Conduct</a>
</div>
</li>
<li class="nav-item dropdown">
<a class="nav-link dropdown-toggle" href="#" id="navbarDropdownASF" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
ASF Links
</a>
<div class="dropdown-menu dropdown-menu-right" aria-labelledby="navbarDropdownASF">
<a class="dropdown-item" href="https://www.apache.org/" target="_blank" rel="noopener">ASF Website</a>
<a class="dropdown-item" href="https://www.apache.org/licenses/" target="_blank" rel="noopener">License</a>
<a class="dropdown-item" href="https://www.apache.org/foundation/sponsorship.html" target="_blank" rel="noopener">Donate</a>
<a class="dropdown-item" href="https://www.apache.org/foundation/thanks.html" target="_blank" rel="noopener">Thanks</a>
<a class="dropdown-item" href="https://www.apache.org/security/" target="_blank" rel="noopener">Security</a>
</div>
</li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
</header>
<div class="container p-4 pt-5">
<div class="col-md-8 mx-auto">
<main role="main" class="pb-5">
<h1>
Apache Arrow 4.0.0 Release
</h1>
<hr class="mt-4 mb-3">
<p class="mb-4 pb-1">
<span class="badge badge-secondary">Published</span>
<span class="published mr-3">
03 May 2021
</span>
<br>
<span class="badge badge-secondary">By</span>
<a class="mr-3" href="https://arrow.apache.org">The Apache Arrow PMC (pmc) </a>
</p>
<!--
-->
<p>The Apache Arrow team is pleased to announce the 4.0.0 release. This covers
3 months of development work and includes <a href="https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%204.0.0" target="_blank" rel="noopener"><strong>711 resolved issues</strong></a>
from <a href="/release/4.0.0.html#contributors"><strong>114 distinct contributors</strong></a>. See the <a href="/install/">Install Page</a> to learn how to
get the libraries for your platform.</p>
<p>The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bugfixes and improvements have been made: we refer
you to the <a href="/release/4.0.0.html">complete changelog</a>.</p>
<h2>Community</h2>
<p>Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane
have been invited as committers to Arrow,
and Andrew Lamb and Jorge Leitão have joined the Project Management Committee
(PMC). Thank you for all of your contributions!</p>
<h2>Arrow Flight RPC notes</h2>
<p>In Java, applications can now enable zero-copy optimizations when writing
data (ARROW-11066). This potentially breaks source compatibility, so it is
not enabled by default.</p>
<p>Arrow Flight is now packaged for C#/.NET.</p>
<h2>Linux packages notes</h2>
<p>There are Linux packages for C++ and C GLib. They were previously hosted on Bintray,
but <a href="https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/" target="_blank" rel="noopener">Bintray is no longer available as of 2021-05-01</a>. They are now hosted
on Artifactory. Users need to update their install instructions because the download URL
has changed. See <a href="/install/">the install page</a> for the new instructions. Here is a
summary of the needed changes.</p>
<p>For Debian GNU/Linux and Ubuntu users:</p>
<ul>
<li>Users need to change the <code>apache-arrow-archive-keyring</code> install instruction:
<ul>
<li>The package name has changed to <code>apache-arrow-apt-source</code>.</li>
<li>The download URL has changed from <code>https://apache.bintray.com/arrow/...</code> to <code>https://apache.jfrog.io/artifactory/arrow/...</code>.</li>
</ul>
</li>
</ul>
<p>For CentOS and Red Hat Enterprise Linux users:</p>
<ul>
<li>Users need to change the <code>apache-arrow-release</code> install instruction:
<ul>
<li>The download URL has changed from <code>https://apache.bintray.com/arrow/...</code> to <code>https://apache.jfrog.io/artifactory/arrow/...</code>.</li>
</ul>
</li>
</ul>
<h2>C++ notes</h2>
<p>The Arrow C++ library now includes a <a href="https://github.com/apache/arrow/blob/master/cpp/vcpkg.json" target="_blank" rel="noopener"><code>vcpkg.json</code></a>
manifest file and a new CMake option <code>-DARROW_DEPENDENCY_SOURCE=VCPKG</code> to
simplify installation of dependencies using the <a href="https://github.com/microsoft/vcpkg" target="_blank" rel="noopener">vcpkg</a>
package manager. This provides an alternative means of installing C++ library
dependencies on Linux, macOS, and Windows. See the
<a href="/docs/developers/cpp/building.html">Building Arrow C++</a>
and <a href="/docs/developers/cpp/windows.html">Developing on Windows</a>
docs pages for details.</p>
<p>The default memory allocator on macOS has been changed from jemalloc to mimalloc,
yielding performance benefits on a range of macro-benchmarks (ARROW-12316).</p>
<p>Non-monotonic dense union offsets are now disallowed, as per the Arrow format
specification, and cause <code>Array::ValidateFull</code> to return an error (ARROW-10580).</p>
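<p>To make the monotonicity requirement concrete, here is a minimal sketch in plain Python. It is not the C++ implementation of <code>Array::ValidateFull</code>; the function name and the list-based representation of the type ids and offsets buffers are hypothetical, and treating a repeated offset as acceptable is an assumption.</p>

```python
from typing import Dict, Sequence


def dense_union_offsets_are_valid(type_ids: Sequence[int], offsets: Sequence[int]) -> bool:
    """Return True if, for every child, the offsets never decrease.

    type_ids[i] names the child that slot i refers to and offsets[i] is the
    position inside that child. Hypothetical helper for illustration only.
    """
    if len(type_ids) != len(offsets):
        return False
    last_offset: Dict[int, int] = {}  # child type id -> last offset seen
    for type_id, offset in zip(type_ids, offsets):
        previous = last_offset.get(type_id, 0)
        # Because previous is never negative, this also rejects negative offsets.
        if previous > offset:
            return False
        last_offset[type_id] = offset
    return True


# Child 0 is visited at offsets 0 then 1, child 1 at offsets 0 then 1.
print(dense_union_offsets_are_valid([0, 1, 0, 1], [0, 0, 1, 1]))
# Child 0 is visited at offset 1 and then at offset 0, which goes backwards.
print(dense_union_offsets_are_valid([0, 0, 1], [1, 0, 0]))
```

<p>The check keeps one running offset per child, so it needs a single pass over the union regardless of how many children there are.</p>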
<h3>Compute layer</h3>
<p>Compute kernels now perform automatic implicit casting (ARROW-8919). For example,
when adding two arrays of different types, the arrays are first cast to their common
numeric type instead of raising an error.</p>
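<p>The idea of a "common numeric type" can be pictured with a small plain-Python sketch. The promotion rule below is an assumption chosen for illustration (widest float wins; mixed signed and unsigned integers promote to a signed type wide enough for both, capped at 64 bits). The type table and function name are hypothetical, and the exact rules are defined by the library, not by this sketch.</p>

```python
# Hypothetical type table: name -> (kind, bit width). Not Arrow's own data structure.
NUMERIC_TYPES = {
    "int8": ("int", 8), "int16": ("int", 16), "int32": ("int", 32), "int64": ("int", 64),
    "uint8": ("uint", 8), "uint16": ("uint", 16), "uint32": ("uint", 32), "uint64": ("uint", 64),
    "float32": ("float", 32), "float64": ("float", 64),
}


def common_numeric_type(left: str, right: str) -> str:
    """Pick one type that both operands are cast to before the kernel runs."""
    operands = [NUMERIC_TYPES[left], NUMERIC_TYPES[right]]
    float_widths = [width for kind, width in operands if kind == "float"]
    if float_widths:
        # Assumed rule: any float operand makes the result the widest float present.
        return f"float{max(float_widths)}"
    (left_kind, left_width), (right_kind, right_width) = operands
    if left_kind == right_kind:
        # Same signedness: keep the kind and take the wider of the two.
        return f"{left_kind}{max(left_width, right_width)}"
    # Mixed signedness: a signed type needs twice the unsigned width to hold
    # every unsigned value, capped at 64 bits in this sketch.
    signed_width = left_width if left_kind == "int" else right_width
    unsigned_width = right_width if left_kind == "int" else left_width
    return f"int{min(64, max(signed_width, unsigned_width * 2))}"


print(common_numeric_type("int8", "int32"))
print(common_numeric_type("uint32", "int32"))
print(common_numeric_type("int64", "float32"))
```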
<p>Compute functions <code>quantile</code> (ARROW-10831) and <code>power</code> (ARROW-11070) have been
added for numeric data.</p>
<p>Compute functions for string processing have been added for:</p>
<ul>
<li>Trimming characters (ARROW-9128).</li>
<li>Extracting substrings captured by a regex pattern (<code>extract_regex</code>, ARROW-10195).</li>
<li>Computing UTF8 string lengths (<code>utf8_length</code>, ARROW-11693).</li>
<li>Matching strings against regex pattern (<code>match_substring_regex</code>, ARROW-12134).</li>
<li>Replacing non-overlapping substrings that match a literal pattern or regular
expression (<code>replace_substring</code> and <code>replace_substring_regex</code>, ARROW-10306).</li>
</ul>
<p>It is now possible to sort decimal and fixed-width binary data (ARROW-11662).</p>
<p>The precision of the <code>sum</code> kernel was improved (ARROW-11758).</p>
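<p>The note above does not say how the precision was improved. One standard way to reduce floating-point round-off in a sum is pairwise summation, sketched here in plain Python. This is an illustration of that general technique only, and whether it matches the kernel's implementation is not claimed; the function name and the <code>block</code> parameter are hypothetical.</p>

```python
from typing import Sequence


def pairwise_sum(values: Sequence[float], block: int = 8) -> float:
    """Sum values by recursively splitting the input in half.

    Runs of at most `block` elements are added with a simple loop; larger
    inputs are split so that partial sums of similar magnitude are combined,
    which tends to accumulate less round-off than one long running total.
    """
    if block < 1:
        raise ValueError("block must be at least 1")
    count = len(values)
    if block >= count:
        total = 0.0
        for value in values:
            total += value
        return total
    middle = count // 2
    return pairwise_sum(values[:middle], block) + pairwise_sum(values[middle:], block)


print(pairwise_sum([1.0, 2.0, 3.0, 4.0]))
```

<p>The recursion depth grows only logarithmically with the input length, so the extra cost over a plain loop is small.</p>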
<h3>CSV</h3>
<p>A CSV writer has been added (ARROW-2229).</p>
<p>The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).</p>
<h3>Dataset</h3>
<p>Arrow Datasets received various performance improvements and new
features. Some highlights:</p>
<ul>
<li>New columns can be projected from arbitrary expressions at scan time
(ARROW-11174)</li>
<li>Read performance was improved for Parquet on high-latency
filesystems (ARROW-11601) and in general when there are thousands of
files or more (ARROW-8658)</li>
<li>Null partition keys can be written (ARROW-10438)</li>
<li>Compressed CSV files can be read (ARROW-10372)</li>
<li>Filesystems support async operations (ARROW-10846)</li>
<li>Usage and API documentation were added (ARROW-11677)</li>
</ul>
<h3>Files and filesystems</h3>
<p>Fixed some rare instances where GZip files could not be read properly (ARROW-12169).</p>
<p>Support for setting S3 proxy parameters has been added (ARROW-8900).</p>
<p>The HDFS filesystem is now able to write more than 2GB of data at a time
(ARROW-11391).</p>
<h3>IPC</h3>
<p>The IPC reader now supports reading data with dictionaries shared between
different schema fields (ARROW-11838).</p>
<p>The IPC reader now supports optional endian conversion when receiving IPC
data represented with a different endianness. It is therefore possible to
exchange Arrow data between systems with different endiannesses (ARROW-8797).</p>
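<p>At the buffer level, endian conversion of fixed-width values amounts to reversing the bytes of each value. The following plain-Python sketch shows that step for a single buffer. It is not the IPC reader's implementation; the function name is hypothetical, and real Arrow data also involves nested types, validity bitmaps, and offsets buffers that this sketch does not address.</p>

```python
import struct


def swap_endianness(buffer: bytes, byte_width: int) -> bytes:
    """Reverse the byte order of every fixed-width value in a buffer."""
    if byte_width < 1 or len(buffer) % byte_width != 0:
        raise ValueError("buffer length must be a multiple of a positive byte_width")
    swapped = bytearray(len(buffer))
    for start in range(0, len(buffer), byte_width):
        end = start + byte_width
        swapped[start:end] = buffer[start:end][::-1]
    return bytes(swapped)


# Three 32-bit integers written by a big-endian producer...
big_endian = struct.pack(">3i", 1, 2, 3)
# ...become readable as little-endian values after the swap.
little_endian = swap_endianness(big_endian, 4)
print(struct.unpack("<3i", little_endian))
```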
<p>The IPC file writer now optionally unifies dictionaries when writing a
file in a single shot, instead of erroring out if unequal dictionaries are
encountered (ARROW-10406).</p>
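<p>Dictionary unification means building one dictionary that covers every batch and rewriting each batch's indices to point into it. Here is a minimal plain-Python sketch of that idea. It is not the IPC writer's code; the function name and the representation of a batch as a (dictionary, indices) pair are hypothetical, and null handling is left out.</p>

```python
from typing import Dict, List, Sequence, Tuple


def unify_dictionaries(
    batches: Sequence[Tuple[List[str], List[int]]],
) -> Tuple[List[str], List[List[int]]]:
    """Merge per-batch dictionaries and remap each batch's indices."""
    unified: List[str] = []
    position: Dict[str, int] = {}  # value -> index in the unified dictionary
    remapped: List[List[int]] = []
    for dictionary, indices in batches:
        # mapping[i] is the unified index of this batch's dictionary entry i.
        mapping: List[int] = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            mapping.append(position[value])
        remapped.append([mapping[index] for index in indices])
    return unified, remapped


batches = [
    (["a", "b"], [0, 1, 0]),  # decodes to a, b, a
    (["b", "c"], [0, 1, 1]),  # decodes to b, c, c
]
unified, remapped = unify_dictionaries(batches)
print(unified)
print(remapped)
```

<p>After remapping, every batch decodes to the same values as before, but all batches share a single dictionary.</p>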
<p>An interoperability issue with the C# implementation was fixed (ARROW-12100).</p>
<h3>JSON</h3>
<p>A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).</p>
<h3>ORC</h3>
<p>The Arrow C++ library now includes an ORC file writer. Hence it is possible
to both read and write ORC files from/to Arrow data.</p>
<h3>Parquet</h3>
<p>The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).</p>
<p>Parquet DECIMAL statistics were previously calculated incorrectly; this
has now been fixed (PARQUET-1655).</p>
<p>Initial support for high-level Parquet encryption APIs similar to those
in parquet-mr is available (ARROW-9318).</p>
<h2>C# notes</h2>
<p>Arrow Flight is now packaged for C#/.NET.</p>
<h2>Go notes</h2>
<p>The Go implementation now supports IPC buffer compression.</p>
<h2>Java notes</h2>
<p>Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).</p>
<h2>JavaScript notes</h2>
<ul>
<li>The Arrow JS module is now tree-shakeable.</li>
<li>Iterating over Tables or Vectors is ~2X faster. <a href="https://observablehq.com/@domoritz/arrow-js-3-vs-4-iterator" target="_blank" rel="noopener">Demo</a>
</li>
<li>The default bundles use modern JS.</li>
</ul>
<h2>Python notes</h2>
<ul>
<li>Limited support for writing out CSV files (only types that have cast implementation to String) is now available.</li>
<li>Writing parquet list types now has the option of enabling the canonical group naming according to the Parquet specification.</li>
<li>The ORC Writer is now available.</li>
</ul>
<p>Creating a dataset with <code>pyarrow.dataset.write_dataset</code> is now possible from a
Python iterator of record batches (ARROW-10882).
The Dataset interface can now use custom projections using expressions when
scanning (ARROW-11750). The expressions gained basic support for arithmetic
operations (e.g. <code>ds.field('a') / ds.field('b')</code>) (ARROW-12058). See
the <a href="https://arrow.apache.org/docs/python/dataset.html#projecting-columns">Dataset docs</a> for more details.</p>
<p>See the C++ notes above for additional details.</p>
<h2>R notes</h2>
<p>The <code>dplyr</code> interface to Arrow data gained many new features in this release, including support for <code>mutate()</code>, <code>relocate()</code>, and more. Inside <code>filter()</code> or <code>mutate()</code>, you can also call over 100 functions supported by the Arrow C++ library, and many string functions are available under both their base R (<code>grepl()</code>, <code>gsub()</code>, etc.) and <code>stringr</code> (<code>str_detect()</code>, <code>str_replace()</code>) spellings.</p>
<p>Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use <code>write_dataset()</code> to partition a very large file without having to read the whole file into memory.</p>
<p>For more on what’s in the 4.0.0 R package, see the <a href="/docs/r/news/">R changelog</a>.</p>
<h2>C GLib and Ruby notes</h2>
<h3>C GLib</h3>
<p>In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes by Arrow C++.</p>
<ul>
<li>gandiva-glib supports filtering by using the newly introduced <code>GGandivaFilter</code>, <code>GGandivaCondition</code>, and <code>GGandivaSelectableProjector</code>
</li>
<li>The <code>input</code> property is added in <code>GArrowCSVReader</code> and <code>GArrowJSONReader</code>
</li>
<li>Support for GNU Autotools (namely the <code>configure</code> script) is dropped</li>
<li>
<code>GADScanContext</code> is removed, and <code>use_threads</code> property is moved to <code>GADScanOptions</code>
</li>
<li>
<code>garrow_chunked_array_combine</code> function is added</li>
<li>
<code>garrow_array_concatenate</code> function is added</li>
<li>
<code>GADFragment</code> and its subclass <code>GADInMemoryFragment</code> are added</li>
<li>
<code>GADScanTask</code> now holds the corresponding <code>GADFragment</code>
</li>
<li>
<code>gad_scan_options_replace_schema</code> function is removed</li>
<li>The name of <code>Decimal128DataType</code> is changed to <code>decimal128</code>
</li>
</ul>
<h3>Ruby</h3>
<p>In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes by Arrow GLib.</p>
<ul>
<li>
<code>ArrowDataset::ScanContext</code> is removed, and <code>use_threads</code> attribute is moved to <code>ArrowDataset::ScanOptions</code>
</li>
<li>
<code>Arrow::Array#concatenate</code> is added; it can concatenate not only an <code>Arrow::Array</code> but also a normal <code>Array</code>
</li>
<li>
<code>Arrow::SortKey</code> and <code>Arrow::SortOptions</code> are added for accepting Ruby objects as sort key and options</li>
<li>
<code>ArrowDataset::InMemoryFragment</code> is added</li>
</ul>
<h2>Rust notes</h2>
<p>This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the <a href="https://ballistacompute.org/" target="_blank" rel="noopener">Ballista distributed compute project</a> officially to the fold.</p>
<h3>Arrow</h3>
<ul>
<li>Improved LargeUtf8 support</li>
<li>Improved null handling in AND/OR kernels</li>
<li>Added JSON writer support (ARROW-11310)</li>
<li>JSON reader improvements
<ul>
<li>LargeUtf8 support</li>
<li>Improved schema inference for nested list and struct types</li>
</ul>
</li>
<li>Various performance improvements</li>
<li>IPC writer no longer calls finish() implicitly on drop</li>
<li>Compute kernels
<ul>
<li>Support for optional <code>limit</code> in sort kernel</li>
<li>Divide by a single scalar</li>
<li>Support for casting to timestamps</li>
<li>Cast: improved support for casting between List, LargeList, Int32, Int64, and Date64</li>
<li>Kernel to combine two arrays based on boolean mask</li>
<li>Pow kernel</li>
</ul>
</li>
<li>
<code>new_null_array</code> for creating Arrays full of nulls.</li>
</ul>
<h3>Parquet</h3>
<ul>
<li>Added support for filtering row groups (used by DataFusion to implement filter push-down)</li>
<li>Added support for Parquet v2.0 logical types</li>
</ul>
<h3>DataFusion</h3>
<p>New Features</p>
<ul>
<li>
<p>SQL Support</p>
<ul>
<li>CTEs</li>
<li>UNION</li>
<li>HAVING</li>
<li>EXTRACT</li>
<li>SHOW TABLES</li>
<li>SHOW COLUMNS</li>
<li>INTERVAL</li>
<li>SQL Information schema</li>
<li>Support GROUP BY for more data types, including dictionary columns, boolean, Date32</li>
</ul>
</li>
<li>
<p>Extensibility API</p>
<ul>
<li>Catalogs and schemas support</li>
<li>Table deregistration</li>
<li>Better support for multiple optimizers</li>
<li>User defined functions can now provide specialized implementations for scalar values</li>
</ul>
</li>
<li>
<p>Physical Plans</p>
<ul>
<li>Hash Repartitioning</li>
<li>SQL Metrics</li>
</ul>
</li>
<li>
<p>Additional Postgres compatible function library:</p>
<ul>
<li>Length functions</li>
<li>Pad/trim functions</li>
<li>Concat functions</li>
<li>Ascii/Unicode functions</li>
<li>Regex</li>
</ul>
</li>
<li>
<p>Proper identifier case identification (e.g. “Foo” vs Foo vs foo)</p>
</li>
<li>
<p>Upgraded to Tokio 1.x</p>
</li>
</ul>
<p>Performance Improvements:</p>
<ul>
<li>LIMIT pushdown</li>
<li>Constant folding</li>
<li>Partitioned hash join</li>
<li>Create hashes vectorized in hash join</li>
<li>Improve parallelism using repartitioning pass</li>
<li>Improved hash aggregate performance with large number of grouping values</li>
<li>Predicate pushdown support for table scans</li>
<li>Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics</li>
</ul>
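<p>The row-group elimination described in the last item can be pictured with a small sketch. It is written in plain Python rather than Rust and does not use DataFusion's or the Parquet crate's API; the <code>RowGroupStats</code> class and <code>prune_row_groups</code> function are hypothetical, and the filter is assumed to be a simple inclusive range on one integer column.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one column."""
    index: int
    min_value: int
    max_value: int


def prune_row_groups(stats: List[RowGroupStats], lower: int, upper: int) -> List[int]:
    """Return the row groups that may hold a value v with lower <= v <= upper.

    A row group is skipped only when its [min, max] range cannot overlap the
    filter range, so no matching row is ever dropped. Surviving row groups
    still have to be filtered row by row.
    """
    return [
        group.index
        for group in stats
        if group.max_value >= lower and upper >= group.min_value
    ]


stats = [
    RowGroupStats(index=0, min_value=0, max_value=9),
    RowGroupStats(index=1, min_value=10, max_value=19),
    RowGroupStats(index=2, min_value=20, max_value=29),
]
# A filter such as "x BETWEEN 12 AND 15" only needs row group 1.
print(prune_row_groups(stats, 12, 15))
```

<p>Because the decision uses only metadata, the skipped row groups never need to be read or decoded.</p>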
<p>API Changes</p>
<ul>
<li>DataFrame methods now take <code>Vec&lt;Expr&gt;</code> rather than <code>&amp;[Expr]</code>
</li>
<li>TableProvider now consistently uses <code>Arc&lt;TableProvider&gt;</code> rather than <code>Box&lt;TableProvider&gt;</code>
</li>
</ul>
<h3>Ballista</h3>
<p>Ballista was donated shortly before the Arrow 4.0.0 release, and there is no new release of Ballista as part of Arrow 4.0.0.</p>
</main>
</div>
<hr>
<footer class="footer">
<div class="row">
<div class="col-md-9">
<p>Apache Arrow, Arrow, Apache, the Apache logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.</p>
<p>© 2016-2025 The Apache Software Foundation</p>
</div>
<div class="col-md-3">
<a class="d-sm-none d-md-inline pr-2" href="https://www.apache.org/events/current-event.html" target="_blank" rel="noopener">
<img src="https://www.apache.org/events/current-event-234x60.png">
</a>
</div>
</div>
</footer>
</div>
</body>
</html>