<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://arrow.apache.org/" rel="alternate" type="text/html" /><updated>2024-04-29T17:30:49-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title type="html">Apache Arrow</title><subtitle>Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.</subtitle><entry><title type="html">Apache Arrow 16.0.0 Release</title><link href="https://arrow.apache.org/blog/2024/04/20/16.0.0-release/" rel="alternate" type="text/html" title="Apache Arrow 16.0.0 Release" /><published>2024-04-20T00:00:00-04:00</published><updated>2024-04-20T00:00:00-04:00</updated><id>https://arrow.apache.org/blog/2024/04/20/16.0.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/04/20/16.0.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 16.0.0 release. This covers
over 3 months of development work and includes <a href="https://github.com/apache/arrow/milestone/59?closed=1"><strong>385 resolved issues</strong></a>
on <a href="/release/16.0.0.html#contributors"><strong>586 distinct commits</strong></a> from <a href="/release/16.0.0.html#contributors"><strong>119 distinct contributors</strong></a>.
See the <a href="https://arrow.apache.org/install/">Install Page</a>
to learn how to get the libraries for your platform.</p>
<p>The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bugfixes and improvements have been made: we refer
you to the <a href="/release/16.0.0.html#changelog">complete changelog</a>.</p>
<h2 id="community">Community</h2>
<p>Since the 15.0.0 release, Jeffrey Vo, Jay Zhan, Bryce Mecum, Joel Lubinitsky,
and Sarah Gilmore have been invited to be committers.
No new members have joined the Project Management Committee (PMC).</p>
<p>Thanks for your contributions and participation in the project!</p>
<h2 id="c-data-interface-notes">C Data Interface notes</h2>
<ul>
<li>Added <code class="language-plaintext highlighter-rouge">RegisterDeviceMemoryManager</code> and <code class="language-plaintext highlighter-rouge">GetDeviceMemoryManager</code> for managing mappings from a device type and ID to a memory manager (<a href="https://github.com/apache/arrow/issues/40698">GH-40698</a>).</li>
<li>Added <code class="language-plaintext highlighter-rouge">RegisterCUDADevice</code> to register CUDA devices (<a href="https://github.com/apache/arrow/issues/40698">GH-40698</a>).</li>
<li>Added <code class="language-plaintext highlighter-rouge">ImportFromChunkedArray</code> and <code class="language-plaintext highlighter-rouge">ExportChunkedArray</code> for handling Chunked Arrays in the C Stream Interface (<a href="https://github.com/apache/arrow/issues/38717">GH-38717</a>).</li>
<li>Fixed an issue where string and nested types weren’t being correctly imported with DeviceArray (<a href="https://github.com/apache/arrow/issues/39769">GH-39769</a>).</li>
<li>Added support for copying Arrays and RecordBatches between memory types (<a href="https://github.com/apache/arrow/issues/39771">GH-39771</a>).</li>
</ul>
<h2 id="arrow-flight-rpc-notes">Arrow Flight RPC notes</h2>
<ul>
<li>Session variable RPCs were added (<a href="https://github.com/apache/arrow/issues/34865">GH-34865</a>)</li>
<li>Go: cookies can be copied to another connection to reuse existing credentials (<a href="https://github.com/apache/arrow/issues/39837">GH-39837</a>)</li>
<li>Go: enable PollFlightInfo for Flight SQL clients/servers (<a href="https://github.com/apache/arrow/issues/39574">GH-39574</a>)</li>
<li>Java: the JDBC driver now tries all locations the server sends it (<a href="https://github.com/apache/arrow/issues/38573">GH-38573</a>)</li>
<li>Java: tweak some options to give better performance (<a href="https://github.com/apache/arrow/issues/40745">GH-40745</a>, <a href="https://github.com/apache/arrow/issues/40039">GH-40039</a>)</li>
</ul>
<h2 id="c-notes">C++ notes</h2>
<p>For C++ notes refer to the full changelog.</p>
<h2 id="highlights">Highlights</h2>
<ul>
<li>Initial support for Azure Blob Storage has been added (<a href="https://github.com/apache/arrow/issues/18014">GH-18014</a>).</li>
<li>Arrow C++ can now be built with Emscripten (<a href="https://github.com/apache/arrow/pull/37821">GH-37821</a>), which lays the foundation for running Arrow C++ under WASM runtimes and eventually <a href="https://github.com/apache/arrow/pull/37822">PyArrow</a> as well.</li>
<li>Arrow’s filesystem modules have been separated into individual libraries, which enables writing and registering custom filesystem implementations (<a href="https://github.com/apache/arrow/issues/38309">GH-38309</a>).</li>
<li>Conversion from <code class="language-plaintext highlighter-rouge">Table</code> and <code class="language-plaintext highlighter-rouge">RecordBatch</code> to a <code class="language-plaintext highlighter-rouge">Tensor</code> (not the same as the
<a href="https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list">tensor extension array</a>)
is being developed. An umbrella issue has been created (<a href="https://github.com/apache/arrow/issues/40058">GH-40058</a>),
and the issues covering the <code class="language-plaintext highlighter-rouge">RecordBatch</code> conversion are included in this release
(<a href="https://github.com/apache/arrow/issues/40059">GH-40059</a>,
<a href="https://github.com/apache/arrow/issues/40357">GH-40357</a>,
<a href="https://github.com/apache/arrow/issues/40297">GH-40297</a>,
<a href="https://github.com/apache/arrow/issues/40060">GH-40060</a>,
<a href="https://github.com/apache/arrow/issues/40061">GH-40061</a> and
<a href="https://github.com/apache/arrow/issues/40866">GH-40866</a>), which means a <code class="language-plaintext highlighter-rouge">RecordBatch</code> can now be
converted to a column- or row-major two-dimensional structure.</li>
</ul>
<h2 id="compute">Compute</h2>
<h3 id="bug-fixes">Bug Fixes</h3>
<ul>
<li>Fixed a potential crash when accessing the <code class="language-plaintext highlighter-rouge">true_count</code> property on a BooleanArray (<a href="https://github.com/apache/arrow/issues/41016">GH-41016</a>).</li>
</ul>
<h3 id="performance-improvements">Performance improvements</h3>
<ul>
<li>Significantly improved performance of the take kernel on certain types of inputs (<a href="https://github.com/apache/arrow/issues/40207">GH-40207</a>).</li>
</ul>
<h3 id="enhancements">Enhancements</h3>
<ul>
<li>Support for casting to and from half-float (float16) has been added (<a href="https://github.com/apache/arrow/issues/20213">GH-20213</a>); a Python sketch follows this list.</li>
<li>Added support for residual predicates to Swiss Join implementation (<a href="https://github.com/apache/arrow/issues/20339">GH-20339</a>).</li>
<li>Expanded the primitive filter implementation to support all fixed-width primitive types, and the take implementation to support all well-known fixed-width types (<a href="https://github.com/apache/arrow/issues/39740">GH-39740</a>).</li>
<li>Added support for calling the <code class="language-plaintext highlighter-rouge">binary_slice</code> kernel on Fixed-Size Binary Arrays (<a href="https://github.com/apache/arrow/issues/39231">GH-39231</a>).</li>
<li>The cast kernel now supports casting from LargeString, Binary, and LargeBinary to Dictionary (<a href="https://github.com/apache/arrow/issues/39463">GH-39463</a>).</li>
<li>Fields of different decimal precision can now be used together in arithmetic operations without an explicit cast beforehand (<a href="https://github.com/apache/arrow/issues/40126">GH-40126</a>).</li>
</ul>
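<p>As a rough illustration from Python (a minimal sketch, assuming the new C++ kernels are reachable through the usual <code class="language-plaintext highlighter-rouge">pyarrow.compute.cast</code> entry point):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa
import pyarrow.compute as pc

# A float32 array with a null; cast down to half-float and back up.
arr = pa.array([1.5, 2.25, None], type=pa.float32())
half = pc.cast(arr, pa.float16())
restored = pc.cast(half, pa.float64())
</code></pre></div></div>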
<h2 id="datasets">Datasets</h2>
<ul>
<li>Improved backpressure handling in the Dataset Writer, which can significantly reduce memory usage for some use cases (<a href="https://github.com/apache/arrow/pull/40722">GH-40722</a>).</li>
</ul>
<h2 id="parquet">Parquet</h2>
<ul>
<li>Byte stream split encoding support has been added for FIXED_LEN_BYTE_ARRAY, INT32, and INT64, which enables this encoding for half-float (float16) and fixed-width decimal (<a href="https://github.com/apache/arrow/issues/39978">GH-39978</a>); see the sketch after this list.</li>
<li>Decoding boolean values has been made faster for a variety of cases (<a href="https://github.com/apache/arrow/issues/40872">GH-40872</a>).</li>
</ul>
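<p>From Python, the extended encoding can be requested through the existing <code class="language-plaintext highlighter-rouge">use_byte_stream_split</code> writer option. A minimal sketch (the file path and column names are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "ints": pa.array(range(1000), type=pa.int32()),
    "halves": pa.array(np.arange(1000, dtype=np.float16)),
})

# BYTE_STREAM_SPLIT applies to plain-encoded columns, so dictionary
# encoding is disabled for the selected columns.
pq.write_table(table, "data.parquet",
               use_dictionary=False,
               use_byte_stream_split=["ints", "halves"])
</code></pre></div></div>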
<h2 id="filesystems">Filesystems</h2>
<h3 id="new-features">New Features</h3>
<ul>
<li>In addition to building the individual filesystem implementations as separate modules, users can now write and register custom filesystem implementations (<a href="https://github.com/apache/arrow/issues/38309">GH-38309</a>).</li>
<li>A new environment variable, <code class="language-plaintext highlighter-rouge">AWS_ENDPOINT_URL_S3</code>, has been added, which allows overriding the endpoint for S3 operations alone (<a href="https://github.com/apache/arrow/issues/38663">GH-38663</a>); see the sketch after this list.</li>
</ul>
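<p>A minimal sketch of the new variable from Python, assuming a local MinIO-style endpoint (the URL and region are placeholders):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

# Must be set before the S3 subsystem is first initialized in the process.
os.environ["AWS_ENDPOINT_URL_S3"] = "http://localhost:9000"

from pyarrow import fs

# S3 traffic now goes to the overridden endpoint; other AWS services
# are unaffected.
s3 = fs.S3FileSystem(region="us-east-1")
</code></pre></div></div>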
<h3 id="bug-fixes-1">Bug Fixes</h3>
<ul>
<li>Fixed a bug in the S3 filesystem implementation that could cause a crash when deleting an object having duplicate forward slashes in its name (<a href="https://github.com/apache/arrow/issues/38821">GH-38821</a>).</li>
<li>Fixed a bug where <code class="language-plaintext highlighter-rouge">hash_mean</code> could silently overflow (<a href="https://github.com/apache/arrow/issues/38833">GH-38833</a>).</li>
</ul>
<h3 id="improvements">Improvements</h3>
<ul>
<li>The S3 implementation now sets the content-type of directory-like objects to application/x-directory to improve compatibility with other S3 tools (<a href="https://github.com/apache/arrow/issues/38794">GH-38794</a>).</li>
<li>Repeated S3Client initialization is now roughly an order of magnitude faster (<a href="https://github.com/apache/arrow/pull/40299">GH-40299</a>).</li>
<li>The MemoryPoolStats implementation has been reworked to re-order loads and stores, which may improve performance for some allocation-heavy, multi-threaded applications (<a href="https://github.com/apache/arrow/issues/40783">GH-40783</a>).</li>
</ul>
<h3 id="substrait">Substrait</h3>
<ul>
<li>Support has been added to Substrait for a variety of Arrow types (<a href="https://github.com/apache/arrow/issues/40695">GH-40695</a>).</li>
<li>substrait-cpp has been upgraded to 0.44 (<a href="https://github.com/apache/arrow/issues/40695">GH-40695</a>).</li>
</ul>
<h2 id="development">Development</h2>
<ul>
<li>Added support for the mold and lld linkers for building Arrow C++ (<a href="https://github.com/apache/arrow/issues/40394">GH-40394</a>, <a href="https://github.com/apache/arrow/issues/40400">GH-40400</a>).</li>
</ul>
<h3 id="miscellaneous">Miscellaneous</h3>
<ul>
<li>Upgraded ORC to 2.0.0 (<a href="https://github.com/apache/arrow/issues/40507">GH-40507</a>).</li>
<li>Upgraded zstd to 1.5.6 (<a href="https://github.com/apache/arrow/pull/40837">GH-40837</a>).</li>
<li>Upgraded google benchmark to 1.8.3 (<a href="https://github.com/apache/arrow/issues/39863">GH-39863</a>).</li>
<li>Upgraded zlib to 1.3.1 (<a href="https://github.com/apache/arrow/issues/39876">GH-39876</a>).</li>
<li>Various ToString methods now support an optional <code class="language-plaintext highlighter-rouge">show_metadata</code> argument, which will print metadata that may exist in nested types (<a href="https://github.com/apache/arrow/issues/39864">GH-39864</a>).</li>
</ul>
<h2 id="c-notes-1">C# notes</h2>
<ul>
<li>IPC record batch compression has been implemented (<a href="https://github.com/apache/arrow/issues/24834">GH-24834</a>)</li>
<li>Optional materialization of C# string arrays is now supported (<a href="https://github.com/apache/arrow/issues/41047">GH-41047</a>)</li>
<li>A memory leak in the C Data Interface has been fixed (<a href="https://github.com/apache/arrow/issues/40898">GH-40898</a>)</li>
<li>Various other bug fixes and improvements.</li>
</ul>
<h2 id="go-notes">Go Notes</h2>
<ul>
<li>The Go Arrow and Parquet libraries now require Go 1.21+ (<a href="https://github.com/apache/arrow/issues/40733">GH-40733</a>)</li>
</ul>
<h3 id="bug-fixes-2">Bug Fixes</h3>
<h4 id="arrow">Arrow</h4>
<ul>
<li>The FlightSQL driver now properly handles concurrent result sets instead of pulling the entire result into memory (<a href="https://github.com/apache/arrow/issues/40089">GH-40089</a>)</li>
<li>The FlightSQL driver now correctly respects the <code class="language-plaintext highlighter-rouge">DriverConfig.TLSEnabled</code> field (<a href="https://github.com/apache/arrow/issues/40097">GH-40097</a>)</li>
<li>Fixed a panic on 32-bit architectures (<a href="https://github.com/apache/arrow/issues/40672">GH-40672</a>)</li>
<li>Corrected a precision loss for Decimal types when converting to JSON (<a href="https://github.com/apache/arrow/issues/40693">GH-40693</a>)</li>
<li>Fixed an issue with <code class="language-plaintext highlighter-rouge">array.RecordBuilder</code> when using a NullType column (<a href="https://github.com/apache/arrow/issues/40719">GH-40719</a>)</li>
</ul>
<h4 id="parquet-1">Parquet</h4>
<ul>
<li>Fixed panic when writing a DeltaBinaryPacked column containing only nulls (<a href="https://github.com/apache/arrow/issues/35718">GH-35718</a>)</li>
<li>Fixed a panic when writing a ListOf(DeltaBinaryPacked) field with no data (<a href="https://github.com/apache/arrow/issues/39309">GH-39309</a>)</li>
<li>Arrow DATE64 types are now properly coerced to the 32-bit Parquet DATE logical type (<a href="https://github.com/apache/arrow/issues/39456">GH-39456</a>)</li>
<li>Fixed the timezone semantics for timestamp conversion from Arrow to Parquet (<a href="https://github.com/apache/arrow/issues/39466">GH-39466</a>)</li>
<li>Corrected an inaccuracy in <code class="language-plaintext highlighter-rouge">RowGroupTotalCompressedBytes</code> and <code class="language-plaintext highlighter-rouge">RowGroupTotalBytesWritten</code> for the Parquet file writer (<a href="https://github.com/apache/arrow/issues/39870">GH-39870</a>)</li>
<li>Fixed the <code class="language-plaintext highlighter-rouge">TotalCompressedBytes</code> count when falling back to plain encoding if a dictionary is too large (<a href="https://github.com/apache/arrow/issues/39921">GH-39921</a>)</li>
<li>Fixed a bug when reslicing a nullable dictionary in the chunked writer (<a href="https://github.com/apache/arrow/issues/39925">GH-39925</a>)</li>
</ul>
<h3 id="enhancements-1">Enhancements</h3>
<h4 id="arrow-1">Arrow</h4>
<ul>
<li>Users can now access the underlying <code class="language-plaintext highlighter-rouge">MemoTable</code> of a dictionary builder (<a href="https://github.com/apache/arrow/issues/38988">GH-38988</a>)</li>
<li>Added an option to provide a string replacer for CSV writing (<a href="https://github.com/apache/arrow/issues/39552">GH-39552</a>)</li>
<li>Flight: Cookies can be copied to another connection to reuse existing credentials (<a href="https://github.com/apache/arrow/issues/39837">GH-39837</a>)</li>
<li>Flight: enable PollFlightInfo for Flight SQL clients/servers (<a href="https://github.com/apache/arrow/issues/39574">GH-39574</a>)</li>
<li>Added the ability to create a PreparedStatement from persisted data and provided access for FlightSQL users to the PreparedStatement handle property (<a href="https://github.com/apache/arrow/issues/39774">GH-39774</a>, <a href="https://github.com/apache/arrow/issues/39910">GH-39910</a>)</li>
<li>FlightRPC Session management extensions have been implemented (<a href="https://github.com/apache/arrow/issues/40155">GH-40155</a>)</li>
</ul>
<h4 id="parquet-2">Parquet</h4>
<ul>
<li>New compression codecs can now be registered for Parquet (<a href="https://github.com/apache/arrow/issues/40113">GH-40113</a>)</li>
<li>Parquet footers can be incrementally written without closing the file (<a href="https://github.com/apache/arrow/issues/40630">GH-40630</a>)</li>
</ul>
<h2 id="java-notes">Java notes</h2>
<ul>
<li>A breaking change to support Java 9 modules has been implemented in this release (<a href="https://github.com/apache/arrow/issues/39001">GH-39001</a>)</li>
<li>A new Float16 type has been added (<a href="https://github.com/apache/arrow/issues/39680">GH-39680</a>)</li>
<li>Java 22 is now supported (<a href="https://github.com/apache/arrow/issues/40680">GH-40680</a>)</li>
<li>Various bug fixes and improvements.</li>
</ul>
<h2 id="javascript-notes">JavaScript notes</h2>
<ul>
<li>Dates are now stored as TimestampMillisecond
(<a href="https://github.com/apache/arrow/pull/40892">GH-40892</a>)</li>
<li>Vectors created from typed arrays are now correctly marked as non-nullable, and
null counts are now correct
(<a href="https://github.com/apache/arrow/pull/40852">GH-40852</a>)</li>
</ul>
<h2 id="python-notes">Python notes</h2>
<p>Compatibility notes:</p>
<ul>
<li>The umbrella issue for ensuring PyArrow compatibility with NumPy 2.0 (<a href="https://github.com/apache/arrow/issues/39532">GH-39532</a>) has been closed, with the last remaining issues included in the 16.0.0 release (<a href="https://github.com/apache/arrow/issues/41098">GH-41098</a>, <a href="https://github.com/apache/arrow/issues/39848">GH-39848</a> and <a href="https://github.com/apache/arrow/issues/40376">GH-40376</a>).</li>
<li>PyArrow no longer uses pandas internals to create Block objects and instead uses the new pandas API when running with pandas version 3 (<a href="https://github.com/apache/arrow/issues/35081">GH-35081</a>)</li>
<li>The pandas compatibility code has been simplified, as old pandas and Python versions are no longer supported (<a href="https://github.com/apache/arrow/issues/40720">GH-40720</a>)</li>
<li>The deprecated <code class="language-plaintext highlighter-rouge">pyarrow.filesystem</code> legacy implementations have been removed (<a href="https://github.com/apache/arrow/issues/20127">GH-20127</a>)</li>
</ul>
<p>New features:</p>
<ul>
<li>Converting Arrow <code class="language-plaintext highlighter-rouge">Table</code> and <code class="language-plaintext highlighter-rouge">RecordBatch</code> to a <code class="language-plaintext highlighter-rouge">Tensor</code> (not the same as the <a href="https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#official-list">tensor extension array</a>) is being developed in Arrow C++ with bindings in Python (umbrella issue: <a href="https://github.com/apache/arrow/issues/40058">GH-40058</a>). This release adds <code class="language-plaintext highlighter-rouge">pyarrow.RecordBatch.to_tensor(...)</code>, which returns a row- or column-major tensor with an option to write missing values as <code class="language-plaintext highlighter-rouge">NaN</code> in the result; see the sketch after this list.</li>
<li><code class="language-plaintext highlighter-rouge">ListView</code> and <code class="language-plaintext highlighter-rouge">LargeListView</code> array formats are now supported by PyArrow (<a href="https://github.com/apache/arrow/issues/39812">GH-39812</a>, <a href="https://github.com/apache/arrow/issues/39855">GH-39855</a>, <a href="https://github.com/apache/arrow/issues/40205">GH-40205</a>, <a href="https://github.com/apache/arrow/issues/41039">GH-41039</a>, <a href="https://github.com/apache/arrow/issues/40266">GH-40266</a>)</li>
<li><code class="language-plaintext highlighter-rouge">BinaryView</code> and <code class="language-plaintext highlighter-rouge">StringView</code> are now supported in PyArrow (<a href="https://github.com/apache/arrow/issues/39651">GH-39651</a>, <a href="https://github.com/apache/arrow/issues/39852">GH-39852</a>, <a href="https://github.com/apache/arrow/issues/40092">GH-40092</a>)</li>
<li>Final support for Run-End Encoded arrays has been included in PyArrow (conversion to NumPy and pandas: <a href="https://github.com/apache/arrow/issues/40659">GH-40659</a>; construction in <code class="language-plaintext highlighter-rouge">pa.array(...)</code>: <a href="https://github.com/apache/arrow/issues/40273">GH-40273</a>)</li>
<li>The <code class="language-plaintext highlighter-rouge">AsofJoinNode</code> C++ functionality is now exposed in Python as <code class="language-plaintext highlighter-rouge">join_asof</code> (<a href="https://github.com/apache/arrow/issues/34235">GH-34235</a>)</li>
<li>Minimal Python bindings have been added for <code class="language-plaintext highlighter-rouge">AzureFileSystem</code> (<a href="https://github.com/apache/arrow/issues/39968">GH-39968</a>)</li>
<li>A <code class="language-plaintext highlighter-rouge">FixedSizeTensorScalar</code> class has been added (<a href="https://github.com/apache/arrow/issues/37484">GH-37484</a>)</li>
</ul>
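<p>A minimal sketch of the new conversion (the <code class="language-plaintext highlighter-rouge">null_to_nan</code> keyword follows the naming used in the linked issues and may differ in details):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa

batch = pa.record_batch({
    "x": pa.array([1.0, 2.0, None]),
    "y": pa.array([4.0, 5.0, 6.0]),
})

# Nulls become NaN in the resulting two-dimensional tensor.
tensor = batch.to_tensor(null_to_nan=True)
print(tensor.to_numpy())
</code></pre></div></div>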
<p>Other improvements:</p>
<ul>
<li>Added <code class="language-plaintext highlighter-rouge">ChunkedArray</code> import/export to/from C (<a href="https://github.com/apache/arrow/issues/39984">GH-39984</a>)</li>
<li><code class="language-plaintext highlighter-rouge">pyarrow.Field</code> and <code class="language-plaintext highlighter-rouge">pyarrow.ChunkedArray</code> can now be constructed from objects supporting the PyCapsule Arrow C Data Interface (<a href="https://github.com/apache/arrow/issues/38010">GH-38010</a>); see the sketch after this list</li>
<li>The <code class="language-plaintext highlighter-rouge">requested_schema</code> argument is now supported in <code class="language-plaintext highlighter-rouge">__arrow_c_stream__</code> implementations (<a href="https://github.com/apache/arrow/issues/40066">GH-40066</a>)</li>
<li>Added low-level bindings for exporting/importing the C Device Interface (<a href="https://github.com/apache/arrow/issues/39979">GH-39979</a>)</li>
<li>Added a function to download and extract the timezone database on Windows (<a href="https://github.com/apache/arrow/issues/37328">GH-37328</a>)</li>
<li>Missing methods have been added to <code class="language-plaintext highlighter-rouge">pyarrow.RecordBatch</code> (<a href="https://github.com/apache/arrow/issues/30915">GH-30915</a>)</li>
<li>A Python dict is now also accepted by the <code class="language-plaintext highlighter-rouge">pyarrow.record_batch</code> factory function (as in <code class="language-plaintext highlighter-rouge">pyarrow.table</code>) (<a href="https://github.com/apache/arrow/issues/40291">GH-40291</a>)</li>
<li>Usage of the scalar legacy cast has been removed (<a href="https://github.com/apache/arrow/issues/40023">GH-40023</a>)</li>
<li>The missing <code class="language-plaintext highlighter-rouge">byte_width</code> attribute has been added to all DataType classes (<a href="https://github.com/apache/arrow/issues/39277">GH-39277</a>)</li>
<li><code class="language-plaintext highlighter-rouge">FileInfo</code> instances can now be used to construct Dataset objects (<a href="https://github.com/apache/arrow/issues/40142">GH-40142</a>)</li>
<li>Hashing is now supported for <code class="language-plaintext highlighter-rouge">FileMetaData</code> and <code class="language-plaintext highlighter-rouge">ParquetSchema</code> (<a href="https://github.com/apache/arrow/issues/39780">GH-39780</a>)</li>
<li><code class="language-plaintext highlighter-rouge">force_virtual_addressing</code> is now exposed in PyArrow (<a href="https://github.com/apache/arrow/issues/39779">GH-39779</a>)</li>
</ul>
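<p>For instance, any object implementing the Arrow PyCapsule stream protocol can now feed <code class="language-plaintext highlighter-rouge">pyarrow.chunked_array</code> directly. A sketch (<code class="language-plaintext highlighter-rouge">Wrapper</code> is a hypothetical stand-in for a third-party object):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa

class Wrapper:
    # Stand-in for a third-party object exposing the Arrow PyCapsule
    # C stream protocol.
    def __init__(self, chunked):
        self._chunked = chunked

    def __arrow_c_stream__(self, requested_schema=None):
        return self._chunked.__arrow_c_stream__(requested_schema)

ca = pa.chunked_array([[1, 2], [3, None]])
roundtripped = pa.chunked_array(Wrapper(ca))
assert roundtripped.equals(ca)
</code></pre></div></div>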
<p>Relevant bug fixes:</p>
<ul>
<li>Calling <code class="language-plaintext highlighter-rouge">pyarrow.dataset.ParquetFileFormat.make_write_options</code> as a class method now emits a warning (<a href="https://github.com/apache/arrow/issues/39440">GH-39440</a>)</li>
<li><code class="language-plaintext highlighter-rouge">ScalarMemoTable</code> is now instantiated only when deduplication is enabled, which fixes excessive memory consumption when it is disabled (<a href="https://github.com/apache/arrow/issues/40316">GH-40316</a>)</li>
<li>Slicing an array backwards beyond the start now includes the first item (<a href="https://github.com/apache/arrow/issues/38768">GH-38768</a> and <a href="https://github.com/apache/arrow/issues/40642">GH-40642</a>)</li>
<li>Fixed memory leaks when creating an Arrow array from a Python list of dicts (<a href="https://github.com/apache/arrow/issues/37989">GH-37989</a>)</li>
<li><code class="language-plaintext highlighter-rouge">FixedSizeListType</code> was previously not considered a nested type and has now been added to <code class="language-plaintext highlighter-rouge">_NESTED_TYPES</code> (<a href="https://github.com/apache/arrow/issues/40171">GH-40171</a>)</li>
<li><code class="language-plaintext highlighter-rouge">max_chunksize</code> is now validated in <code class="language-plaintext highlighter-rouge">Table.to_batches</code> (<a href="https://github.com/apache/arrow/issues/39788">GH-39788</a>); see the sketch after this list</li>
<li>Fixed raising <code class="language-plaintext highlighter-rouge">ValueError</code> on <code class="language-plaintext highlighter-rouge">_ensure_partitioning</code> in Dataset (<a href="https://github.com/apache/arrow/issues/39579">GH-39579</a>)</li>
<li>The Python stacktrace is now attached to errors in <code class="language-plaintext highlighter-rouge">ConvertPyError</code> (<a href="https://github.com/apache/arrow/issues/37164">GH-37164</a>)</li>
</ul>
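<p>For example, the <code class="language-plaintext highlighter-rouge">max_chunksize</code> check now rejects non-positive values up front (a sketch; the exact exception type is assumed to be <code class="language-plaintext highlighter-rouge">ValueError</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa

table = pa.table({"a": list(range(10))})
batches = table.to_batches(max_chunksize=4)   # at most 4 rows per batch

try:
    table.to_batches(max_chunksize=0)         # now rejected up front
except ValueError as exc:
    print(exc)
</code></pre></div></div>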
<h2 id="r-notes">R notes</h2>
<ul>
<li>Arrow IPC streams (i.e., <code class="language-plaintext highlighter-rouge">write_ipc_stream</code>) can now be written to socket
connections (<a href="https://github.com/apache/arrow/pull/38897">GH-38897</a>)</li>
<li>The <code class="language-plaintext highlighter-rouge">print()</code> output for <code class="language-plaintext highlighter-rouge">Dataset</code> and <code class="language-plaintext highlighter-rouge">Table</code> objects has been improved so it
now shows dimensions and truncates its output in the case of wide schemas
(<a href="https://github.com/apache/arrow/pull/38917">GH-38917</a>)</li>
<li>Various improvements and fixes to documentation, package build, and CI systems</li>
</ul>
<p>For more on what’s in the 16.0.0 R package, see the <a href="/docs/r/news/">R changelog</a>.</p>
<h2 id="ruby-and-c-glib-notes">Ruby and C GLib notes</h2>
<h3 id="ruby">Ruby</h3>
<ul>
<li>Added support for customizing timestamp parsers.
<a href="https://github.com/apache/arrow/issues/40590">GH-40590</a></li>
</ul>
<h3 id="c-glib">C GLib</h3>
<ul>
<li>Added support for time zone in <code class="language-plaintext highlighter-rouge">GArrowTimestampDataType</code>.
<a href="https://github.com/apache/arrow/issues/39702">GH-39702</a></li>
<li>Added missing compute function options.
<a href="https://github.com/apache/arrow/issues/40402">GH-40402</a>
<ul>
<li><code class="language-plaintext highlighter-rouge">GArrowSplitPatternOptions</code></li>
<li><code class="language-plaintext highlighter-rouge">GArrowStrftimeOptions</code></li>
<li><code class="language-plaintext highlighter-rouge">GArrowStrptimeOptions</code></li>
<li><code class="language-plaintext highlighter-rouge">GArrowStructFieldOptions</code></li>
</ul>
</li>
<li>Changed the documentation generator from GTK-Doc to GI-DocGen.
<a href="https://github.com/apache/arrow/issues/39935">GH-39935</a></li>
<li>Added <code class="language-plaintext highlighter-rouge">GArrowTimestampParser</code>.
<a href="https://github.com/apache/arrow/issues/40438">GH-40438</a></li>
<li>Added support for customizing timestamp parsers.
<a href="https://github.com/apache/arrow/issues/40590">GH-40590</a></li>
</ul>
<h2 id="rust-notes">Rust notes</h2>
<p>The Rust projects have moved to separate repositories outside the
main Arrow monorepo. For notes on the latest release of the Rust
implementation, see the latest <a href="https://github.com/apache/arrow-rs/tags">Arrow Rust changelog</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 16.0.0 release. This covers over 3 months of development work and includes 385 resolved issues on 586 distinct commits from 119 distinct contributors.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow ADBC 0.11.0 (Libraries) Release</title><link href="https://arrow.apache.org/blog/2024/03/31/adbc-0.11.0-release/" rel="alternate" type="text/html" title="Apache Arrow ADBC 0.11.0 (Libraries) Release" /><published>2024-03-31T00:00:00-04:00</published><updated>2024-03-31T00:00:00-04:00</updated><id>https://arrow.apache.org/blog/2024/03/31/adbc-0.11.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/03/31/adbc-0.11.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 0.11.0 release of
the Apache Arrow ADBC libraries. This release includes <a href="https://github.com/apache/arrow-adbc/milestone/15"><strong>36
resolved issues</strong></a> from <a href="#contributors"><strong>11 distinct contributors</strong></a>.</p>
<p>This is a release of the <strong>libraries</strong>, which are at version
0.11.0. The <strong>API specification</strong> is versioned separately and is
at version 1.1.0.</p>
<p>The release notes below are not exhaustive and only expose selected
highlights of the release. Many other bugfixes and improvements have
been made: we refer you to the <a href="https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.11.0/CHANGELOG.md">complete changelog</a>.</p>
<h2 id="release-highlights">Release Highlights</h2>
<p>This release includes <a href="https://www.nuget.org/packages?q=apache.arrow.adbc">NuGet packages</a> for C#.</p>
<p>The Flight SQL driver supports the session options and reuse-connection URI
scheme recently added to the protocol.</p>
<p>Go packages now require Go 1.21 or later, as Go 1.20 is out of support. The
Go drivers now use a common driver framework to make future maintenance
easier.</p>
<p>Python wheels now include debug info to help investigate bug reports. Also,
users of the DBAPI layer will find that the driver properly reacts to
SIGINT/Control+C in more places.</p>
<p>The Snowflake driver now returns table constraints metadata.</p>
<p>The SQLite driver now supports temporary tables and more ingestion options.</p>
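<p>A minimal sketch of temporary-table ingestion through the DBAPI layer, assuming the <code class="language-plaintext highlighter-rouge">temporary</code> flag on <code class="language-plaintext highlighter-rouge">adbc_ingest</code> that accompanies the 1.1.0 API (table and column names are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa
import adbc_driver_sqlite.dbapi as dbapi

data = pa.table({"id": [1, 2], "name": ["a", "b"]})

with dbapi.connect() as conn:
    with conn.cursor() as cur:
        # Ingest into a connection-scoped temporary table.
        cur.adbc_ingest("scratch", data, mode="create", temporary=True)
        cur.execute("SELECT COUNT(*) FROM scratch")
        print(cur.fetchone())
</code></pre></div></div>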
<h2 id="contributors">Contributors</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-0.10.0..apache-arrow-adbc-0.11.0
39 David Li
3 Matt Topol
2 Dewey Dunnington
2 davidhcoe
1 Adnan Khan
1 Bruce Irschick
1 Joel Lubinitsky
1 Julian Brandrick
1 Ruoxuan Wang
1 Ryan Syed
1 vleslief-ms
</code></pre></div></div>
<h2 id="roadmap">Roadmap</h2>
<p>We plan for the next release to be 1.0.0. We aim to have this out in late May
2024.</p>
<h2 id="getting-involved">Getting Involved</h2>
<p>We welcome questions and contributions from all interested. Issues
can be filed on <a href="https://github.com/apache/arrow-adbc/issues">GitHub</a>, and questions can be directed to GitHub
or the <a href="/community/">Arrow mailing lists</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 0.11.0 release of the Apache Arrow ADBC libraries. This release includes 36 resolved issues from 11 distinct contributors.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow 15.0.2 Release</title><link href="https://arrow.apache.org/blog/2024/03/18/15.0.2-release/" rel="alternate" type="text/html" title="Apache Arrow 15.0.2 Release" /><published>2024-03-18T00:00:00-04:00</published><updated>2024-03-18T00:00:00-04:00</updated><id>https://arrow.apache.org/blog/2024/03/18/15.0.2-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/03/18/15.0.2-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 15.0.2 release.
This is mostly a bugfix release that includes <a href="https://github.com/apache/arrow/milestone/61?closed=1"><strong>8 resolved issues</strong></a>
from <a href="/release/15.0.2.html#contributors"><strong>7 distinct contributors</strong></a>. See the <a href="https://arrow.apache.org/install/">Install Page</a> to learn how to
get the libraries for your platform.</p>
<p>The release notes below are not exhaustive and only expose selected highlights
of the release. Other bugfixes and improvements have been made: we refer
you to the <a href="/release/15.0.2.html#changelog">complete changelog</a>.</p>
<h2 id="c-notes">C++ notes</h2>
<p>Several bug fixes, please see the full changelog for details.</p>
<p>Arrow v15.0.1 introduced a breaking ABI change due to the inclusion of <a href="https://github.com/apache/arrow/issues/39865">GH-39865</a>.
This was reported and is a known issue (<a href="https://github.com/apache/arrow/issues/40604">GH-40604</a>).</p>
<h2 id="java-notes">Java notes</h2>
<p>Fixed a regression in Arrow Java v15.0.0 affecting the arrow-dataset module on Linux. The Protobuf library dependency was not statically linked properly, resulting in an undefined symbol. The regression caused the program to crash at runtime. See <a href="https://github.com/apache/arrow/issues/39919">GH-39919</a> for more details.</p>
<h2 id="python-notes">Python notes</h2>
<ul>
<li>Fix failure in building pyarrow when using the latest Cython release (<a href="https://github.com/apache/arrow/issues/40386">GH-40386</a>)</li>
</ul>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 15.0.2 release. This is mostly a bugfix release that includes 8 resolved issues from 7 distinct contributors.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow 15.0.1 Release</title><link href="https://arrow.apache.org/blog/2024/03/07/15.0.1-release/" rel="alternate" type="text/html" title="Apache Arrow 15.0.1 Release" /><published>2024-03-07T00:00:00-05:00</published><updated>2024-03-07T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/03/07/15.0.1-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/03/07/15.0.1-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 15.0.1 release.
This is mostly a bugfix release that includes <a href="https://github.com/apache/arrow/milestone/60?closed=1"><strong>42 resolved issues</strong></a>
from <a href="/release/15.0.1.html#contributors"><strong>18 distinct contributors</strong></a>. See the <a href="https://arrow.apache.org/install/">Install Page</a> to learn how to
get the libraries for your platform.</p>
<p>The release notes below are not exhaustive and only expose selected highlights
of the release. Other bugfixes and improvements have been made: we refer
you to the <a href="/release/15.0.1.html#changelog">complete changelog</a>.</p>
<h2 id="c-notes">C++ notes</h2>
<p>Several bug fixes, please see the full changelog for details.</p>
<p>With this patch release we introduced a breaking ABI change due to the inclusion of <a href="https://github.com/apache/arrow/issues/39865">GH-39865</a>.
This was reported and is a known issue (<a href="https://github.com/apache/arrow/issues/40604">GH-40604</a>).</p>
<h2 id="python-notes">Python notes</h2>
<ul>
<li>Fix race condition with concurrent invocation of <code class="language-plaintext highlighter-rouge">_pandas_api.is_data_frame(df)</code> (<a href="https://github.com/apache/arrow/issues/39313">GH-39313</a>)</li>
<li>Fix leaking references to Numpy dtypes (<a href="https://github.com/apache/arrow/issues/39599">GH-39599</a>)</li>
<li>Fix except clauses in order to be compatible with Cython 3.0.9 (<a href="https://github.com/apache/arrow/issues/40386">GH-40386</a>)</li>
<li>Fix interpreter deadlock when using <code class="language-plaintext highlighter-rouge">GeneratorStream</code> (<a href="https://github.com/apache/arrow/issues/40004">GH-40004</a>)</li>
</ul>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 15.0.1 release. This is mostly a bugfix release that includes 42 resolved issues from 18 distinct contributors.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing Apache Arrow DataFusion Comet</title><link href="https://arrow.apache.org/blog/2024/03/06/comet-donation/" rel="alternate" type="text/html" title="Announcing Apache Arrow DataFusion Comet" /><published>2024-03-06T00:00:00-05:00</published><updated>2024-03-06T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/03/06/comet-donation</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/03/06/comet-donation/"><![CDATA[<!--
-->
<h1 id="introduction">Introduction</h1>
<p>The Apache Arrow PMC is pleased to announce the donation of the <a href="https://github.com/apache/arrow-datafusion-comet">Comet project</a>,
a native Spark SQL Accelerator built on <a href="https://arrow.apache.org/datafusion">Apache Arrow DataFusion</a>.</p>
<p>Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to
accelerate Spark workloads. It is designed as a drop-in
replacement for Spark’s JVM-based SQL execution engine and offers significant
performance improvements for some workloads, as shown below.</p>
<figure style="text-align: center;">
<img src="/img/datafusion-comet/comet-architecture.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview." />
<figcaption>
<b>Figure 1</b>: With Comet, users interact with the same Spark ecosystem, tools
and APIs such as Spark SQL. Queries still run through Spark's query optimizer and planner.
However, the execution is delegated to Comet,
which is significantly faster and more resource-efficient than a JVM-based
implementation.
</figcaption>
</figure>
<p>Comet is one of a growing class of projects that aim to accelerate Spark using
native columnar engines such as the proprietary <a href="https://www.databricks.com/product/photon">Databricks Photon Engine</a> and
open source projects <a href="https://incubator.apache.org/projects/gluten.html">Gluten</a>, <a href="https://github.com/NVIDIA/spark-rapids">Spark RAPIDS</a>, and <a href="https://github.com/kwai/blaze">Blaze</a> (also built using
DataFusion).</p>
<p>Comet was originally implemented at Apple and the engineers who worked on the
project are also significant contributors to Arrow and DataFusion. Bringing
Comet into the Apache Software Foundation will accelerate its development and
grow its community of contributors and users.</p>
<h1 id="get-involved">Get Involved</h1>
<p>Comet is still in the early stages of development and we would love to have you
join us and help shape the project. We are working on an initial release, and
expect to post another update with more details at that time.</p>
<p>Before then, here are some ways to get involved:</p>
<ul>
<li>
<p>Learn more by visiting the <a href="https://github.com/apache/arrow-datafusion-comet">Comet project</a> page and reading the <a href="https://lists.apache.org/thread/0q1rb11jtpopc7vt1ffdzro0omblsh0s">mailing list
discussion</a> about the initial donation.</p>
</li>
<li>
<p>Help us plan out the <a href="https://github.com/apache/arrow-datafusion-comet/issues/19">roadmap</a></p>
</li>
<li>
<p>Try out the project and provide feedback, file issues, and contribute code.</p>
</li>
</ul>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow PMC is pleased to announce the donation of the Comet project, a native Spark SQL accelerator built on Apache Arrow DataFusion.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow ADBC 0.10.0 (Libraries) Release</title><link href="https://arrow.apache.org/blog/2024/02/22/adbc-0.10.0-release/" rel="alternate" type="text/html" title="Apache Arrow ADBC 0.10.0 (Libraries) Release" /><published>2024-02-22T00:00:00-05:00</published><updated>2024-02-22T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/02/22/adbc-0.10.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/02/22/adbc-0.10.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 0.10.0 release of
the Apache Arrow ADBC libraries. This release includes <a href="https://github.com/apache/arrow-adbc/milestone/14"><strong>31
resolved issues</strong></a> from <a href="#contributors"><strong>18 distinct contributors</strong></a>.</p>
<p>This is a release of the <strong>libraries</strong>, which are at version
0.10.0. The <strong>API specification</strong> is versioned separately and is
at version 1.1.0.</p>
<p>The release notes below are not exhaustive and only expose selected
highlights of the release. Many other bugfixes and improvements have
been made: we refer you to the <a href="https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.10.0/CHANGELOG.md">complete changelog</a>.</p>
<h2 id="release-highlights">Release Highlights</h2>
<p>The BigQuery driver now handles large result sets, and supports passing a
scope when authenticating. It also has better support for ARRAY types.</p>
<p>The C++ implementation now requires C++17 or later.</p>
<p>The Flight SQL driver now supports the incremental execution feature with
Flight SQL services that implement PollFlightInfo. Also, it will reuse
credentials/cookies when creating sub-clients to fetch data.</p>
<p>The Go libraries now expose a <code class="language-plaintext highlighter-rouge">Close</code> method on <code class="language-plaintext highlighter-rouge">AdbcDatabase</code> structs; this
is a potentially breaking change.</p>
<p>The Java implementation now has Checker Framework nullness annotations. The
driver manager interface was overhauled; this is a breaking change to the
library APIs. The Java Flight SQL driver now supports <code class="language-plaintext highlighter-rouge">getObjects</code>.</p>
<p>The PostgreSQL driver will add the PostgreSQL type name to the field metadata
of columns of NUMERIC type. Also, it now handles ENUM types.</p>
<p>The Python bindings now return the underlying PyArrow <code class="language-plaintext highlighter-rouge">RecordBatchReader</code> when
requested, instead of the “wrapped” reader, to avoid a crash. This means that
callers will not get ADBC-wrapped exceptions from the reader. Also, Ctrl-C
will now interrupt ADBC operations on the main thread.</p>
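<p>As a hedged illustration (not taken from the release notes themselves), the sketch
below uses the SQLite driver to show where this change is visible:
<code class="language-plaintext highlighter-rouge">fetch_record_batch()</code> hands back the underlying
<code class="language-plaintext highlighter-rouge">pyarrow.RecordBatchReader</code> directly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch, assuming the adbc_driver_sqlite package is installed.
# fetch_record_batch() returns the underlying pyarrow.RecordBatchReader, so
# errors raised while iterating are PyArrow errors, not ADBC-wrapped ones.
import adbc_driver_sqlite.dbapi

conn = adbc_driver_sqlite.dbapi.connect(":memory:")
try:
    cur = conn.cursor()
    cur.execute("SELECT 1 AS a UNION ALL SELECT 2 AS a")
    reader = cur.fetch_record_batch()  # pyarrow.RecordBatchReader
    for batch in reader:
        print(batch.num_rows, batch.schema)
finally:
    conn.close()
</code></pre></div></div>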
<p>The Snowflake driver now has significantly faster bulk ingestion: it
uploads Parquet files instead of using bind parameters (see the sketch below).</p>
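<p>For a sense of the ingestion path, here is a minimal, hedged sketch; the
connection URI and table name are hypothetical, and
<code class="language-plaintext highlighter-rouge">adbc_ingest()</code> is the generic ADBC bulk-ingestion
entry point rather than anything Snowflake-specific.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch of bulk ingestion (hypothetical connection URI).
# cursor.adbc_ingest() is the bulk path that benefits from the driver's
# new Parquet-based upload.
import pyarrow as pa
import adbc_driver_snowflake.dbapi

data = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
conn = adbc_driver_snowflake.dbapi.connect("user:pass@account/db/schema")
try:
    cur = conn.cursor()
    cur.adbc_ingest("my_table", data, mode="create")  # hypothetical table
    conn.commit()
finally:
    conn.close()
</code></pre></div></div>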
<h2 id="contributors">Contributors</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-0.9.0..apache-arrow-adbc-0.10.0
22 David Li
6 Dewey Dunnington
4 James Duong
4 Sutou Kouhei
4 William Ayd
4 davidhcoe
3 Matt Topol
2 Bruce Irschick
2 Joel Lubinitsky
2 Lubo Slivka
2 Soumya D. Sanyal
1 Anton Levakin
1 Bryce Mecum
1 Curt Hagenlocher
1 Ruoxuan Wang
1 Ryan Syed
1 eitsupi
1 olivroy
</code></pre></div></div>
<h2 id="roadmap">Roadmap</h2>
<p>We are planning a 1.0.0 release of the libraries this year, either as the
next release or the one after.</p>
<h2 id="getting-involved">Getting Involved</h2>
<p>We welcome questions and contributions from all interested. Issues
can be filed on <a href="https://github.com/apache/arrow-adbc/issues">GitHub</a>, and questions can be directed to GitHub
or the <a href="/community/">Arrow mailing lists</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 0.10.0 release of the Apache Arrow ADBC libraries. This covers includes 31 resolved issues from 18 distinct contributors. This is a release of the libraries, which are at version 0.10.0. The API specification is versioned separately and is at version 1.1.0. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Release Highlights The BigQuery driver now handles large result sets, and supports passing a scope when authenticating. It also has better support for ARRAY types. The C++ implementation now requires C++17 or later. The Flight SQL driver now supports the incremental execution feature with Flight SQL services that implement PollFlightInfo. Also, it will reuse credentials/cookies when creating sub-clients to fetch data. The Go libraries now expose a Close method on AdbcDatabase structs; this is a potentially breaking change. The Java implementation now has Checker Framework nullness annotations. The driver manager interface was overhauled; this is a breaking change to the library APIs. The Java Flight SQL driver now supports getObjects. The PostgreSQL driver will add the PostgreSQL type name to the field metadata of columns of NUMERIC type. Also, it now handles ENUM types. The Python bindings now return the underlying PyArrow RecordBatchReader when requested, instead of the “wrapped” reader, due to a crash. This means that callers will not get ADBC-wrapped exceptions from the reader. Also, Ctrl-C will now interrupt ADBC operations on the main thread. The Snowflake driver now has much, much faster bulk ingestion speed as it now uploads bulk Parquet files instead of using bind parameters. Contributors $ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-0.9.0..apache-arrow-adbc-0.10.0 22 David Li 6 Dewey Dunnington 4 James Duong 4 Sutou Kouhei 4 William Ayd 4 davidhcoe 3 Matt Topol 2 Bruce Irschick 2 Joel Lubinitsky 2 Lubo Slivka 2 Soumya D. Sanyal 1 Anton Levakin 1 Bryce Mecum 1 Curt Hagenlocher 1 Ruoxuan Wang 1 Ryan Syed 1 eitsupi 1 olivroy Roadmap We are planning on a 1.0.0 release of the libraries this year, either for the next release or the release after that. Getting Involved We welcome questions and contributions from all interested. Issues can be filed on GitHub, and questions can be directed to GitHub or the Arrow mailing lists.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow nanoarrow 0.4.0 Release</title><link href="https://arrow.apache.org/blog/2024/01/29/nanoarrow-0.4.0-release/" rel="alternate" type="text/html" title="Apache Arrow nanoarrow 0.4.0 Release" /><published>2024-01-29T00:00:00-05:00</published><updated>2024-01-29T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/01/29/nanoarrow-0.4.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/01/29/nanoarrow-0.4.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 0.4.0 release of
Apache Arrow nanoarrow. This release covers 46 resolved issues from
5 contributors.</p>
<h2 id="release-highlights">Release Highlights</h2>
<p>The primary focus of the nanoarrow 0.4.0 release was testing, stability, and code
quality. Notably, an implementation of the
<a href="https://arrow.apache.org/docs/format/Integration.html#example-c-data-interface">C data interface integration test</a>
protocol was added to ensure data produced or consumed by nanoarrow can be
consumed or produced by other Arrow implementations.</p>
<p>Apache Arrow nanoarrow 0.4.0 also contains experimental <a href="#python-bindings">Python bindings</a>
to the C library for the purposes of community testing and feedback while a more stable
set of bindings is prepared (targeted for 0.5.0).</p>
<p>See the
<a href="https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.4.0/CHANGELOG.md">Changelog</a>
for a detailed list of contributions to this release.</p>
<h2 id="breaking-changes">Breaking Changes</h2>
<p>Changes included in the nanoarrow 0.4.0 release will not break most downstream
code; however, several changes in the C library may result in additional compiler
warnings that could cause downstream build failures for projects with strict
compiler warning policies.</p>
<p>First, in debug mode (i.e., when <code class="language-plaintext highlighter-rouge">NANOARROW_DEBUG</code> is defined), an ignored return
value for functions that return <code class="language-plaintext highlighter-rouge">ArrowErrorCode</code> now issues a compiler warning
for compilers that support an “unused result” attribute. Ignoring the return value
of these functions is a common error and return values that are not
equal to <code class="language-plaintext highlighter-rouge">NANOARROW_OK</code> should be propagated as soon as possible. The C library
provides tools to check return values in a readable way. Notably:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">NANOARROW_RETURN_NOT_OK()</code> can be used in a wrapper function that also returns
<code class="language-plaintext highlighter-rouge">ArrowErrorCode</code>.</li>
<li><code class="language-plaintext highlighter-rouge">NANOARROW_THROW_NOT_OK()</code> can be used from C++ code that inclues <code class="language-plaintext highlighter-rouge">nanoarrow.hpp</code>
and is prepared to handle exceptions.</li>
<li><code class="language-plaintext highlighter-rouge">NANOARROW_ASSERT_OK()</code> can be used to to check for <code class="language-plaintext highlighter-rouge">NANOARROW_OK</code> only in
debug mode (i.e., silently ignore errors in release mode and will crash in
debug mode with a message indicating the location of the error).</li>
</ul>
<p>Of these, the first or second is preferred. The
<a href="https://arrow.apache.org/nanoarrow/main/getting-started/cpp.html#arrow-c-data-nanoarrow-interface-basics">Getting Started with nanoarrow in C/C++ tutorial</a>
includes examples and advice for handling errors emanating from the nanoarrow C
library.</p>
<p>Second, in debug mode (i.e., when <code class="language-plaintext highlighter-rouge">NANOARROW_DEBUG</code> is defined), the appropriate
attribute was added to check the format string passed to <code class="language-plaintext highlighter-rouge">ArrowErrorSet()</code>
against the provided arguments. Correct code will be unaffected by this change;
however, actual arguments that do not match the format string (e.g., an <code class="language-plaintext highlighter-rouge">int64_t</code>
that is passed to <code class="language-plaintext highlighter-rouge">ArrowErrorSet()</code> with a format string of <code class="language-plaintext highlighter-rouge">"%d"</code>) should be
cast to the appropriate C type (e.g., <code class="language-plaintext highlighter-rouge">int</code>) or the format string should be fixed
to support the type of the actual argument (e.g., using <code class="language-plaintext highlighter-rouge">"%" PRId64</code>).</p>
<p>Third, functions in the C library that do not take ownership of or modify input
are now properly marked as <code class="language-plaintext highlighter-rouge">const</code>. For example, <code class="language-plaintext highlighter-rouge">ArrowArrayViewGetIntUnsafe()</code>
previously accepted a <code class="language-plaintext highlighter-rouge">struct ArrowArrayView*</code> and now accepts a
<code class="language-plaintext highlighter-rouge">const struct ArrowArrayView*</code>. This change makes it more difficult to
accidentally modify input intended to be read-only and improves usability
from C++. Downstream projects that get a new warning about discarding a <code class="language-plaintext highlighter-rouge">const</code>
qualifier may need to adjust variable declarations or formal parameter types;
however, most projects should be unaffected by this change.</p>
<h3 id="cc">C/C++</h3>
<p>The nanoarrow 0.4.0 release includes a number of bugfixes and improvements
to the core C library and C++ helpers.</p>
<ul>
<li>An implementation of the
<a href="https://arrow.apache.org/docs/format/Integration.html#example-c-data-interface">C data interface integration test</a>
was added, including a reader/writer for the Arrow integration testing JSON
format. This was used to improve test coverage of the IPC reader and to
add nanoarrow as a participating member of integration testing in the CI
job that runs in the main Arrow repository.</li>
<li>The C library now supports a wider range of extended compiler warnings to
make it easier to vendor in projects with strict compiler warning policies.</li>
<li>C++ helpers were improved to support const-correctness. As a result,
<code class="language-plaintext highlighter-rouge">UniqueSchema</code>, <code class="language-plaintext highlighter-rouge">UniqueArray</code>, <code class="language-plaintext highlighter-rouge">UniqueArrayView</code>, and <code class="language-plaintext highlighter-rouge">UniqueBuffer</code> now
work with a wider variety of C++ containers (e.g., <code class="language-plaintext highlighter-rouge">std::unordered_map</code>).</li>
</ul>
<h3 id="r-bindings">R bindings</h3>
<p>The nanoarrow R bindings are distributed as the <code class="language-plaintext highlighter-rouge">nanoarrow</code> package on
<a href="https://cran.r-project.org/">CRAN</a>. The 0.4.0 release of the R bindings includes
improvements in type support and stability. Notably:</p>
<ul>
<li>Documentation was improved for low-level users of nanoarrow who are producing or
consuming <code class="language-plaintext highlighter-rouge">ArrowArray</code>, <code class="language-plaintext highlighter-rouge">ArrowSchema</code>, and/or <code class="language-plaintext highlighter-rouge">ArrowArrayStream</code> structures
from C or C++ code in other R packages.</li>
<li>Improved conversion of <code class="language-plaintext highlighter-rouge">list()</code>s to support more types when the <strong>arrow</strong> R
package is not available.</li>
<li>Added more implementations of <code class="language-plaintext highlighter-rouge">as_nanoarrow_array_stream()</code> to support more object
types from the <strong>arrow</strong> R package.</li>
<li>Added conversion from Arrow integer arrays to <code class="language-plaintext highlighter-rouge">character()</code>.</li>
</ul>
<h3 id="python-bindings">Python bindings</h3>
<p>The nanoarrow 0.4.0 release is the first release that contains Python bindings to the
nanoarrow C library! These initial Python bindings are experimental and are provided
to solicit an initial round of feedback from the Arrow community. Like the nanoarrow
C library and R bindings, they provide tools to facilitate the use of the
<a href="https://arrow.apache.org/docs/format/CDataInterface.html">Arrow C Data</a>
and <a href="https://arrow.apache.org/docs/format/CStreamInterface.html">Arrow C Stream</a>
interfaces.</p>
<p>You can install the initial release of the Python bindings from PyPI. The <code class="language-plaintext highlighter-rouge">nanoarrow</code>
Python package has been submitted to conda-forge and should be available once the
recipe has been reviewed.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>nanoarrow
</code></pre></div></div>
<p>The initial release of the Python bindings contains <code class="language-plaintext highlighter-rouge">repr()</code>s to print out human-readable
representations of structures in the Arrow C Data and Stream interfaces.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">nanoarrow</span> <span class="k">as</span> <span class="n">na</span>
<span class="kn">import</span> <span class="n">pyarrow</span> <span class="k">as</span> <span class="n">pa</span>
<span class="n">na</span><span class="p">.</span><span class="nf">c_schema</span><span class="p">(</span><span class="n">pa</span><span class="p">.</span><span class="nf">decimal128</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;nanoarrow.c_lib.CSchema decimal128(10, 3)&gt;
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">na</span><span class="p">.</span><span class="nf">c_array</span><span class="p">(</span><span class="n">pa</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="sh">"</span><span class="s">one</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">two</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">three</span><span class="sh">"</span><span class="p">,</span> <span class="bp">None</span><span class="p">]))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;nanoarrow.c_lib.CArray string&gt;
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- dictionary: NULL
- children[0]:
</code></pre></div></div>
<p>In addition to Arrow C Data interface wrappers, the initial nanoarrow Python bindings expose
wrappers for a few nanoarrow C library types like the <code class="language-plaintext highlighter-rouge">ArrowArrayView</code> that can be used to
interpret the content of the raw structures.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">na</span><span class="p">.</span><span class="nf">c_array_view</span><span class="p">(</span><span class="n">pa</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="sh">"</span><span class="s">one</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">two</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">three</span><span class="sh">"</span><span class="p">,</span> <span class="bp">None</span><span class="p">]))</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;nanoarrow.c_lib.CArrayView&gt;
- storage_type: 'string'
- length: 4
- offset: 0
- null_count: 1
- buffers[3]:
- &lt;bool validity[1 b] 11100000&gt;
- &lt;int32 data_offset[20 b] 0 3 6 11 11&gt;
- &lt;string data[11 b] b'onetwothree'&gt;
- dictionary: NULL
- children[0]:
</code></pre></div></div>
<p>Finally, the initial bindings contain a user-facing “data type” class. The <code class="language-plaintext highlighter-rouge">Schema</code>, like its
C Data interface counterpart, can represent a <code class="language-plaintext highlighter-rouge">pyarrow.Schema</code>, a <code class="language-plaintext highlighter-rouge">pyarrow.Field</code>, or a
<code class="language-plaintext highlighter-rouge">pyarrow.DataType</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">na</span><span class="p">.</span><span class="nf">int32</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Schema(INT32)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">na</span><span class="p">.</span><span class="nf">struct</span><span class="p">({</span><span class="sh">"</span><span class="s">col1</span><span class="sh">"</span><span class="p">:</span> <span class="n">na</span><span class="p">.</span><span class="nf">int32</span><span class="p">()})</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Schema(STRUCT, fields=[Schema(INT32, name='col1')])
</code></pre></div></div>
<p>The next release of nanoarrow for Python will include a user-facing <code class="language-plaintext highlighter-rouge">Array</code> class among other
improvements and features based on community feedback! For a more in-depth review of the
initial Python bindings, see the
<a href="https://arrow.apache.org/nanoarrow/latest/getting-started/python.html">Getting started in Python guide</a> and the
<a href="https://arrow.apache.org/nanoarrow/latest/reference/python.html">Python API reference</a></p>
<h2 id="contributors">Contributors</h2>
<p>This release consists of contributions from 5 contributors in addition
to the invaluable advice and support of the Apache Arrow developer mailing list.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">$</span><span class="w"> </span>git shortlog <span class="nt">-sn</span> 798a1b8f096c84e2b6f887427649f1cb496412b2..apache-arrow-nanoarrow-0.4.0 | <span class="nb">grep</span> <span class="nt">-v</span> <span class="s2">"GitHub Actions"</span>
<span class="go"> 35 Dewey Dunnington
3 William Ayd
2 Dirk Eddelbuettel
2 Joris Van den Bossche
1 eitsupi
</span></code></pre></div></div>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 0.4.0 release of Apache Arrow nanoarrow. This release covers 46 resolved issues from 5 contributors. Release Highlights The primary focus of the nanoarrow 0.4.0 release was testing, stability, and code quality. Notably, an implementation of the C data interface integration test protocol was added to ensure data produced or consumed by nanoarrow can be consumed or produced by other Arrow implementations. Apache Arrow nanoarrow 0.4.0 also contains experimental Python bindings to the C library for the purposes of community testing and feedback while a more stable set of bindings is prepared (targeted for 0.5.0). See the Changelog for a detailed list of contributions to this release. Breaking Changes Changes included in the nanoarrow 0.4.0 release will not break most downstream code; however, several changes in the C library may result in additional compiler warnings that could cause downstream build failures for projects with strict compiler warning policies. First, in debug mode (i.e., when NANOARROW_DEBUG is defined), an ignored return value for functions that return ArrowErrorCode now issues a compiler warning for compilers that support an “unused result” attribute. Ignoring the return value of these functions is a common error and return values that are not equal to NANOARROW_OK should be propagated as soon as possible. The C library provides tools to check return values in a readable way. Notably: NANOARROW_RETURN_NOT_OK() can be used in a wrapper function that also returns ArrowErrorCode. NANOARROW_THROW_NOT_OK() can be used from C++ code that inclues nanoarrow.hpp and is prepared to handle exceptions. NANOARROW_ASSERT_OK() can be used to to check for NANOARROW_OK only in debug mode (i.e., silently ignore errors in release mode and will crash in debug mode with a message indicating the location of the error). Of these, the first or second is preferred. The Getting Started with nanoarrow in C/C++ tutorial includes examples and advice for handling errors emanating from the nanoarrow C library. Second, in debug mode (i.e., when NANOARROW_DEBUG is defined), the appropriate attribute was added to check the format string passed to ArrowErrorSet() against the provided arguments. Correct code will be unaffected by this change; however, actual arguments that do not match the format string (e.g., an int64_t that is passed to ArrowErrorSet() with a format string of "%d") should be cast to the appropriate C type (e.g., int) or the format string should be fixed to support the type of the actual argument (e.g., using "%" PRId64). Third, functions in the C library that do not take ownersip of or modify input are now properly marked as const. For example, ArrowArrayViewGetIntUnsafe() previously accepted a struct ArrowArrayView* and now accepts a const struct ArrowArrayView*. This change makes it more difficult to accidentally modify input intended to be read-only and improves usability from C++. Downstream projects that get a new warning about discarding a const qualifier may need to adjust variable declarations or formal parameter types; however, most projects should be unaffected by this change. C/C++ The nanoarrow 0.4.0 release includes a number of bugfixes and improvements to the core C library and C++ helpers. 
An implementation of the C data interface integration test was added, including a reader/writer for the Arrow integration testing JSON format. This was used to improve test coverage of the IPC reader and to add nanoarrow as a participating member of integration testing in the CI job that runs in the main Arrow repository. The C library now supports a wider range of extended compiler warnings to make it easier to vendor in projects with strict compiler warning policies. C++ helpers were improved to support const-correctness. As a result, the UniqueSchema, UniqueArray, UniqueArrayView, and UniqueBuffer now work with a wider variety of C++ wrappers (e.g., std::unordered_map). R bindings The nanoarrow R bindings are distributed as the nanoarrow package on CRAN. The 0.4.0 release of the R bindings includes improvements in type support and stability. Notably: Documentation was improved for low-level users of nanoarrow that are producing or consuming ArrowArray, ArrowSchema, and/or ArrowArrayStream structures from C or C++ code in other R packages. Improved conversion of list()s to support more types when the arrow R package is not available. Added more implmentations of as_nanoarrow_array_stream() to support more object types from the arrow R package. Added conversion from Arrow integer arrays to character(). Python bindings The nanoarrow 0.4.0 release is the first release that contains Python bindings to the nanoarrow C library! These initial Python bindings are experiemntal and are provided to solicit an initial round of feedback from the Arrow community. Like the nanoarrow C library and R bindings, it provides tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces. You can install the initial release of the Python bindings from PyPI. The nanoarrow Python package has been submitted to conda-forge and should be available once the recipe has been reviewed. pip install nanoarrow The initial release of the Python bindings contain repr()s to print out human-readable representations of structures in the Arrow C Data and Stream interfaces. import nanoarrow as na import pyarrow as pa na.c_schema(pa.decimal128(10, 3)) &lt;nanoarrow.c_lib.CSchema decimal128(10, 3)&gt; - format: 'd:10,3' - name: '' - flags: 2 - metadata: NULL - dictionary: NULL - children[0]: na.c_array(pa.array(["one", "two", "three", None])) &lt;nanoarrow.c_lib.CArray string&gt; - length: 4 - offset: 0 - null_count: 1 - buffers: (2939032895680, 2939032895616, 2939032895744) - dictionary: NULL - children[0]: In addition to Arrow C Data interface wrappers, the initial nanoarrow Python bindings expose wrappers for a few nanoarrow C library types like the ArrowArrayView that can be used to interpret the content of the raw structures. na.c_array_view(pa.array(["one", "two", "three", None])) &lt;nanoarrow.c_lib.CArrayView&gt; - storage_type: 'string' - length: 4 - offset: 0 - null_count: 1 - buffers[3]: - &lt;bool validity[1 b] 11100000&gt; - &lt;int32 data_offset[20 b] 0 3 6 11 11&gt; - &lt;string data[11 b] b'onetwothree'&gt; - dictionary: NULL - children[0]: Finally, the initial bindings contain a user-facing “data type” class. The Schema, like its C Data interface counterpart, can represent a pyarrow.Schema, a pyarrow.Field, or a pyarrow.DataType. na.int32() Schema(INT32) na.struct({"col1": na.int32()}) Schema(STRUCT, fields=[Schema(INT32, name='col1')]) The next release of nanoarrow for Python will include a user-facing Array class among other improvements and features based on community feedback! 
For a more in-depth review of the initial Python bindings, see the Getting started in Python guide and the Python API reference Contributors This release consists of contributions from 4 contributors in addition to the invaluable advice and support of the Apache Arrow developer mailing list. $ git shortlog -sn 798a1b8f096c84e2b6f887427649f1cb496412b2..apache-arrow-nanoarrow-0.4.0 | grep -v "GitHub Actions" 35 Dewey Dunnington 3 William Ayd 2 Dirk Eddelbuettel 2 Joris Van den Bossche 1 eitsupi]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow 15.0.0 Release</title><link href="https://arrow.apache.org/blog/2024/01/21/15.0.0-release/" rel="alternate" type="text/html" title="Apache Arrow 15.0.0 Release" /><published>2024-01-21T00:00:00-05:00</published><updated>2024-01-21T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/01/21/15.0.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/01/21/15.0.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 15.0.0 release. This covers
over 3 months of development work and includes <a href="https://github.com/apache/arrow/milestone/56?closed=1"><strong>344 resolved issues</strong></a>
on <a href="/release/15.0.0.html#contributors"><strong>536 distinct commits</strong></a> from <a href="/release/15.0.0.html#contributors"><strong>101 distinct contributors</strong></a>.
See the <a href="https://arrow.apache.org/install/">Install Page</a>
to learn how to get the libraries for your platform.</p>
<p>The release notes below are not exhaustive and only expose selected highlights
of the release. Many other bugfixes and improvements have been made: we refer
you to the <a href="/release/15.0.0.html#changelog">complete changelog</a>.</p>
<h2 id="community">Community</h2>
<p>Since the 14.0.0 release, Curt Hagenlocher, Xuwei Fu, James Duong, and Felipe Oliveira Carvalho
have been invited to be committers.
Jonathan Keane and Raúl Cumplido have joined the Project Management Committee (PMC).</p>
<p>As per our tradition of rotating the PMC chair once a year,
Andy Grove was elected as the new PMC chair and VP.</p>
<p>Thanks for your contributions and participation in the project!</p>
<h2 id="c-data-interface-notes">C Data Interface notes</h2>
<p>New format strings have been added for ListView, LargeListView, BinaryView and StringView array types.</p>
<h2 id="arrow-flight-rpc-notes">Arrow Flight RPC notes</h2>
<p>Flight SQL is now considered stable (<a href="https://github.com/apache/arrow/issues/39037">GH-39037</a>). The Flight SQL specification was clarified regarding how the result set schema of a prepared statement is affected by bound parameters (<a href="https://github.com/apache/arrow/issues/37061">GH-37061</a>).</p>
<p>The JDBC Arrow Flight SQL driver now supports mTLS authentication (<a href="https://github.com/apache/arrow/issues/38460">GH-38460</a>) and bind parameters (<a href="https://github.com/apache/arrow/issues/33475">GH-33475</a>), follows the Flight RPC spec when fetching data (<a href="https://github.com/apache/arrow/issues/34532">GH-34532</a>), and can reuse credentials across metadata and data connections (<a href="https://github.com/apache/arrow/issues/38576">GH-38576</a>). On macOS it will also use the system keychain to be consistent with other platforms (<a href="https://github.com/apache/arrow/issues/39014">GH-39014</a>). Applications can also retrieve the underlying Flight RPC metadata from the JDBC driver (GH-38024, GH-38022).</p>
<h2 id="c-notes">C++ notes</h2>
<p>For C++ notes refer to the full changelog.</p>
<h3 id="parquet">Parquet</h3>
<h4 id="new-features">New features:</h4>
<ul>
<li>Support row group filtering for nested paths for C++ and Parquet (<a href="https://github.com/apache/arrow/issues/39064">GH-39064</a>)</li>
<li>Implement Parquet Float16 logical type (<a href="https://github.com/apache/arrow/issues/36036">GH-36036</a>)</li>
<li>Expose sorting_columns in RowGroupMetaData for Parquet files (<a href="https://github.com/apache/arrow/issues/35331">GH-35331</a>)</li>
<li>Support decompressing concatenated gzip members (stream) (<a href="https://github.com/apache/arrow/issues/38271">GH-38271</a>)</li>
</ul>
<h4 id="api-change">API change:</h4>
<ul>
<li>Move EstimatedBufferedValueBytes from TypedColumnWriter to ColumnWriter (<a href="https://github.com/apache/arrow/issues/38887">GH-38887</a>)</li>
<li>Change parquet TypedComparator operation to const methods (<a href="https://github.com/apache/arrow/issues/38874">GH-38874</a>)</li>
<li>Remove deprecated AppendRowGroup(int64_t num_rows) (<a href="https://github.com/apache/arrow/issues/39208">GH-39208</a>)</li>
<li>Add an API to get RecordReader from RowGroupReader (<a href="https://github.com/apache/arrow/issues/37002">GH-37002</a>)</li>
</ul>
<h4 id="bug-fixes">Bug fixes:</h4>
<ul>
<li>Add more closed-file checks for ParquetFileWriter to prevent use-after-close (<a href="https://github.com/apache/arrow/issues/38390">GH-38390</a>)</li>
</ul>
<h4 id="performance-enhancement">Performance enhancement:</h4>
<ul>
<li>Faster Scalar BYTE_STREAM_SPLIT encoding/decoding (<a href="https://github.com/apache/arrow/issues/38542">GH-38542</a>)</li>
<li>Faster reading Parquet FLBA (GH-39124, GH-39413)</li>
<li>Using bloom_filter_length in parquet 2.10 to optimize bloom filter read (<a href="https://github.com/apache/arrow/issues/38860">GH-38860</a>)</li>
</ul>
<h3 id="miscellaneous">Miscellaneous</h3>
<ul>
<li>Upgrade ORC to 1.9.2 (<a href="https://github.com/apache/arrow/issues/39430">GH-39340</a>)</li>
</ul>
<h2 id="c-notes-1">C# notes</h2>
<p>Removal of build targets:</p>
<ul>
<li>Remove out-of-support versions of .NET and update C# README <a href="https://github.com/apache/arrow/pull/39165">GH-31579</a></li>
</ul>
<p>New features:</p>
<ul>
<li>Better support for decimal values which exceed the range of the BCL’s System.Decimal <a href="https://github.com/apache/arrow/pull/38481">GH-38351</a>, <a href="https://github.com/apache/arrow/pull/38508">GH-38483</a></li>
<li>Expose ArrayDataConcatenator.Concatenate publicly <a href="https://github.com/apache/arrow/pull/38154">GH-38153</a></li>
<li>Add ToString methods to Arrow classes <a href="https://github.com/apache/arrow/pull/30717">GH-36566</a></li>
<li>Implement common interfaces for structure arrays and record batches <a href="https://github.com/apache/arrow/pull/38759">GH-38757</a></li>
<li>Make primitive arrays support IReadOnlyList&lt;T?&gt; <a href="https://github.com/apache/arrow/pull/38680">GH-38348</a>, <a href="https://github.com/apache/arrow/pull/39224">GH-39223</a></li>
<li>Add ToList to Decimal128Array and Decimal256Array <a href="https://github.com/apache/arrow/pull/37383">GH-37359</a></li>
<li>Support additional types Interval, Utf8View, BinaryView and ListView <a href="https://github.com/apache/arrow/pull/39043">GH-38316</a>, <a href="https://github.com/apache/arrow/pull/39342">GH-39341</a></li>
<li>Support creating FlightClient with Grpc.Core.Channel <a href="https://github.com/apache/arrow/pull/39348">GH-39335</a></li>
</ul>
<p>Fixes and improved compatibility:</p>
<ul>
<li>Make dictionaries in file and memory implementations work correctly and support integration tests <a href="https://github.com/apache/arrow/pull/39146">GH-32662</a></li>
<li>Support blank column names and enable more integration tests <a href="https://github.com/apache/arrow/pull/39167">GH-36588</a></li>
</ul>
<h2 id="go-notes">Go Notes</h2>
<h3 id="bug-fixes-1">Bug Fixes</h3>
<h4 id="arrow">Arrow</h4>
<ul>
<li>Ensured reliability of AuthenticateBasicToken behind proxies (<a href="https://github.com/apache/arrow/issues/38198">GH-38198</a>)</li>
<li>Ensured release callback is properly called on C Data imported arrays/batches (<a href="https://github.com/apache/arrow/issues/38281">GH-38281</a>)</li>
<li>Fixed rounding errors in decimal256 string functions (<a href="https://github.com/apache/arrow/issues/38395">GH-38395</a>)</li>
<li>Added <code class="language-plaintext highlighter-rouge">ValueLen</code> to Binary and String array interface (<a href="https://github.com/apache/arrow/issues/38458">GH-38458</a>)</li>
<li>Fixed Decimal128 rounding issues (<a href="https://github.com/apache/arrow/issues/38477">GH-38477</a>)</li>
<li>Fixed memory leak in IPC LZ4 decompressor (<a href="https://github.com/apache/arrow/issues/38728">GH-38728</a>)</li>
<li>Addressed a data race in <code class="language-plaintext highlighter-rouge">GetToTimeFunc</code> for fixed timestamp data types (<a href="https://github.com/apache/arrow/issues/38795">GH-38795</a>)</li>
<li>Fixed “index out of range” error for empty resultsets of FlightSQL driver (<a href="https://github.com/apache/arrow/issues/39238">GH-39238</a>)</li>
</ul>
<h4 id="parquet-1">Parquet</h4>
<ul>
<li>Fixed issue with max definition levels when writing a Parquet file under certain circumstances (<a href="https://github.com/apache/arrow/issues/38503">GH-38503</a>)</li>
<li>File writer now properly tracks the number of rows written beyond the last row group (<a href="https://github.com/apache/arrow/issues/38516">GH-38516</a>)</li>
</ul>
<h3 id="enhancements">Enhancements</h3>
<h4 id="arrow-1">Arrow</h4>
<ul>
<li>Added an Avro OCF reader for converting Avro files directly to Arrow record batches (<a href="https://github.com/apache/arrow/issues/36760">GH-36760</a>)</li>
<li>Added support for StringView (<a href="https://github.com/apache/arrow/issues/38718">GH-38718</a>) and C Data ABI StringViews (<a href="https://github.com/apache/arrow/issues/39013">GH-39013</a>)</li>
<li>GC Checks were enabled for CI running integration tests (<a href="https://github.com/apache/arrow/issues/38824">GH-38824</a>)</li>
</ul>
<h4 id="parquet-2">Parquet</h4>
<ul>
<li>Implemented Float16 logical type for Parquet files (<a href="https://github.com/apache/arrow/issues/37582">GH-37582</a>)</li>
<li>Added proper boolean RLE encoding/decoding (<a href="https://github.com/apache/arrow/issues/38462">GH-38462</a>)</li>
</ul>
<h3 id="bug-fixes-2">Bug Fixes</h3>
<h3 id="enhancements-1">Enhancements</h3>
<h2 id="java-notes">Java notes</h2>
<p><strong>We expect a breaking change in the next release, Arrow 16.0.0.</strong> Support for Java 9 modules is coming, but that will require changing the JVM flags used to launch your application (<a href="https://github.com/apache/arrow/issues/38998">GH-38998</a>). Arrow 15.0.0 is not affected.</p>
<p>A bill-of-materials (BOM) package was added to make it easier to depend on multiple Arrow libraries (<a href="https://github.com/apache/arrow/issues/38264">GH-38264</a>).</p>
<p>The JDBC adapter (separate from the JDBC driver) now supports 256-bit decimals (<a href="https://github.com/apache/arrow/issues/39484">GH-39484</a>) and throws more informative exceptions (<a href="https://github.com/apache/arrow/issues/39355">GH-39355</a>).</p>
<p>Various improvements were made to utilities for working with vectors (GH-38662, GH-38614, GH-38511, GH-38254, GH-38246).</p>
<h2 id="javascript-notes">JavaScript notes</h2>
<p>This release comes with new features and APIs. We also removed <code class="language-plaintext highlighter-rouge">getByteLength</code> to reduce bundle sizes.</p>
<p>New Features with API changes</p>
<ul>
<li><a href="https://github.com/apache/arrow/pull/39018">GH-39017: [JS] Add typeId as attribute</a></li>
<li><a href="https://github.com/apache/arrow/pull/39258">GH-39257: [JS] LargeBinary</a></li>
<li><a href="https://github.com/apache/arrow/pull/35780">GH-15060: [JS] Add LargeUtf8 type</a></li>
<li><a href="https://github.com/apache/arrow/pull/39260">GH-39259: [JS] Remove getByteLength</a></li>
<li><a href="https://github.com/apache/arrow/pull/39436">GH-39435: [JS] Add Vector.nullable</a></li>
<li><a href="https://github.com/apache/arrow/pull/39256">GH-39255: [JS] Allow customization of schema when passing vectors to table constructor</a></li>
<li><a href="https://github.com/apache/arrow/pull/39254">GH-37983: [JS] Allow nullable fields in table when constructed from vector with nulls</a></li>
</ul>
<p>Package changes</p>
<ul>
<li><a href="https://github.com/apache/arrow/pull/39475">GH-39289: [JS] Add types to exports</a></li>
</ul>
<h2 id="python-notes">Python notes</h2>
<p>Compatibility notes:</p>
<ul>
<li>The legacy <code class="language-plaintext highlighter-rouge">ParquetDataset</code> custom implementation has been removed; only the new dataset API is now used <a href="https://github.com/apache/arrow/issues/31303">GH-31303</a>.</li>
</ul>
<p>New features:</p>
<ul>
<li>PyArrow version 14.0.0 included a new specification for Arrow PyCapsules and related dunder methods <a href="https://github.com/apache/arrow/pull/37797">GH-35531</a>. A public <code class="language-plaintext highlighter-rouge">RecordBatchReader</code> constructor from any stream object implementing the PyCapsule protocol has now been added <a href="https://github.com/apache/arrow/issues/39217">GH-39217</a>, together with additional documentation <a href="https://github.com/apache/arrow/issues/39196">GH-39196</a> (a short sketch follows this list).</li>
<li>DLPack protocol support (producer) was added to the Arrow C++ library and is exposed in Python through the <code class="language-plaintext highlighter-rouge">__dlpack__</code> and <code class="language-plaintext highlighter-rouge">__dlpack_device__</code> dunder methods <a href="https://github.com/apache/arrow/issues/33984">GH-33984</a>.</li>
<li>Python now exposes options to enable CRC checksums for Parquet read and write operations <a href="https://github.com/apache/arrow/issues/37242">GH-37242</a>. CRC checksums are optional and can detect data corruption.</li>
<li><code class="language-plaintext highlighter-rouge">CacheOptions</code> are now configurable from Python as part of the <code class="language-plaintext highlighter-rouge">pyarrow.dataset.ParquetFragmentScanOptions</code> <a href="https://github.com/apache/arrow/issues/36441">GH-36441</a>.</li>
<li>Parquet metadata indicating the sort order of the data is now exposed in <code class="language-plaintext highlighter-rouge">RowGroupMetaData</code> <a href="https://github.com/apache/arrow/issues/35331">GH-35331</a>.</li>
<li>Parquet page CRC checksums can now be written and validated (<a href="https://github.com/apache/arrow/issues/37242">GH-37242</a>).</li>
</ul>
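<p>As a rough sketch of the two protocol additions above (assuming PyArrow
15.0.0, where <code class="language-plaintext highlighter-rouge">pa.Table</code> implements
<code class="language-plaintext highlighter-rouge">__arrow_c_stream__</code> and primitive arrays without nulls
implement <code class="language-plaintext highlighter-rouge">__dlpack__</code>):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch of the PyCapsule and DLPack protocol support.
import numpy as np
import pyarrow as pa

table = pa.table({"x": [1, 2, 3]})
# Any object implementing __arrow_c_stream__ can serve as the source here;
# pa.Table itself implements the protocol.
reader = pa.RecordBatchReader.from_stream(table)
print(reader.read_all())

# DLPack (producer side): hand a primitive, null-free array to NumPy.
arr = pa.array([1.0, 2.0, 3.0])
print(np.from_dlpack(arr))
</code></pre></div></div>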
<p>Other improvements:</p>
<ul>
<li>The append parameter from <code class="language-plaintext highlighter-rouge">FileOutputStream</code> is now exposed on the <code class="language-plaintext highlighter-rouge">OSFile</code> class <a href="https://github.com/apache/arrow/issues/38857">GH-38857</a>.</li>
<li>File size can be passed to <code class="language-plaintext highlighter-rouge">make_fragment</code> in the pyarrow datasets (<code class="language-plaintext highlighter-rouge">pyarrow.dataset.FileFormat</code> and <code class="language-plaintext highlighter-rouge">pyarrow.dataset.ParquetFileFormat</code>) <a href="https://github.com/apache/arrow/issues/37857">GH-37857</a>.</li>
<li>Support for the mask parameter was added to <code class="language-plaintext highlighter-rouge">FixedSizeListArray.from_arrays</code> <a href="https://github.com/apache/arrow/issues/34316">GH-34316</a>.</li>
<li><code class="language-plaintext highlighter-rouge">to/from_struct_array</code> methods were added to the <code class="language-plaintext highlighter-rouge">pyarrow.Table</code> class <a href="https://github.com/apache/arrow/issues/33500">GH-33500</a> (see the sketch after this list).</li>
<li>The GIL is now released in <code class="language-plaintext highlighter-rouge">.nbytes</code>, improving performance when calculating the data size <a href="https://github.com/apache/arrow/issues/39096">GH-39096</a>.</li>
<li>Usage of pandas internals <code class="language-plaintext highlighter-rouge">DatetimeTZBlock</code> has been removed <a href="https://github.com/apache/arrow/issues/38341">GH-38341</a>.</li>
<li><code class="language-plaintext highlighter-rouge">DataType</code> instance can be passed to <code class="language-plaintext highlighter-rouge">MapType.from_arrays</code> constructor <a href="https://github.com/apache/arrow/issues/39515">GH-39515</a>.</li>
</ul>
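<p>A minimal sketch of the struct array round trip, assuming
<code class="language-plaintext highlighter-rouge">from_struct_array</code> accepts the chunked result of
<code class="language-plaintext highlighter-rouge">to_struct_array</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch of the round trip between Table and struct arrays
# (GH-33500).
import pyarrow as pa

table = pa.table({"a": [1, 2], "b": ["x", "y"]})
structs = table.to_struct_array()  # chunked array of struct values
roundtrip = pa.Table.from_struct_array(structs)
assert roundtrip.equals(table)
</code></pre></div></div>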
<p>Relevant bug fixes:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">S3FileSystem</code> equals <code class="language-plaintext highlighter-rouge">None</code> segfault has been fixed <a href="https://github.com/apache/arrow/issues/38535">GH-38535</a>.</li>
<li>No-op kernel is added for <code class="language-plaintext highlighter-rouge">dictionary_encode(dictionary)</code> <a href="https://github.com/apache/arrow/issues/34890">GH-34890</a>.</li>
<li>PrettyPrint for the Timestamp type now appends “Z” to the printed string when a time zone is defined, to indicate that the values are stored in UTC <a href="https://github.com/apache/arrow/issues/30117">GH-30117</a> (see the sketch after this list).</li>
</ul>
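<p>A small sketch of the pretty-printing change (the rendered output shape is
indicative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Timestamp arrays with a defined time zone now print a trailing "Z"
# to signal that the stored values are in UTC.
import pyarrow as pa

arr = pa.array([0], type=pa.timestamp("s", tz="UTC"))
print(arr)  # values render like 1970-01-01 00:00:00Z
</code></pre></div></div>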
<h2 id="r-notes">R notes</h2>
<h3 id="new-features-1">New features:</h3>
<ul>
<li>Bindings for <code class="language-plaintext highlighter-rouge">base::prod</code> have been added so you can now use it in your dplyr pipelines (i.e., <code class="language-plaintext highlighter-rouge">tbl |&gt; summarize(prod(col))</code>) without having to pull the data into R <a href="https://github.com/apache/arrow/pull/38601">GH-38601</a>.</li>
<li>Calling <code class="language-plaintext highlighter-rouge">dimnames</code> or <code class="language-plaintext highlighter-rouge">colnames</code> on <code class="language-plaintext highlighter-rouge">Dataset</code> objects now returns a useful result rather than just <code class="language-plaintext highlighter-rouge">NULL</code> <a href="https://github.com/apache/arrow/pull/38377">GH-38377</a>.</li>
<li>The <code class="language-plaintext highlighter-rouge">code()</code> method on Schema objects now takes an optional <code class="language-plaintext highlighter-rouge">namespace</code> argument which, when <code class="language-plaintext highlighter-rouge">TRUE</code>, prefixes names with <code class="language-plaintext highlighter-rouge">arrow::</code>, making the output more portable <a href="https://github.com/apache/arrow/pull/38144">GH-38144</a>.</li>
</ul>
<h3 id="other-improvements">Other improvements:</h3>
<ul>
<li>To make debugging problems easier when using arrow with AWS S3 (e.g., <code class="language-plaintext highlighter-rouge">s3_bucket</code>, <code class="language-plaintext highlighter-rouge">S3FileSystem</code>), the debug log level for S3 can be set with the <code class="language-plaintext highlighter-rouge">AWS_S3_LOG_LEVEL</code> environment variable. See <code class="language-plaintext highlighter-rouge">?S3FileSystem</code> for more information. <a href="https://github.com/apache/arrow/pull/38267">GH-38267</a></li>
<li>An error is now thrown, instead of a warning being emitted and the data being pulled into R, when any of <code class="language-plaintext highlighter-rouge">sub</code>, <code class="language-plaintext highlighter-rouge">gsub</code>, <code class="language-plaintext highlighter-rouge">stringr::str_replace</code>, or <code class="language-plaintext highlighter-rouge">stringr::str_replace_all</code> are passed a length &gt; 1 vector of values in <code class="language-plaintext highlighter-rouge">pattern</code> <a href="https://github.com/apache/arrow/pull/39219">GH-39219</a>.</li>
<li>Missing documentation was added to <code class="language-plaintext highlighter-rouge">?open_dataset</code> documenting how to use the ND-JSON support added in arrow 13.0.0 <a href="https://github.com/apache/arrow/pull/38258">GH-38258</a>.</li>
<li>Using arrow with duckdb (i.e., <code class="language-plaintext highlighter-rouge">to_duckdb()</code>) no longer results in warnings when quitting your R session. <a href="https://github.com/apache/arrow/pull/38495">GH-38495</a></li>
</ul>
<p>For more on what’s in the 15.0.0 R package, see the <a href="/docs/r/news/">R changelog</a>.</p>
<h2 id="ruby-and-c-glib-notes">Ruby and C GLib notes</h2>
<h3 id="ruby">Ruby</h3>
<ul>
<li>Add <code class="language-plaintext highlighter-rouge">Arrow::Table#each_raw_record</code> and <code class="language-plaintext highlighter-rouge">Arrow::RecordBatch#each_raw_record</code> (<a href="https://github.com/apache/arrow/issues/37137">GH-37137</a>, <a href="https://github.com/apache/arrow/issues/37600">GH-37600</a>)</li>
</ul>
<h3 id="c-glib">C GLib</h3>
<ul>
<li>Follow C++ changes.</li>
</ul>
<h2 id="rust-notes">Rust notes</h2>
<p>The Rust projects have moved to separate repositories outside the
main Arrow monorepo. For notes on the latest release of the Rust
implementation, see the latest <a href="https://github.com/apache/arrow-rs/tags">Arrow Rust changelog</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 15.0.0 release. This covers over 3 months of development work and includes 344 resolved issues on 536 distinct commits from 101 distinct contributors. See the Install Page to learn how to get the libraries for your platform. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Community Since the 14.0.0 release, Curt Hagenlocher, Xuwei Fu, James Duong and Felipe Oliveira Carvalho have been invited to be committers. Jonathan Keane and Raúl Cumplido have joined the Project Management Committee (PMC). As per our tradition of rotating the PMC chair once a year Andy Grove was elected as the new PMC chair and VP. Thanks for your contributions and participation in the project! C Data Interface notes New format strings have been added for ListView, LargeListView, BinaryView and StringView array types. Arrow Flight RPC notes Flight SQL is now considered stable (GH-39037). The Flight SQL specification was clarified regarding how the result set schema of a prepared statement is affected by bound parameters (GH-37061). The JDBC Arrow Flight SQL driver now supports mTLS authentication (GH-38460) and bind parameters (GH-33475), follows the Flight RPC spec when fetching data (GH-34532), and can reuse credentials across metadata and data connections (GH-38576). On macOS it will also use the system keychain to be consistent with other platforms (GH-39014). Applications can also retrieve the underlying Flight RPC metadata from the JDBC driver (GH-38024, GH-38022). C++ notes For C++ notes refer to the full changelog. 
Parquet New features: Support row group filtering for nested paths for C++ and Parquet (GH-39064) Implement Parquet Float16 logical type (GH-36036) Expose sorting_columns in RowGroupMetaData for Parquet files (GH-35331) Support decompressing concatenated gzip members (stream) (GH-38271) API change: Move EstimatedBufferedValueBytes from TypedColumnWriter to ColumnWriter (GH-38887) Change parquet TypedComparator operation to const methods (GH-38874) Remove deprecated AppendRowGroup(int64_t num_rows) (GH-39208) Add api to get RecordReader from RowGroupReader (GH-37002) Bug fixes: Add more closed file checks for ParquetFileWriter to Prevent from used-after-close (GH-38390) Performance enhancement: Faster Scalar BYTE_STREAM_SPLIT encoding/decoding (GH-38542) Faster reading Parquet FLBA (GH-39124, GH-39413) Using bloom_filter_length in parquet 2.10 to optimize bloom filter read (GH-38860) Miscellaneous Upgrade ORC to 1.9.2 (GH-39340) C# notes Removal of build targets: Remove out-of-support versions of .NET and update C# README GH-31579 New features: Better support for decimal values which exceed the range of the BCL’s System.Decimal GH-38351, GH-38483 Expose ArrayDataConcentrator.Concatenate publicly GH-38153 Add ToString methods to Arrow classes GH-36566 Implement common interfaces for structure arrays and record batches GH-38757 Make primitive arrays support IReadOnlyList&lt;T?&gt; GH-38348, GH-39223 Add ToList to Decimal128Array and Decimal256Array GH-37359 Support additional types Interval, Utf8View, BinaryView and ListView GH-38316, GH-39341 Support creating FlightClient with Grpc.Core.Channel GH-39335 Fixes and improved compatibility: Make dictionaries in file and memory implementations work correctly and support integration tests GH-32662 Support blank column names and enable more integration tests GH-36588 Go Notes Bug Fixes Arrow Ensured reliability of AuthenticateBasicToken behind proxies (GH-38198) Ensured release callback is properly called on C Data imported arrays/batches (GH-38281) Fixed rounding errors in decimal256 string functions (GH-38395) Added ValueLen to Binary and String array interface (GH-38458) Fixed Decimal128 rounding issues (GH-38477) Fixed memory leak in IPC LZ4 decompressor (GH-38728) Addressed Data race in GetToTimeFunc for fixed timestamp data types (GH-38795) Fixed “index out of range” error for empty resultsets of FlightSQL driver (GH-39238) Parquet Fixed issue with max definition levels when writing a Parquet file under certain circumstances (GH-38503) File writer now properly tracks the number of rows written beyond the last row group (GH-38516) Enhancements Arrow Added an Avro OCF reader for converting Avro files directly to Arrow record batches (GH-36760) Added support for StringView (GH-38718) and C Data ABI StringViews (GH-39013) GC Checks were enabled for CI running integration tests (GH-38824) Parquet Implemented Float16 logical type for Parquet files (GH-37582) Added proper boolean RLE encoding/decoding (GH-38462) Bug Fixes Enhancements Java notes We expect a breaking change in the next release, Arrow 16.0.0. Support for Java 9 modules is coming, but that will require changing the JVM flags used to launch your application (GH-38998). Arrow 15.0.0 is not affected. A bill-of-materials (BOM) package was added to make it easier to depend on multiple Arrow libraries (GH-38264). The JDBC adapter (separate from the JDBC driver) now supports 256-bit decimals (GH-39484) and throws more informative exceptions (GH-39355). 
Various improvements were made to utilities for working with vectors (GH-38662, GH-38614, GH-38511, GH-38254, GH-38246). JavaScript notes This release comes with new features and APIs. We also removed getByteLength to reduce bundle sizes. New Features with API changes GH-39017: [JS] Add typeId as attribute GH-39257: [JS] LargeBinary GH-15060: [JS] Add LargeUtf8 type GH-39259: [JS] Remove getByteLength GH-39435: [JS] Add Vector.nullable GH-39255: [JS] Allow customization of schema when passing vectors to table constructor GH-37983: [JS] Allow nullable fields in table when constructed from vector with nulls Package changes GH-39289: [JS] Add types to exports Python notes Compatibility notes: Legacy ParquetDataset custom implementation has been removed and only the new dataset API is now in use GH-31303. New features: PyArrow version 14.0.0 included a new specification for Arrow PyCapsules and related dunder methods GH-35531 and now a public RecordBatchReader constructor from stream object implementing the PyCapsule Protocol has been added GH-[39217](https://github.com/apache/arrow/issues/39217) together with some additional documentation GH-[39196](https://github.com/apache/arrow/issues/39196). DLPack protocol support (producer) was added to the Arrow C++ and is exposed in Python through __dlpack__ and __dlpack_device__ dunder methods GH-33984. Python now exposes enabling CRC checksum for read and write operations in Paquet GH-37242. CRC checksum are optional and can detect data corruption. CacheOptions are now configurable from Python as part of the pyarrow.dataset.ParquetFragmentScanOptions GH-36441. Parquet metadata to indicate sort order of the data are now exposed in RowGroupMetaData GH-35331. Parquet Support write and validate Page CRC (GH-37242) Other improvements: Append parameter from FileOutputStream is exposed for the OSFile class GH-38857. File size can be passed to make_fragment in the pyarrow datasets (pyarrow.dataset.FileFormatand pyarrow.dataset.ParquetFileFormat) GH-37857. Support for mask parameter is added to FixedSizeListArray.from_arrays GH-34316 to/from_struct_array are added to the pyarrow.Table class GH-33500. GIL is released in .nbytes which is improving performance when calculating the data size GH-39096. Usage of pandas internals DatetimeTZBlock has been removed GH-38341. DataType instance can be passed to MapType.from_arrays constructor GH-39515. Relevant bug fixes: S3FileSystem equals None segfault has been fixed GH-38535. No-op kernel is added for dictionary_encode(dictionary) GH-34890. PrettyPrint for Timestamp type now adds “Z” at the end of the print string when tz is defined in order to add minimum information about the values being stored in UTC GH-30117. R notes New features: Bindings for base::prod have been added so you can now use it in your dplyr pipelines (i.e., tbl |&gt; summarize(prod(col))) without having to pull the data into R GH-38601. Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL GH-38377. The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow:: which makes the output more portable GH-38144. Other improvements: To make debugging problems easier when using arrow with AWS S3 (e..g, s3_bucket, S3FileSystem), the debug log level for S3 can be set with the AWS_S3_LOG_LEVEL environment variable. See ?S3FileSystem for more information. 
R notes

New features:
- Bindings for base::prod have been added, so you can now use it in your dplyr pipelines (e.g., tbl |&gt; summarize(prod(col))) without having to pull the data into R (GH-38601).
- Calling dimnames or colnames on Dataset objects now returns a useful result rather than just NULL (GH-38377).
- The code() method on Schema objects now takes an optional namespace argument which, when TRUE, prefixes names with arrow::, making the output more portable (GH-38144).

Other improvements:
- To make debugging easier when using arrow with AWS S3 (e.g., s3_bucket, S3FileSystem), the S3 debug log level can be set with the AWS_S3_LOG_LEVEL environment variable; see ?S3FileSystem for more information (GH-38267).
- An error is now thrown, instead of a warning followed by pulling the data into R, when any of sub, gsub, stringr::str_replace, or stringr::str_replace_all is passed a length &gt; 1 vector of values in pattern (GH-39219).
- Missing documentation was added to ?open_dataset describing how to use the ND-JSON support added in arrow 13.0.0 (GH-38258).
- Using arrow with duckdb (i.e., to_duckdb()) no longer results in warnings when quitting your R session (GH-38495).

For more on what’s in the 15.0.0 R package, see the R changelog.

Ruby and C GLib notes

Ruby:
- Add Arrow::Table#each_raw_record and Arrow::RecordBatch#each_raw_record (GH-37137, GH-37600)

C GLib:
- Follow the C++ changes.

Rust notes

The Rust projects have moved to separate repositories outside the main Arrow monorepo. For notes on the latest release of the Rust implementation, see the latest Arrow Rust changelog.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024</title><link href="https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/" rel="alternate" type="text/html" title="Apache Arrow DataFusion 34.0.0 Released, Looking Forward to 2024" /><published>2024-01-19T00:00:00-05:00</published><updated>2024-01-19T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/01/19/datafusion-34.0.0/"><![CDATA[<!--
-->
<h2 id="introduction">Introduction</h2>
<p>We recently <a href="https://crates.io/crates/datafusion/34.0.0">released DataFusion 34.0.0</a>. This blog highlights some of the major
improvements since we <a href="https://arrow.apache.org/blog/2023/06/24/datafusion-25.0.0/">released DataFusion 26.0.0</a> (spoiler alert: there are many)
and a preview of where the community plans to focus in the next 6 months.</p>
<p><a href="https://arrow.apache.org/datafusion/">Apache Arrow DataFusion</a> is an extensible query engine, written in <a href="https://www.rust-lang.org/">Rust</a>, that
uses <a href="https://arrow.apache.org">Apache Arrow</a> as its in-memory format. DataFusion is used by developers to
create new, fast data-centric systems such as databases, dataframe libraries,
machine learning and streaming applications. While <a href="https://arrow.apache.org/datafusion/user-guide/introduction.html#project-goals">DataFusion’s primary design
goal</a> is to accelerate the creation of other data-centric systems, it also offers a
reasonable out-of-the-box experience as a <a href="https://arrow.apache.org/datafusion-python/">dataframe library</a> and
<a href="https://arrow.apache.org/datafusion/user-guide/cli.html">command line SQL tool</a>.</p>
<p>This may also be our last update on the Apache Arrow Site. Future
updates will likely be on the DataFusion website as we are working to <a href="https://github.com/apache/arrow-datafusion/discussions/6475">graduate
to a top level project</a> (Apache Arrow DataFusion → Apache DataFusion!) which
will help focus governance and project growth. Also exciting: our <a href="https://github.com/apache/arrow-datafusion/discussions/8522">first
in-person DataFusion meetup</a> is planned for March 2024.</p>
<p>DataFusion is very much a community endeavor. Our core thesis is that as a
community we can build much more advanced technology than any of us as
individuals or companies could alone. In the last 6 months between <code class="language-plaintext highlighter-rouge">26.0.0</code> and
<code class="language-plaintext highlighter-rouge">34.0.0</code>, community growth has been strong. We accepted and reviewed over a
thousand PRs from 124 different contributors, created over 650 issues and closed 517
of them.
You can find a list of all changes in the detailed <a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md">CHANGELOG</a>.</p>
<!--
$ git log --pretty=oneline 26.0.0..34.0.0 . | wc -l
1009
$ git shortlog -sn 26.0.0..34.0.0 . | wc -l
124
https://crates.io/crates/datafusion/26.0.0
DataFusion 26 released June 7, 2023
https://crates.io/crates/datafusion/34.0.0
DataFusion 34 released Dec 17, 2023
Issues created in this time: 214 open, 437 closed
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+created%3A2023-06-23..2023-12-17
Issues closes: 517
https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+closed%3A2023-06-23..2023-12-17+
PRs merged in this time 908
https://github.com/apache/arrow-datafusion/pulls?q=is%3Apr+merged%3A2023-06-23..2023-12-17
-->
<h1 id="improved-performance-">Improved Performance 🚀</h1>
<p>Performance is a key feature of DataFusion: version <code class="language-plaintext highlighter-rouge">34.0.0</code> is
more than 2x faster on <a href="https://benchmark.clickhouse.com/">ClickBench</a> than version <code class="language-plaintext highlighter-rouge">25.0.0</code>, as shown below:</p>
<!--
Scripts: https://github.com/alamb/datafusion-duckdb-benchmark/tree/datafusion-25-34
Spreadsheet: https://docs.google.com/spreadsheets/d/1FtI3652WIJMC5LmJbLfT3G06w0JQIxEPG4yfMafexh8/edit#gid=1879366976
Average runtime on 25.0.0: 7.2s (for the queries that actually ran)
Average runtime on 34.0.0: 3.6s (for the same queries that ran in 25.0.0)
-->
<figure style="text-align: center;">
<img src="/img/datafusion-34.0.0/compare-new.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview." />
<figcaption>
<b>Figure 1</b>: Performance improvement between <code>25.0.0</code> and <code>34.0.0</code> on ClickBench.
Note that DataFusion <code>25.0.0</code> could not run several queries due to
unsupported SQL (Q9, Q11, Q12, Q14) or memory requirements (Q33).
</figcaption>
</figure>
<figure style="text-align: center;">
<img src="/img/datafusion-34.0.0/compare.png" width="100%" class="img-responsive" alt="Fig 1: Adaptive Arrow schema architecture overview." />
<figcaption>
<b>Figure 2</b>: Total query runtime for DataFusion <code>34.0.0</code> and DataFusion <code>25.0.0</code>.
</figcaption>
</figure>
<p>Here are some specific enhancements we have made to improve performance:</p>
<ul>
<li><a href="https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/">2-3x better aggregation performance with many distinct groups</a></li>
<li>Partially ordered grouping / streaming grouping</li>
<li><a href="https://github.com/apache/arrow-datafusion/pull/7721">Specialized operator for “TopK” <code class="language-plaintext highlighter-rouge">ORDER BY LIMIT XXX</code></a></li>
<li><a href="https://github.com/apache/arrow-datafusion/pull/7192">Specialized operator for <code class="language-plaintext highlighter-rouge">min(col) GROUP BY .. ORDER by min(col) LIMIT XXX</code></a></li>
<li><a href="https://github.com/apache/arrow-datafusion/pull/8126">Improved join performance</a></li>
<li>Eliminate redundant sorting with sort order aware optimizers</li>
</ul>
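<p>To make the “TopK” item above concrete, queries of the following shape now
avoid fully sorting their input. This is a hedged sketch via the Python
bindings; the file and column names are placeholders:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("hits", "hits.parquet")  # placeholder file

# ORDER BY ... LIMIT k runs through the specialized TopK operator,
# which tracks only the current top k rows instead of sorting all input.
ctx.sql("SELECT EventTime FROM hits ORDER BY EventTime DESC LIMIT 10").show()
</code></pre></div></div>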
<h1 id="new-features-">New Features ✨</h1>
<h2 id="dml--insert--creating-files">DML / Insert / Creating Files</h2>
<p>DataFusion now supports writing data in parallel to individual or multiple
files, using the <code class="language-plaintext highlighter-rouge">Parquet</code>, <code class="language-plaintext highlighter-rouge">CSV</code>, <code class="language-plaintext highlighter-rouge">JSON</code>, <code class="language-plaintext highlighter-rouge">ARROW</code>, and user-defined formats.
<a href="https://github.com/apache/arrow-datafusion/pull/7655">Benchmark results</a> show improvements up to 5x in some cases.</p>
<p>As with reading, data can now be written to any <a href="https://docs.rs/object_store/0.9.0/object_store/index.html"><code class="language-plaintext highlighter-rouge">ObjectStore</code></a>
implementation, including AWS S3, Azure Blob Storage, GCP Cloud Storage, local
files, and user defined implementations. While reading from <a href="https://docs.rs/datafusion/latest/datafusion/datasource/listing/struct.ListingTable.html#features">hive style
partitioned tables</a> has long been supported, it is now possible to write to such
tables as well.</p>
<p>For example, to write to a local file:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err"></span> <span class="k">CREATE</span> <span class="k">EXTERNAL</span> <span class="k">TABLE</span> <span class="n">awesome_table</span><span class="p">(</span><span class="n">x</span> <span class="nb">INT</span><span class="p">)</span> <span class="n">STORED</span> <span class="k">AS</span> <span class="n">PARQUET</span> <span class="k">LOCATION</span> <span class="s1">'/tmp/my_awesome_table'</span><span class="p">;</span>
<span class="mi">0</span> <span class="k">rows</span> <span class="k">in</span> <span class="k">set</span><span class="p">.</span> <span class="n">Query</span> <span class="n">took</span> <span class="mi">0</span><span class="p">.</span><span class="mi">003</span> <span class="n">seconds</span><span class="p">.</span>
<span class="err"></span> <span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">awesome_table</span> <span class="k">SELECT</span> <span class="n">x</span> <span class="o">*</span> <span class="mi">10</span> <span class="k">FROM</span> <span class="n">my_source_table</span><span class="p">;</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="o">|</span> <span class="k">count</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="o">|</span> <span class="mi">3</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="mi">1</span> <span class="k">row</span> <span class="k">in</span> <span class="k">set</span><span class="p">.</span> <span class="n">Query</span> <span class="n">took</span> <span class="mi">0</span><span class="p">.</span><span class="mi">024</span> <span class="n">seconds</span><span class="p">.</span>
</code></pre></div></div>
<p>You can also write to files with the <a href="https://arrow.apache.org/datafusion/user-guide/sql/dml.html#copy"><code class="language-plaintext highlighter-rouge">COPY</code></a> statement, similar to <a href="https://duckdb.org/docs/sql/statements/copy.html">DuckDB’s <code class="language-plaintext highlighter-rouge">COPY</code></a>:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err"></span> <span class="k">COPY</span> <span class="p">(</span><span class="k">SELECT</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span> <span class="k">FROM</span> <span class="n">my_source_table</span><span class="p">)</span> <span class="k">TO</span> <span class="s1">'/tmp/output.json'</span><span class="p">;</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="o">|</span> <span class="k">count</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="o">|</span> <span class="mi">3</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-------+</span>
<span class="mi">1</span> <span class="k">row</span> <span class="k">in</span> <span class="k">set</span><span class="p">.</span> <span class="n">Query</span> <span class="n">took</span> <span class="mi">0</span><span class="p">.</span><span class="mi">014</span> <span class="n">seconds</span><span class="p">.</span>
</code></pre></div></div>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> /tmp/output.json
<span class="o">{</span><span class="s2">"x"</span>:1<span class="o">}</span>
<span class="o">{</span><span class="s2">"x"</span>:2<span class="o">}</span>
<span class="o">{</span><span class="s2">"x"</span>:3<span class="o">}</span>
</code></pre></div></div>
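<p>Writing to hive-style partitioned tables follows the same pattern. The
sketch below is hedged: the table, column, and path names are invented, and it
assumes the <code class="language-plaintext highlighter-rouge">34.0.0</code> SQL
dialect as documented:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datafusion import SessionContext

ctx = SessionContext()

# Partition column values become directories, e.g. /tmp/sales/year=2024/...
ctx.sql("""
    CREATE EXTERNAL TABLE sales(amount INT, year INT)
    STORED AS PARQUET
    PARTITIONED BY (year)
    LOCATION '/tmp/sales/'
""")
ctx.sql("INSERT INTO sales VALUES (100, 2023), (250, 2024)").collect()
</code></pre></div></div>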
<h2 id="improved-struct-and-array-support">Improved <code class="language-plaintext highlighter-rouge">STRUCT</code> and <code class="language-plaintext highlighter-rouge">ARRAY</code> support</h2>
<p>DataFusion <code class="language-plaintext highlighter-rouge">34.0.0</code> has much improved <code class="language-plaintext highlighter-rouge">STRUCT</code> and <code class="language-plaintext highlighter-rouge">ARRAY</code>
support, including a full range of <a href="https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#struct-functions">struct functions</a> and <a href="https://arrow.apache.org/datafusion/user-guide/sql/scalar_functions.html#array-functions">array functions</a>.</p>
<!--
❯ create table my_table as values ([1,2,3]), ([2]), ([4,5]);
-->
<p>For example, you can now use <code class="language-plaintext highlighter-rouge">[]</code> syntax and <code class="language-plaintext highlighter-rouge">array_length</code> to access and inspect arrays; the same <code class="language-plaintext highlighter-rouge">[]</code> syntax also reaches into <code class="language-plaintext highlighter-rouge">STRUCT</code> fields (second example below):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err"></span> <span class="k">SELECT</span> <span class="n">column1</span><span class="p">,</span>
<span class="n">column1</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">AS</span> <span class="n">first_element</span><span class="p">,</span>
<span class="n">array_length</span><span class="p">(</span><span class="n">column1</span><span class="p">)</span> <span class="k">AS</span> <span class="n">len</span>
<span class="k">FROM</span> <span class="n">my_table</span><span class="p">;</span>
<span class="o">+</span><span class="c1">-----------+---------------+-----+</span>
<span class="o">|</span> <span class="n">column1</span> <span class="o">|</span> <span class="n">first_element</span> <span class="o">|</span> <span class="n">len</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-----------+---------------+-----+</span>
<span class="o">|</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">3</span> <span class="o">|</span>
<span class="o">|</span> <span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span>
<span class="o">|</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">]</span> <span class="o">|</span> <span class="mi">4</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">-----------+---------------+-----+</span>
</code></pre></div></div>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err"></span> <span class="k">SELECT</span> <span class="n">column1</span><span class="p">,</span> <span class="n">column1</span><span class="p">[</span><span class="s1">'c0'</span><span class="p">]</span> <span class="k">FROM</span> <span class="n">my_table</span><span class="p">;</span>
<span class="o">+</span><span class="c1">------------------+----------------------+</span>
<span class="o">|</span> <span class="n">column1</span> <span class="o">|</span> <span class="n">my_table</span><span class="p">.</span><span class="n">column1</span><span class="p">[</span><span class="n">c0</span><span class="p">]</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">------------------+----------------------+</span>
<span class="o">|</span> <span class="p">{</span><span class="n">c0</span><span class="p">:</span> <span class="n">foo</span><span class="p">,</span> <span class="n">c1</span><span class="p">:</span> <span class="mi">1</span><span class="p">}</span> <span class="o">|</span> <span class="n">foo</span> <span class="o">|</span>
<span class="o">|</span> <span class="p">{</span><span class="n">c0</span><span class="p">:</span> <span class="n">bar</span><span class="p">,</span> <span class="n">c1</span><span class="p">:</span> <span class="mi">2</span><span class="p">}</span> <span class="o">|</span> <span class="n">bar</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">------------------+----------------------+</span>
<span class="mi">2</span> <span class="k">rows</span> <span class="k">in</span> <span class="k">set</span><span class="p">.</span> <span class="n">Query</span> <span class="n">took</span> <span class="mi">0</span><span class="p">.</span><span class="mi">002</span> <span class="n">seconds</span><span class="p">.</span>
</code></pre></div></div>
<h2 id="other-features">Other Features</h2>
<p>Other notable features include:</p>
<ul>
<li>Support aggregating datasets that exceed memory size, with <a href="https://github.com/apache/arrow-datafusion/pull/7400">group by spill to disk</a></li>
<li>All operators now track and limit their memory consumption, including Joins</li>
</ul>
<h1 id="building-systems-is-easier-with-datafusion-️">Building Systems is Easier with DataFusion 🛠️</h1>
<h2 id="documentation">Documentation</h2>
<p>It is easier than ever to get started using DataFusion with the
new <a href="https://arrow.apache.org/datafusion/library-user-guide/index.html">Library Users Guide</a> as well as significantly improved the <a href="https://docs.rs/datafusion/latest/datafusion/index.html">API documentation</a>.</p>
<h2 id="user-defined-window-and-table-functions">User Defined Window and Table Functions</h2>
<p>In addition to DataFusion’s <a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-scalar-udf">User Defined Scalar Functions</a> and <a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-an-aggregate-udf">User Defined Aggregate Functions</a>, DataFusion now supports <a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-window-udf">User Defined Window Functions</a>
and <a href="https://arrow.apache.org/datafusion/library-user-guide/adding-udfs.html#adding-a-user-defined-table-function">User Defined Table Functions</a>.</p>
<p>For example, <a href="https://arrow.apache.org/datafusion/user-guide/cli.html">the <code class="language-plaintext highlighter-rouge">datafusion-cli</code></a> implements a DuckDB style <a href="https://arrow.apache.org/datafusion/user-guide/cli.html#supported-sql"><code class="language-plaintext highlighter-rouge">parquet_metadata</code></a>
function as a user defined table function (<a href="https://github.com/apache/arrow-datafusion/blob/3f219bc929cfd418b0e3d3501f8eba1d5a2c87ae/datafusion-cli/src/functions.rs#L222-L248">source code here</a>):</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err"></span> <span class="k">SELECT</span>
<span class="n">path_in_schema</span><span class="p">,</span> <span class="n">row_group_id</span><span class="p">,</span> <span class="n">row_group_num_rows</span><span class="p">,</span> <span class="n">stats_min</span><span class="p">,</span> <span class="n">stats_max</span><span class="p">,</span> <span class="n">total_compressed_size</span>
<span class="k">FROM</span>
<span class="n">parquet_metadata</span><span class="p">(</span><span class="s1">'hits.parquet'</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">path_in_schema</span> <span class="o">=</span> <span class="s1">'"WatchID"'</span>
<span class="k">LIMIT</span> <span class="mi">3</span><span class="p">;</span>
<span class="o">+</span><span class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
<span class="o">|</span> <span class="n">path_in_schema</span> <span class="o">|</span> <span class="n">row_group_id</span> <span class="o">|</span> <span class="n">row_group_num_rows</span> <span class="o">|</span> <span class="n">stats_min</span> <span class="o">|</span> <span class="n">stats_max</span> <span class="o">|</span> <span class="n">total_compressed_size</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
<span class="o">|</span> <span class="nv">"WatchID"</span> <span class="o">|</span> <span class="mi">0</span> <span class="o">|</span> <span class="mi">450560</span> <span class="o">|</span> <span class="mi">4611687214012840539</span> <span class="o">|</span> <span class="mi">9223369186199968220</span> <span class="o">|</span> <span class="mi">3883759</span> <span class="o">|</span>
<span class="o">|</span> <span class="nv">"WatchID"</span> <span class="o">|</span> <span class="mi">1</span> <span class="o">|</span> <span class="mi">612174</span> <span class="o">|</span> <span class="mi">4611689135232456464</span> <span class="o">|</span> <span class="mi">9223371478009085789</span> <span class="o">|</span> <span class="mi">5176803</span> <span class="o">|</span>
<span class="o">|</span> <span class="nv">"WatchID"</span> <span class="o">|</span> <span class="mi">2</span> <span class="o">|</span> <span class="mi">344064</span> <span class="o">|</span> <span class="mi">4611692774829951781</span> <span class="o">|</span> <span class="mi">9223363791697310021</span> <span class="o">|</span> <span class="mi">3031680</span> <span class="o">|</span>
<span class="o">+</span><span class="c1">----------------+--------------+--------------------+---------------------+---------------------+-----------------------+</span>
<span class="mi">3</span> <span class="k">rows</span> <span class="k">in</span> <span class="k">set</span><span class="p">.</span> <span class="n">Query</span> <span class="n">took</span> <span class="mi">0</span><span class="p">.</span><span class="mi">053</span> <span class="n">seconds</span><span class="p">.</span>
</code></pre></div></div>
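<p>The Python bindings expose similar extension points. As the simplest
flavor, here is a hedged scalar-UDF sketch (signature per our reading of the
<code class="language-plaintext highlighter-rouge">datafusion</code> Python
package; the window and table variants in the linked guide are Rust APIs):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow as pa
import pyarrow.compute as pc
from datafusion import SessionContext, udf

# A scalar UDF receives Arrow arrays and returns an Arrow array.
def add_one(arr):
    return pc.add(arr, 1)

add_one_udf = udf(add_one, [pa.int64()], pa.int64(), "stable", "add_one")

ctx = SessionContext()
ctx.register_udf(add_one_udf)
ctx.sql("SELECT add_one(41)").show()
</code></pre></div></div>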
<h3 id="growth-of-datafusion-">Growth of DataFusion 📈</h3>
<p>DataFusion has been appearing more publicly in the wild. For example:
<ul>
<li>New projects built using DataFusion such as <a href="https://lancedb.com/">lancedb</a>, <a href="https://glaredb.com/">GlareDB</a>, <a href="https://www.arroyo.dev/">Arroyo</a>, and <a href="https://github.com/cmu-db/optd">optd</a>.</li>
<li>Public talks such as <a href="https://www.youtube.com/watch?v=AJU9rdRNk9I">Apache Arrow Datafusion: Vectorized
Execution Framework For Maximum Performance</a> at <a href="https://www.bagevent.com/event/8432178">CommunityOverCode Asia 2023</a></li>
<li>Blog posts such as <a href="https://www.synnada.ai/blog/apache-arrow-arrow-datafusion-ai-native-data-infra-an-interview-with-our-ceo-ozan">Apache Arrow, Arrow/DataFusion, AI-native Data Infra</a>,
<a href="https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/">Flight, DataFusion, Arrow, and Parquet: Using the FDAP Architecture to build InfluxDB 3.0</a>, and
<a href="https://www.linkedin.com/pulse/guide-user-defined-functions-apache-arrow-datafusion-dade-aderemi/">A Guide to User-Defined Functions in Apache Arrow DataFusion</a></li>
</ul>
<p>We have also <a href="https://github.com/apache/arrow-datafusion/issues/6782">submitted a paper</a> to <a href="https://2024.sigmod.org/">SIGMOD 2024</a>, one of the
premier database conferences, describing DataFusion in a formal technical
style and making the case that it is possible to create a modular and extensible query engine
without sacrificing performance. We hope this paper helps people
evaluating DataFusion for their needs understand it better.</p>
<h1 id="datafusion-in-2024-">DataFusion in 2024 🥳</h1>
<p>Some of the major initiatives we know contributors plan to pursue this year are:</p>
<ol>
<li>
<p><em>Modularity</em>: Make DataFusion even more modular, such as <a href="https://github.com/apache/arrow-datafusion/issues/8045">unifying
built-in and user functions</a>, making it easier to customize
DataFusion’s behavior.</p>
</li>
<li>
<p><em>Community Growth</em>: Graduate to our own top level Apache project, and
subsequently add more committers and PMC members to keep pace with project
growth.</p>
</li>
<li>
<p><em>Use case white papers</em>: Write blog posts and videos explaining
how to use DataFusion for real-world use cases.</p>
</li>
<li>
<p><em>Testing</em>: Improve CI infrastructure and test coverage, more fuzz
testing, and better functional and performance regression testing.</p>
</li>
<li>
<p><em>Planning Time</em>: Reduce the time taken to plan queries, both for <a href="https://github.com/apache/arrow-datafusion/issues/7698">wide
tables of 1000s of columns</a> and in <a href="https://github.com/apache/arrow-datafusion/issues/5637">general</a>.</p>
</li>
<li>
<p><em>Aggregate Performance</em>: Improve the speed of <a href="https://github.com/apache/arrow-datafusion/issues/7000">aggregating “high cardinality”</a> data
when there are many (e.g., millions of) distinct groups.</p>
</li>
<li>
<p><em>Statistics</em>: <a href="https://github.com/apache/arrow-datafusion/issues/8227">Improved statistics handling</a> with an eye towards more
sophisticated expression analysis and cost models.</p>
</li>
</ol>
<h1 id="how-to-get-involved">How to Get Involved</h1>
<p>If you are interested in contributing to DataFusion we would love to have you
join us. You can try out DataFusion on some of your own data and projects and
let us know how it goes; contribute suggestions, bug reports, or a PR with
documentation, tests, or code. A list of open issues
suitable for beginners is <a href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22">here</a>.</p>
<p>As the community grows, we are also looking to restart biweekly calls /
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of
the methods listed in our <a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">Communication Doc</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Apache Arrow ADBC 0.9.0 (Libraries) Release</title><link href="https://arrow.apache.org/blog/2024/01/08/adbc-0.9.0-release/" rel="alternate" type="text/html" title="Apache Arrow ADBC 0.9.0 (Libraries) Release" /><published>2024-01-08T00:00:00-05:00</published><updated>2024-01-08T00:00:00-05:00</updated><id>https://arrow.apache.org/blog/2024/01/08/adbc-0.9.0-release</id><content type="html" xml:base="https://arrow.apache.org/blog/2024/01/08/adbc-0.9.0-release/"><![CDATA[<!--
-->
<p>The Apache Arrow team is pleased to announce the 0.9.0 release of
the Apache Arrow ADBC libraries. This release includes <a href="https://github.com/apache/arrow-adbc/milestone/13"><strong>34
resolved issues</strong></a> from <a href="#contributors"><strong>16 distinct contributors</strong></a>.</p>
<p>This is a release of the <strong>libraries</strong>, which are at version
0.9.0. The <strong>API specification</strong> is versioned separately and is
at version 1.1.0.</p>
<p>The release notes below are not exhaustive and only expose selected
highlights of the release. Many other bugfixes and improvements have
been made: we refer you to the <a href="https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-0.9.0/CHANGELOG.md">complete changelog</a>.</p>
<h2 id="release-highlights">Release Highlights</h2>
<p>The C#/.NET implementation is gearing up for a proper release and has had
various bugfixes in the core libraries, the drivers, and the ADO.NET wrapper.</p>
<p>The Go implementation now supports executing more methods (like
<code class="language-plaintext highlighter-rouge">ConnectionCommit</code>) when importing a native C/C++ driver.</p>
<p>The PostgreSQL driver can now write decimal values and dictionary-encoded
string/bytestring columns. Also, <code class="language-plaintext highlighter-rouge">AdbcConnectionGetTableSchema</code> was fixed to handle
the catalog and schema parameters correctly, and the driver now tries to
provide a row count for non-select queries.</p>
<p>The Python bindings now accept a PyArrow Dataset as a data source for bulk
ingestion. The new <a href="https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html">PyCapsule interface</a> is supported. It is now
possible to build the Python packages via CMake, which simplifies the
mixed C++-and-Python build process.</p>
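<p>For example, a <code class="language-plaintext highlighter-rouge">pyarrow.dataset.Dataset</code>
can be handed directly to the bulk-ingestion helper. This is a minimal, hedged
sketch using the SQLite driver; the dataset path is a placeholder:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pyarrow.dataset as ds
import adbc_driver_sqlite.dbapi

# A directory of Parquet files, streamed in batch by batch rather than
# materialized as a single Table.
dataset = ds.dataset("/path/to/parquet_dir", format="parquet")  # placeholder

with adbc_driver_sqlite.dbapi.connect() as conn:
    cur = conn.cursor()
    cur.adbc_ingest("my_table", dataset, mode="create")
    conn.commit()
</code></pre></div></div>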
<p>The R bindings reference-count objects to try to detect improper usage and
avoid crashes.</p>
<p>The Snowflake driver properly handles escaping and case sensitivity in various
metadata methods. It also now supports <code class="language-plaintext highlighter-rouge">StatementExecuteSchema</code>.</p>
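<p>From Python this surfaces roughly as follows (a hedged sketch: the
connection URI is a placeholder, and we assume the DB-API cursor’s
<code class="language-plaintext highlighter-rouge">adbc_execute_schema</code>
helper maps to it):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import adbc_driver_snowflake.dbapi

# Placeholder URI; real connections need account credentials.
with adbc_driver_snowflake.dbapi.connect("snowflake://...") as conn:
    cur = conn.cursor()
    # Fetch the result schema without executing the query
    # (StatementExecuteSchema under the hood).
    print(cur.adbc_execute_schema("SELECT 1 AS x"))
</code></pre></div></div>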
<p>The Javadoc API reference is now included in the documentation.</p>
<h2 id="contributors">Contributors</h2>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-0.8.0..apache-arrow-adbc-0.9.0
24 David Li
10 Curt Hagenlocher
10 Dewey Dunnington
7 davidhcoe
4 William Ayd
3 Ryan Syed
3 vleslief-ms
2 Daniel Espinosa
2 Joel Lubinitsky
2 Sutou Kouhei
1 AnithaPanduranganMS
1 Joris Van den Bossche
1 Matt Topol
1 Ruoxuan Wang
1 William
1 rtadepalli
</code></pre></div></div>
<h2 id="roadmap">Roadmap</h2>
<p>The next release may increase the minimum required C++ standard to C++17,
which will line up with the mainline Arrow project
(<a href="https://github.com/apache/arrow-adbc/issues/1431">GH-1431</a>).</p>
<h2 id="getting-involved">Getting Involved</h2>
<p>We welcome questions and contributions from anyone interested. Issues
can be filed on <a href="https://github.com/apache/arrow-adbc/issues">GitHub</a>, and questions can be directed to GitHub
or the <a href="/community/">Arrow mailing lists</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased to announce the 0.9.0 release of the Apache Arrow ADBC libraries. This covers includes 34 resolved issues from 16 distinct contributors. This is a release of the libraries, which are at version 0.9.0. The API specification is versioned separately and is at version 1.1.0. The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog. Release Highlights The C#/.NET implementation is gearing up for a proper release and has had various bugfixes in the core libraries, the drivers, and the ADO.NET wrapper. The Go implementation now supports executing more methods (like ConnectionCommit) when importing a native C/C++ driver. The PostgreSQL driver can now write decimal values and dictionary-encoded string/bytestring. Also, AdbcConnectionGetTableSchema was fixed to handle the catalog and schema parameters correctly, and the driver now tries to provide a row count for non-select queries. The Python bindings now accept a PyArrow Dataset as a data source for bulk ingestion. The new PyCapsule interface is supported. It is possible to build the Python packages via CMake, which simplifies handling the mixed-C++-and-Python builds. The R bindings reference-count objects to try to detect improper usage and avoid crashes. The Snowflake driver properly handles escaping and case sensitivity in various metadata methods. It also now supports StatementExecuteSchema. The Javadoc API reference is now included in the documentation. Contributors $ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-0.8.0..apache-arrow-adbc-0.9.0 24 David Li 10 Curt Hagenlocher 10 Dewey Dunnington 7 davidhcoe 4 William Ayd 3 Ryan Syed 3 vleslief-ms 2 Daniel Espinosa 2 Joel Lubinitsky 2 Sutou Kouhei 1 AnithaPanduranganMS 1 Joris Van den Bossche 1 Matt Topol 1 Ruoxuan Wang 1 William 1 rtadepalli Roadmap The next release may increase the minimum required C++ revision to C++17, which will line up with the mainline Arrow project (GH-1431). Getting Involved We welcome questions and contributions from all interested. Issues can be filed on GitHub, and questions can be directed to GitHub or the Arrow mailing lists.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" /><media:content medium="image" url="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>