layout: post title: Apache DataFusion Python 44.0.0 Released date: 2025-02-07 author: timsaucer categories: [release]

We are happy to announce that datafusion-python 44.0.0 has been released. This release brings in all of the new features of the core DataFusion 44.0.0 library. You can see the full details of the improvements in the changelogs.

Asynchronous Iteration of Record Batches

Retrieving a RecordBatch from a RecordBatchStream was a synchronous call, which would require the end user's code to wait for the data retrieval. This is described in Issue 974. We continue to support this as a synchronous iterator, but we have also added in the ability to retrieve the RecordBatch using the Python asynchronous anext function.

Default Compression for Parquet files

With PR 981, we change the saving of Parquet files to use zstd compression by default. Previously the default was uncompressed, causing excessive disk storage. Zstd is an excellent compression scheme that balances speed and compression ratio. Users can still save their Parquet files uncompressed by passing in the appropriate value to the compression argument when calling DataFrame.write_parquet.

uv package management

uv is an extremely fast Python package manager, written in Rust. In the previous version of datafusion-python we had a combination of settings of PyPi and Conda. Instead, we switch to using uv is our primary method for dependency management.

For most users of DataFusion, this change will be transparent. You can still install via pip or conda. For developers, the instructions in the repository have been updated.

Migration Guide

During the upgrade from DataFusion 43.0.0 to DataFusion 44.0.0 as our upstream core dependency, we discovered a few changes were necessary within our repository and our unit tests. These notes serve to help guide users who may encounter similar issues when upgrading.

  • RuntimeConfig is now deprecated in favor of RuntimeEnvBuilder. The migration is fairly straightforward, and the corresponding classes have been marked as deprecated. For end users it should be simply a matter of changing the class name.
  • If you perform a concat of a string_view and string, it will now return a string_view instead of a string. This likely only impacts unit tests that are validating return types. In general, it is recommended to switch to using string_view whenever possible. You can see the blog articles String View Pt 1 and Pt 2 for more information on these performance improvements.
  • The function date_part now returns an int32 instead of a float64. This is likely only impactful to unit tests.

Appreciation

We would like to thank everyone who has helped with these releases through their helpful conversations, code review, issue descriptions, and code authoring. We would especially like to thank the following authors of PRs who made these releases possible, listed in alphabetical order by username: @chenkovsky, @ion-elgreco, @kylebarron, and @kosiew.

Thank you!

Get Involved

The DataFusion Python team is an active and engaging community and we would love to have you join us and help the project.

Here are some ways to get involved: