<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Upgrade Guides
## DataFusion `51.0.0`
**Note:** DataFusion `51.0.0` has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version.
You can see the current [status of the `51.0.0` release here](https://github.com/apache/datafusion/issues/17558).
### Upgrade to arrow `57.0.0` and parquet `57.0.0`
This version of DataFusion upgrades the underlying Apache Arrow implementation
to version `57.0.0`, along with several dependent crates such as `prost`,
`tonic`, `pyo3`, and `substrait`. See the [release
notes](https://github.com/apache/arrow-rs/releases/tag/57.0.0) for more details.
### `MSRV` updated to 1.88.0
The Minimum Supported Rust Version (MSRV) has been updated to [`1.88.0`].
[`1.88.0`]: https://releases.rs/docs/1.88.0/
### `FunctionRegistry` exposes two additional methods
`FunctionRegistry` exposes two additional methods, `udafs` and `udwfs`, which return the set of registered user defined aggregate and window function names. To upgrade, implement methods returning the set of registered function names:
```diff
impl FunctionRegistry for FunctionRegistryImpl {
fn udfs(&self) -> HashSet<String> {
self.scalar_functions.keys().cloned().collect()
}
+ fn udafs(&self) -> HashSet<String> {
+ self.aggregate_functions.keys().cloned().collect()
+ }
+
+ fn udwfs(&self) -> HashSet<String> {
+ self.window_functions.keys().cloned().collect()
+ }
}
```
### `datafusion-proto` uses `TaskContext` rather than `SessionContext` in physical plan serde methods
There have been changes in the public API methods of `datafusion-proto` which handle physical plan serde.
Methods like `physical_plan_from_bytes`, `parse_physical_expr`, and similar now expect a `TaskContext` instead of a `SessionContext`:
```diff
- let plan2 = physical_plan_from_bytes(&bytes, &ctx)?;
+ let plan2 = physical_plan_from_bytes(&bytes, &ctx.task_ctx())?;
```
Because `TaskContext` contains a `RuntimeEnv`, methods such as `try_into_physical_plan` no longer take an explicit `RuntimeEnv` parameter:
```diff
let result_exec_plan: Arc<dyn ExecutionPlan> = proto
- .try_into_physical_plan(&ctx, runtime.deref(), &composed_codec)
+ .try_into_physical_plan(&ctx.task_ctx(), &composed_codec)
```
`PhysicalExtensionCodec::try_decode()` expects `TaskContext` instead of `FunctionRegistry`:
```diff
pub trait PhysicalExtensionCodec {
fn try_decode(
&self,
buf: &[u8],
inputs: &[Arc<dyn ExecutionPlan>],
- registry: &dyn FunctionRegistry,
+ ctx: &TaskContext,
) -> Result<Arc<dyn ExecutionPlan>>;
```
See [issue #17601] for more details.
[issue #17601]: https://github.com/apache/datafusion/issues/17601
### `SessionState`'s `sql_to_statement` method takes `Dialect` rather than a `str`
The `dialect` parameter of the `sql_to_statement` method defined in `datafusion::execution::session_state::SessionState`
has changed from `&str` to `&Dialect`.
`Dialect` is an enum defined in the `config` module of the `datafusion-common`
crate that provides type safety
and better validation for SQL dialect selection.
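For example, a call that previously passed the dialect as a string now passes a reference to the enum. This is a hedged sketch; the exact variant names (such as `Generic`) are assumptions and should be checked against `datafusion_common::config::Dialect`:
```diff
- let statement = state.sql_to_statement(sql, "generic")?;
+ let statement = state.sql_to_statement(sql, &Dialect::Generic)?;
```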
### Reorganization of `ListingTable` into `datafusion-catalog-listing` crate
There has been a long-standing request to remove features such as `ListingTable`
from the `datafusion` crate to support faster build times. The structs
`ListingOptions`, `ListingTable`, and `ListingTableConfig` are now available
within the `datafusion-catalog-listing` crate. These are re-exported in
the `datafusion` crate, so this change should have minimal impact on existing users.
See [issue #14462] and [issue #17713] for more details.
[issue #14462]: https://github.com/apache/datafusion/issues/14462
[issue #17713]: https://github.com/apache/datafusion/issues/17713
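If you prefer to depend on the new crate directly, the imports would look like the following sketch (assuming the structs are exported at the crate root; the re-exports from the `datafusion` crate continue to work, so this is optional):
```rust
# /* comment to avoid running
use datafusion_catalog_listing::{ListingOptions, ListingTable, ListingTableConfig};
# */
```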
### Reorganization of `ArrowSource` into `datafusion-datasource-arrow` crate
To support [issue #17713] the `ArrowSource` code has been moved out of
the `datafusion` core crate into its own crate, `datafusion-datasource-arrow`.
This follows the pattern for the AVRO, CSV, JSON, and Parquet data sources.
Users may need to update their paths to account for these changes.
See [issue #17713] for more details.
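A typical path update might look like the following hedged sketch; both the old and new import paths shown here are assumptions, so rely on compiler errors or the crate documentation for the exact paths:
```diff
-use datafusion::datasource::physical_plan::ArrowSource;
+use datafusion_datasource_arrow::ArrowSource;
```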
### `FileScanConfig::projection` renamed to `FileScanConfig::projection_exprs`
The `projection` field in `FileScanConfig` has been renamed to `projection_exprs` and its type has changed from `Option<Vec<usize>>` to `Option<ProjectionExprs>`. This change enables more powerful projection pushdown capabilities by supporting arbitrary physical expressions rather than just column indices.
**Impact on direct field access:**
If you directly access the `projection` field:
```rust,ignore
let config: FileScanConfig = ...;
let projection = config.projection;
```
You should update to:
```rust,ignore
let config: FileScanConfig = ...;
let projection_exprs = config.projection_exprs;
```
**Impact on builders:**
The `FileScanConfigBuilder::with_projection()` method has been deprecated in favor of `with_projection_indices()`:
```diff
let config = FileScanConfigBuilder::new(url, schema, file_source)
- .with_projection(Some(vec![0, 2, 3]))
+ .with_projection_indices(Some(vec![0, 2, 3]))
.build();
```
Note: `with_projection()` still works but is deprecated and will be removed in a future release.
**What is `ProjectionExprs`?**
`ProjectionExprs` is a new type that represents a list of physical expressions for projection. While it can be constructed from column indices (which is what `with_projection_indices` does internally), it also supports arbitrary physical expressions, enabling advanced features like expression evaluation during scanning.
You can access column indices from `ProjectionExprs` using its methods if needed:
```rust,ignore
let projection_exprs: ProjectionExprs = ...;
// Get the column indices if the projection only contains simple column references
let indices = projection_exprs.column_indices();
```
### `DESCRIBE query` support
`DESCRIBE query` was previously an alias for `EXPLAIN query`, which outputs the
_execution plan_ of the query. With this release, `DESCRIBE query` now outputs
the computed _schema_ of the query, consistent with the behavior of `DESCRIBE table_name`.
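For example, the following statement now returns the column names, data types, and nullability of the query result rather than its execution plan:
```sql
DESCRIBE SELECT 1 AS id, 'foo' AS name;
```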
### `datafusion.execution.time_zone` default configuration changed
The default value for `datafusion.execution.time_zone` was previously the string value `+00:00` (GMT/Zulu time).
This has been changed to an `Option<String>` with a default of `None`. If you want to change the timezone back
to the previous value you can execute the following SQL:
```sql
SET TIMEZONE = '+00:00';
```
This change was made to better support using the default timezone in scalar UDFs such as
`now`, `current_date`, `current_time`, and `to_timestamp`, among others.
### Introduction of `TableSchema` and changes to `FileSource::with_schema()` method
A new `TableSchema` struct has been introduced in the `datafusion-datasource` crate to better manage table schemas with partition columns. This struct helps distinguish between:
- **File schema**: The schema of actual data files on disk
- **Partition columns**: Columns derived from directory structure (e.g., Hive-style partitioning)
- **Table schema**: The complete schema combining both file and partition columns
As part of this change, the `FileSource::with_schema()` method signature has changed from accepting a `SchemaRef` to accepting a `TableSchema`.
**Who is affected:**
- Users who have implemented custom `FileSource` implementations will need to update their code
- Users who only use built-in file sources (Parquet, CSV, JSON, AVRO, Arrow) are not affected
**Migration guide for custom `FileSource` implementations:**
```diff
use datafusion_datasource::file::FileSource;
-use arrow::datatypes::SchemaRef;
+use datafusion_datasource::TableSchema;
impl FileSource for MyCustomSource {
- fn with_schema(&self, schema: SchemaRef) -> Arc<dyn FileSource> {
+ fn with_schema(&self, schema: TableSchema) -> Arc<dyn FileSource> {
Arc::new(Self {
- schema: Some(schema),
+ // Use schema.file_schema() to get the file schema without partition columns
+ schema: Some(Arc::clone(schema.file_schema())),
..self.clone()
})
}
}
```
For implementations that need access to partition columns:
```rust,ignore
fn with_schema(&self, schema: TableSchema) -> Arc<dyn FileSource> {
Arc::new(Self {
file_schema: Arc::clone(schema.file_schema()),
partition_cols: schema.table_partition_cols().clone(),
table_schema: Arc::clone(schema.table_schema()),
..self.clone()
})
}
```
**Note**: Most `FileSource` implementations only need to store the file schema (without partition columns), as shown in the first example. The second pattern of storing all three schema components is typically only needed for advanced use cases where you need access to different schema representations for different operations (e.g., ParquetSource uses the file schema for building pruning predicates but needs the table schema for filter pushdown logic).
**Using `TableSchema` directly:**
If you're constructing a `FileScanConfig` or working with table schemas and partition columns, you can now use `TableSchema`:
```rust
use datafusion_datasource::TableSchema;
use arrow::datatypes::{Schema, Field, DataType};
use std::sync::Arc;
// Create a TableSchema with partition columns
let file_schema = Arc::new(Schema::new(vec![
Field::new("user_id", DataType::Int64, false),
Field::new("amount", DataType::Float64, false),
]));
let partition_cols = vec![
Arc::new(Field::new("date", DataType::Utf8, false)),
Arc::new(Field::new("region", DataType::Utf8, false)),
];
let table_schema = TableSchema::new(file_schema, partition_cols);
// Access different schema representations
let file_schema_ref = table_schema.file_schema(); // Schema without partition columns
let full_schema = table_schema.table_schema(); // Complete schema with partition columns
let partition_cols_ref = table_schema.table_partition_cols(); // Just the partition columns
```
### `AggregateUDFImpl::is_ordered_set_aggregate` has been renamed to `AggregateUDFImpl::supports_within_group_clause`
This method has been renamed to better reflect the actual impact it has for aggregate UDF implementations.
The accompanying `AggregateUDF::is_ordered_set_aggregate` has also been renamed to `AggregateUDF::supports_within_group_clause`.
No functionality has been changed with regards to this method; it still refers only to permitting use of `WITHIN GROUP`
SQL syntax for the aggregate function.
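If your aggregate UDF overrode this method, the upgrade is a simple rename (the struct name below is illustrative):
```diff
impl AggregateUDFImpl for MyOrderedSetAggregate {
-    fn is_ordered_set_aggregate(&self) -> bool {
+    fn supports_within_group_clause(&self) -> bool {
         true
     }
 }
```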
## DataFusion `50.0.0`
### ListingTable automatically detects Hive Partitioned tables
DataFusion 50.0.0 automatically infers Hive partitions when using the `ListingTableFactory` and `CREATE EXTERNAL TABLE`. Previously,
when creating a `ListingTable`, datasets that use Hive partitioning (e.g.
`/table_root/column1=value1/column2=value2/data.parquet`) would not have the Hive columns reflected in
the table's schema or data. The previous behavior can be
restored by setting the `datafusion.execution.listing_table_factory_infer_partitions` configuration option to `false`.
See [issue #17049] for more details.
[issue #17049]: https://github.com/apache/datafusion/issues/17049
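For example, to restore the previous behavior via SQL:
```sql
SET datafusion.execution.listing_table_factory_infer_partitions = false;
```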
### `MSRV` updated to 1.86.0
The Minimum Supported Rust Version (MSRV) has been updated to [`1.86.0`].
See [#17230] for details.
[`1.86.0`]: https://releases.rs/docs/1.86.0/
[#17230]: https://github.com/apache/datafusion/pull/17230
### `ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits now require `PartialEq`, `Eq`, and `Hash` traits
To address the error-proneness of the `ScalarUDFImpl::equals`, `AggregateUDFImpl::equals` and
`WindowUDFImpl::equals` methods and to make it easy to implement function equality correctly,
the `equals` and `hash_value` methods have been removed from the `ScalarUDFImpl`, `AggregateUDFImpl`
and `WindowUDFImpl` traits. They are replaced by the requirement to implement the `PartialEq`, `Eq`,
and `Hash` traits on any type implementing `ScalarUDFImpl`, `AggregateUDFImpl` or `WindowUDFImpl`.
Please see [issue #16677] for more details.
Most of the scalar functions are stateless and have a `signature` field. These can be migrated
using regular expressions (the resulting change is shown in the diff after this list):
- search for `\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\})`,
- replace with `#[derive(Debug, PartialEq, Eq, Hash)]$1`,
- review all the changes and make sure only function structs were changed.
[issue #16677]: https://github.com/apache/datafusion/issues/16677
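For a typical stateless function struct, the change amounts to the following (the struct name is illustrative):
```diff
-#[derive(Debug)]
+#[derive(Debug, PartialEq, Eq, Hash)]
 struct MyScalarUdf {
     signature: Signature,
 }
```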
### `AsyncScalarUDFImpl::invoke_async_with_args` returns `ColumnarValue`
In order to enable single value optimizations and be consistent with other
user defined function APIs, the `AsyncScalarUDFImpl::invoke_async_with_args` method now
returns a `ColumnarValue` instead of an `ArrayRef`.
To upgrade, change the return type of your implementation from this:
```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ArrayRef> {
..
return array_ref; // old code
}
}
# */
```
To this, returning a `ColumnarValue`:
```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ColumnarValue> {
..
return ColumnarValue::from(array_ref); // new code
}
}
# */
```
See [#16896](https://github.com/apache/datafusion/issues/16896) for more details.
### `ProjectionExpr` changed from type alias to struct
`ProjectionExpr` has been changed from a type alias to a struct with named fields to improve code clarity and maintainability.
**Before:**
```rust,ignore
pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String);
```
**After:**
```rust,ignore
#[derive(Debug, Clone)]
pub struct ProjectionExpr {
pub expr: Arc<dyn PhysicalExpr>,
pub alias: String,
}
```
To upgrade your code:
- Replace tuple construction `(expr, alias)` with `ProjectionExpr::new(expr, alias)` or `ProjectionExpr { expr, alias }`
- Replace tuple field access `.0` and `.1` with `.expr` and `.alias`
- Update pattern matching from `(expr, alias)` to `ProjectionExpr { expr, alias }`
This mainly impacts use of `ProjectionExec`.
This change was done in [#17398]
[#17398]: https://github.com/apache/datafusion/pull/17398
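A minimal sketch of the mechanical changes described above (variable names are illustrative):
```diff
- let proj: ProjectionExpr = (Arc::clone(&expr), "name".to_string());
- let alias = &proj.1;
+ let proj = ProjectionExpr { expr: Arc::clone(&expr), alias: "name".to_string() };
+ let alias = &proj.alias;
```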
### `SessionState`, `SessionConfig`, and `OptimizerConfig` return `&Arc<ConfigOptions>` instead of `&ConfigOptions`
To provide broader access to `ConfigOptions` and reduce required clones, some
APIs have been changed to return a `&Arc<ConfigOptions>` instead of a
`&ConfigOptions`. This allows sharing the same `ConfigOptions` across multiple
threads without needing to clone the entire `ConfigOptions` structure unless it
is modified.
Most users will not be impacted by this change since the Rust compiler typically
dereferences the `Arc` automatically when needed. However, in some cases you may
have to change your code to explicitly call `as_ref()`, for example from:
```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options();
# */
```
To
```rust
# /* comment to avoid running
let optimizer_config: &ConfigOptions = state.options().as_ref();
# */
```
See PR [#16970](https://github.com/apache/datafusion/pull/16970)
### API Change to `AsyncScalarUDFImpl::invoke_async_with_args`
The `invoke_async_with_args` method of the `AsyncScalarUDFImpl` trait has been
updated to remove the `_option: &ConfigOptions` parameter to simplify the API
now that the `ConfigOptions` can be accessed through the `ScalarFunctionArgs`
parameter.
You can change your code like this
```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
_option: &ConfigOptions,
) -> Result<ArrayRef> {
..
}
...
}
# */
```
To this:
```rust
# /* comment to avoid running
impl AsyncScalarUDFImpl for AskLLM {
async fn invoke_async_with_args(
&self,
args: ScalarFunctionArgs,
) -> Result<ArrayRef> {
let options = &args.config_options;
..
}
...
}
# */
```
### Schema Rewriter Module Moved to New Crate
The `schema_rewriter` module and its associated symbols have been moved from `datafusion_physical_expr` to a new crate `datafusion_physical_expr_adapter`. This affects the following symbols:
- `DefaultPhysicalExprAdapter`
- `DefaultPhysicalExprAdapterFactory`
- `PhysicalExprAdapter`
- `PhysicalExprAdapterFactory`
To upgrade, change your imports to:
```rust
use datafusion_physical_expr_adapter::{
DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory,
PhysicalExprAdapter, PhysicalExprAdapterFactory
};
```
### Upgrade to arrow `56.0.0` and parquet `56.0.0`
This version of DataFusion upgrades the underlying Apache Arrow implementation
to version `56.0.0`. See the [release notes](https://github.com/apache/arrow-rs/releases/tag/56.0.0)
for more details.
### Added `ExecutionPlan::reset_state`
In order to fix a bug in DataFusion `49.0.0` where dynamic filters (currently only generated in the presence of a query such as `ORDER BY ... LIMIT ...`)
produced incorrect results in recursive queries, a new method `reset_state` has been added to the `ExecutionPlan` trait.
Any `ExecutionPlan` that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state.
See [#17028] for more details and an example implementation for `SortExec`.
[#17028]: https://github.com/apache/datafusion/pull/17028
### Nested Loop Join input sort order cannot be preserved
The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. In micro-benchmarks this change shows up to a 5x speed-up and, in extreme cases, uses only 1% of the memory of the previous implementation.
However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation.
See [#16996] for details.
[#16996]: https://github.com/apache/datafusion/pull/16996
### Add `as_any()` method to `LazyBatchGenerator`
To help with protobuf serialization, the `as_any()` method has been added to the `LazyBatchGenerator` trait. This means you will need to add `as_any()` to your implementation of `LazyBatchGenerator`:
```rust
# /* comment to avoid running
impl LazyBatchGenerator for MyBatchGenerator {
fn as_any(&self) -> &dyn Any {
self
}
...
}
# */
```
See [#17200](https://github.com/apache/datafusion/pull/17200) for details.
### Refactored `DataSource::try_swapping_with_projection`
We refactored `DataSource::try_swapping_with_projection` to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer.
Reimplementing it for any custom `DataSource` should be relatively straightforward; see [#17395] for more details.
[#17395]: https://github.com/apache/datafusion/pull/17395/
### `FileOpenFuture` now uses `DataFusionError` instead of `ArrowError`
The `FileOpenFuture` type alias has been updated to use `DataFusionError` instead of `ArrowError` for its error type. This change affects the `FileOpener` trait and any implementations that work with file streaming operations.
**Before:**
```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>;
```
**After:**
```rust,ignore
pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>;
```
If you have custom implementations of `FileOpener` or work directly with `FileOpenFuture`, you'll need to update your error handling to use `DataFusionError` instead of `ArrowError`. The `FileStreamState` enum's `Open` variant has also been updated accordingly. See [#17397] for more details.
[#17397]: https://github.com/apache/datafusion/pull/17397
### FFI user defined aggregate function signature change
The Foreign Function Interface (FFI) signature for user defined aggregate functions
has been updated to call `return_field` instead of `return_type` on the underlying
aggregate function. This is to support metadata handling with these aggregate functions.
This change should be transparent to most users. If you have written unit tests to call
`return_type` directly, you may need to change them to call `return_field` instead.
This update is a breaking change to the FFI API. The current best practice when using the
FFI crate is to ensure that all libraries that are interacting are using the same
underlying Rust version. Issue [#17374] has been opened to discuss stabilization of
this interface so that these libraries can be used across different DataFusion versions.
See [#17407] for details.
[#17407]: https://github.com/apache/datafusion/pull/17407
[#17374]: https://github.com/apache/datafusion/issues/17374
### Added `PhysicalExpr::is_volatile_node`
We added a method to `PhysicalExpr` to mark a `PhysicalExpr` as volatile:
```rust,ignore
impl PhysicalExpr for MyRandomExpr {
fn is_volatile_node(&self) -> bool {
true
}
}
```
We've shipped this with a default return value of `false` to minimize breakage, but we highly recommend that implementers of `PhysicalExpr` explicitly implement this method, even if it simply returns `false`.
You can see more discussion and example implementations in [#17351].
[#17351]: https://github.com/apache/datafusion/pull/17351
## DataFusion `49.0.0`
### `MSRV` updated to 1.85.1
The Minimum Supported Rust Version (MSRV) has been updated to [`1.85.1`]. See
[#16728] for details.
[`1.85.1`]: https://releases.rs/docs/1.85.1/
[#16728]: https://github.com/apache/datafusion/pull/16728
### `DataFusionError` variants are now `Box`ed
To reduce the size of `DataFusionError`, several variants that were previously stored inline are now `Box`ed. This reduces the size of `Result<T, DataFusionError>` and thus stack usage and async state machine size. Please see [#16652] for more details.
The following variants of `DataFusionError` are now boxed:
- `ArrowError`
- `SQL`
- `SchemaError`
This is a breaking change. Code that constructs or matches on these variants will need to be updated.
For example, to create a `SchemaError`, instead of:
```rust
# /* comment to avoid running
use datafusion_common::{DataFusionError, SchemaError};
DataFusionError::SchemaError(
SchemaError::DuplicateUnqualifiedField { name: "foo".to_string() },
Box::new(None)
)
# */
```
You now need to `Box` the inner error:
```rust
# /* comment to avoid running
use datafusion_common::{DataFusionError, SchemaError};
DataFusionError::SchemaError(
Box::new(SchemaError::DuplicateUnqualifiedField { name: "foo".to_string() }),
Box::new(None)
)
# */
```
[#16652]: https://github.com/apache/datafusion/issues/16652
### Metadata on Arrow Types is now represented by `FieldMetadata`
Metadata from the Arrow `Field` is now stored using the `FieldMetadata`
structure. In prior versions it was stored as both a `HashMap<String, String>`
and a `BTreeMap<String, String>`. `FieldMetadata` is easier to work with and
is more efficient.
To create `FieldMetadata` from a `Field`:
```rust
# /* comment to avoid running
let metadata = FieldMetadata::from(&field);
# */
```
To add metadata to a `Field`, use the `add_to_field` method:
```rust
# /* comment to avoid running
let updated_field = metadata.add_to_field(field);
# */
```
See [#16317] for details.
[#16317]: https://github.com/apache/datafusion/pull/16317
### New `datafusion.execution.spill_compression` configuration option
DataFusion 49.0.0 adds support for compressing spill files when data is written to disk during spilling query execution. A new configuration option `datafusion.execution.spill_compression` controls the compression codec used.
**Configuration:**
- **Key**: `datafusion.execution.spill_compression`
- **Default**: `uncompressed`
- **Valid values**: `uncompressed`, `lz4_frame`, `zstd`
**Usage:**
```rust
# /* comment to avoid running
use datafusion::prelude::*;
use datafusion_common::config::SpillCompression;
let config = SessionConfig::default()
.with_spill_compression(SpillCompression::Zstd);
let ctx = SessionContext::new_with_config(config);
# */
```
Or via SQL:
```sql
SET datafusion.execution.spill_compression = 'zstd';
```
For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../user-guide/configs.md) documentation.
### Deprecated `map_varchar_to_utf8view` configuration option
See [issue #16290](https://github.com/apache/datafusion/pull/16290) for more information
The old configuration
```text
datafusion.sql_parser.map_varchar_to_utf8view
```
is now **deprecated** in favor of the unified option below.\
If you previously used this to control only `VARCHAR`→`Utf8View` mapping, please migrate to `map_string_types_to_utf8view`.
---
### New `map_string_types_to_utf8view` configuration option
To unify **all** SQL string types (`CHAR`, `VARCHAR`, `TEXT`, `STRING`) to Arrow’s zero‑copy `Utf8View`, DataFusion 49.0.0 introduces:
- **Key**: `datafusion.sql_parser.map_string_types_to_utf8view`
- **Default**: `true`
**Description:**
- When **true** (default), **all** SQL string types are mapped to `Utf8View`, avoiding full‑copy UTF‑8 allocations and improving performance.
- When **false**, DataFusion falls back to the legacy `Utf8` mapping for **all** string types.
#### Examples
```rust
# /* comment to avoid running
// Disable Utf8View mapping for all SQL string types
let opts = datafusion::sql::planner::ParserOptions::new()
.with_map_string_types_to_utf8view(false);
// Verify the setting is applied
assert!(!opts.map_string_types_to_utf8view);
# */
```
---
```sql
-- Disable Utf8View mapping globally
SET datafusion.sql_parser.map_string_types_to_utf8view = false;
-- Now VARCHAR, CHAR, TEXT, STRING all use Utf8 rather than Utf8View
CREATE TABLE my_table (a VARCHAR, b TEXT, c STRING);
DESCRIBE my_table;
```
### Deprecating `SchemaAdapterFactory` and `SchemaAdapter`
We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible).
See [issue #16800](https://github.com/apache/datafusion/issues/16800) for more information.
The first place this change has taken place is in predicate pushdown for Parquet.
By default if you do not use a custom `SchemaAdapterFactory` we will use expression conversion instead.
If you do set a custom `SchemaAdapterFactory` we will continue to use it but emit a warning about that code path being deprecated.
To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`.
See the [default values](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) for an example of how to do this.
Opting into the new APIs will set you up for future changes since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion.
See [#16800] for details.
[#16800]: https://github.com/apache/datafusion/issues/16800
### `TableParquetOptions` Updated
The `TableParquetOptions` struct has a new `crypto` field to specify encryption
options for Parquet files. The `ParquetEncryptionOptions` implements `Default`
so you can upgrade your existing code like this:
```rust
# /* comment to avoid running
TableParquetOptions {
global,
column_specific_options,
key_value_metadata,
}
# */
```
To this:
```rust
# /* comment to avoid running
TableParquetOptions {
global,
column_specific_options,
key_value_metadata,
crypto: Default::default(), // New crypto field
}
# */
```
## DataFusion `48.0.1`
### `datafusion.execution.collect_statistics` now defaults to `true`
The default value of the `datafusion.execution.collect_statistics` configuration
setting is now `true`. This change impacts users that use that value directly and relied
on its default value being `false`.
This change also restores the default behavior of `ListingTable` to its previous value. If you use `ListingTable` directly
you can keep statistics collection disabled by overriding the default value in your code:
```rust
# /* comment to avoid running
ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_collect_stat(false)
// other options
# */
```
## DataFusion `48.0.0`
### `Expr::Literal` has optional metadata
The [`Expr::Literal`] variant now includes optional metadata, which allows for
carrying through Arrow field metadata to support extension types and other uses.
This means code such as
```rust
# /* comment to avoid running
match expr {
...
Expr::Literal(scalar) => ...
...
}
# */
```
Should be updated to:
```rust
# /* comment to avoid running
match expr {
...
Expr::Literal(scalar, _metadata) => ...
...
}
# */
```
Likewise, constructing an `Expr::Literal` now requires the metadata argument as well. The [`lit`] function
has not changed and returns an `Expr::Literal` with no metadata.
[`expr::literal`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#variant.Literal
[`lit`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.lit.html
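For example, constructing a literal with no attached metadata might look like this (a hedged sketch assuming the metadata argument is an `Option`):
```rust
# /* comment to avoid running
// construct a literal expression with no field metadata
let expr = Expr::Literal(ScalarValue::from("foo"), None);
# */
```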
### `Expr::WindowFunction` is now `Box`ed
`Expr::WindowFunction` is now a `Box<WindowFunction>` instead of a `WindowFunction` directly.
This change was made to reduce the size of `Expr` and improve performance when
planning queries (see [details on #16207]).
This is a breaking change, so you will need to update your code if you match
on `Expr::WindowFunction` directly. For example, if you have code like this:
```rust
# /* comment to avoid running
match expr {
Expr::WindowFunction(WindowFunction {
params:
WindowFunctionParams {
partition_by,
order_by,
..
}
}) => {
// Use partition_by and order_by as needed
}
_ => {
// other expr
}
}
# */
```
You will need to change it to:
```rust
# /* comment to avoid running
match expr {
Expr::WindowFunction(window_fun) => {
let WindowFunction {
fun,
params: WindowFunctionParams {
partition_by,
order_by,
..
},
} = window_fun.as_ref();
// Use partition_by and order_by as needed
}
_ => {
// other expr
}
}
# */
```
[details on #16207]: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659103
### The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow
The mapping of the SQL `VARCHAR` type has been changed from `Utf8` to `Utf8View`,
which improves performance for many string operations. You can read more about
`Utf8View` in the [DataFusion blog post on German-style strings].
[datafusion blog post on german-style strings]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
This means that when you create a table with a `VARCHAR` column, it will now use
`Utf8View` as the underlying data type. For example:
```sql
> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.001 seconds.
> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column | Utf8View | YES |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.000 seconds.
```
You can restore the old behavior of using `Utf8` by changing the
`datafusion.sql_parser.map_varchar_to_utf8view` configuration setting. For
example
```sql
> set datafusion.sql_parser.map_varchar_to_utf8view = false;
0 row(s) fetched.
Elapsed 0.001 seconds.
> CREATE TABLE my_table (my_column VARCHAR);
0 row(s) fetched.
Elapsed 0.014 seconds.
> DESCRIBE my_table;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| my_column | Utf8 | YES |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
```
### `ListingOptions` default for `collect_stat` changed from `true` to `false`
This makes it agree with the default for `SessionConfig`.
Most users won't be impacted by this change but if you were using `ListingOptions` directly
and relied on the default value of `collect_stat` being `true`, you will need to
explicitly set it to `true` in your code.
```rust
# /* comment to avoid running
ListingOptions::new(Arc::new(ParquetFormat::default()))
.with_collect_stat(true)
// other options
# */
```
### Processing `FieldRef` instead of `DataType` for user defined functions
In order to support metadata handling and extension types, the user defined function
traits now use a `FieldRef` rather than a `DataType` and nullability.
This gives a single interface to both of these parameters and additionally allows
access to metadata fields, which can be used for extension types.
To upgrade structs which implement `ScalarUDFImpl`: if you have implemented
`return_type_from_args`, you instead need to implement `return_field_from_args`.
If your functions do not need to handle metadata, this should be a straightforward
repackaging of the output data into a `FieldRef`. The name you specify on the
field is not important; it will be overwritten during planning. `ReturnInfo`
has been removed, so you will need to remove all references to it.
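A hedged sketch of that repackaging; the argument type name `ReturnFieldArgs` and the chosen `DataType` are assumptions, so check the `ScalarUDFImpl` documentation for your version:
```rust
# /* comment to avoid running
impl ScalarUDFImpl for MyUdf {
    // Previously: fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnInfo>
    fn return_field_from_args(&self, args: ReturnFieldArgs) -> Result<FieldRef> {
        // The field name is not important; it is overwritten during planning
        Ok(Arc::new(Field::new(self.name(), DataType::Utf8, true)))
    }
}
# */
```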
`ScalarFunctionArgs` now contains a field called `arg_fields`. You can use this
to access the metadata associated with the columnar values during invocation.
To upgrade user defined aggregate functions, there is now a function
`return_field` that will allow you to specify both metadata and nullability of
your function. You are not required to implement this if you do not need to
handle metadata.
The largest change to aggregate functions happens in the accumulator arguments.
Both the `AccumulatorArgs` and `StateFieldsArgs` now contain `FieldRef` rather
than `DataType`.
To upgrade window functions, `ExpressionArgs` now contains input fields instead
of input data types. When setting these fields, the name of the field is
not important since this gets overwritten during the planning stage. All you
should need to do is wrap your existing data types in fields with nullability
set depending on your use case.
### Physical Expressions return `Field`
To support the changes to user defined functions processing metadata, the
`PhysicalExpr` trait now must specify a return `Field` based on the input
schema. To upgrade structs which implement `PhysicalExpr`, you need to implement
the `return_field` function. There are numerous examples in the `physical-expr`
crate.
### `FileFormat::supports_filters_pushdown` replaced with `FileSource::try_pushdown_filters`
To support more general filter pushdown, the `FileFormat::supports_filters_pushdown` was replaced with
`FileSource::try_pushdown_filters`.
If you implemented a custom `FileFormat` that uses a custom `FileSource` you will need to implement
`FileSource::try_pushdown_filters`.
See `ParquetSource::try_pushdown_filters` for an example of how to implement this.
`FileFormat::supports_filters_pushdown` has been removed.
### `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` Removed
`ParquetExec`, `AvroExec`, `CsvExec`, and `JsonExec` were deprecated in
DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal
process described in the [API Deprecation Guidelines] because all the tests
cover the new `DataSourceExec` rather than the older structures. As we evolve
`DataSource`, the old structures began to show signs of "bit rotting" (not
working but no one knows due to lack of test coverage).
[api deprecation guidelines]: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
### `PartitionedFile` added as an argument to the `FileOpener` trait
This is necessary to properly fix filter pushdown for filters that combine partition
columns and file columns (e.g. `day = username['dob']`).
If you implemented a custom `FileOpener` you will need to add the `PartitionedFile` argument
but are not required to use it in any way.
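A hedged sketch of the updated trait implementation; the new parameter can be ignored if you do not need it:
```rust
# /* comment to avoid running
impl FileOpener for MyOpener {
    // A `PartitionedFile` argument has been added; it may be ignored if unused
    fn open(&self, file_meta: FileMeta, _file: PartitionedFile) -> Result<FileOpenFuture> {
        ...
    }
}
# */
```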
## DataFusion `47.0.0`
This section calls out some of the major changes in the `47.0.0` release of DataFusion.
Here are some example upgrade PRs that demonstrate changes required when upgrading from DataFusion 46.0.0:
- [delta-rs Upgrade to `47.0.0`](https://github.com/delta-io/delta-rs/pull/3378)
- [DataFusion Comet Upgrade to `47.0.0`](https://github.com/apache/datafusion-comet/pull/1563)
- [Sail Upgrade to `47.0.0`](https://github.com/lakehq/sail/pull/434)
### Upgrades to `arrow-rs` and `arrow-parquet` 55.0.0 and `object_store` 0.12.0
Several APIs in the underlying arrow and parquet libraries have changed to use
`u64` instead of `usize` to better support WASM (see [#7371] and [#6961]).
Additionally `ObjectStore::list` and `ObjectStore::list_with_offset` have been changed to return `'static` lifetimes (see [#6619]).
[#6619]: https://github.com/apache/arrow-rs/pull/6619
[#6961]: https://github.com/apache/arrow-rs/pull/6961
[#7371]: https://github.com/apache/arrow-rs/pull/7371
This requires converting from `usize` to `u64` occasionally as well as changes to `ObjectStore` implementations such as
```rust
# /* comment to avoid running
impl Objectstore {
...
// The range is now a u64 instead of usize
async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> {
self.inner.get_range(location, range).await
}
...
// the lifetime is now 'static instead of `_ (meaning the captured closure can't contain references)
// (this also applies to list_with_offset)
fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> {
self.inner.list(prefix)
}
}
# */
```
The `ParquetObjectReader` has been updated to no longer require the object size
(it can be fetched using a single suffix request). See [#7334] for details.
[#7334]: https://github.com/apache/arrow-rs/pull/7334
Pattern in DataFusion `46.0.0`:
```rust
# /* comment to avoid running
let meta: ObjectMeta = ...;
let reader = ParquetObjectReader::new(store, meta);
# */
```
Pattern in DataFusion `47.0.0`:
```rust
# /* comment to avoid running
let meta: ObjectMeta = ...;
let reader = ParquetObjectReader::new(store, meta.location)
.with_file_size(meta.size);
# */
```
### `DisplayFormatType::TreeRender`
DataFusion now supports [`tree` style explain plans]. Implementations of
`ExecutionPlan` must also provide a description in the
`DisplayFormatType::TreeRender` format. This can be the same as the existing
`DisplayFormatType::Default`.
[`tree` style explain plans]: https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default
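A hedged sketch of handling the new variant in an existing `DisplayAs` implementation, reusing the same text as the default format (the struct name is illustrative):
```rust
# /* comment to avoid running
impl DisplayAs for MyExec {
    fn fmt_as(&self, t: DisplayFormatType, f: &mut std::fmt::Formatter) -> std::fmt::Result {
        match t {
            // TreeRender can reuse the same description as the default format
            DisplayFormatType::Default
            | DisplayFormatType::Verbose
            | DisplayFormatType::TreeRender => write!(f, "MyExec"),
        }
    }
}
# */
```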
### Removed Deprecated APIs
Several APIs have been removed in this release. These were either deprecated
previously or were hard to use correctly such as the multiple different
`ScalarUDFImpl::invoke*` APIs. See [#15130], [#15123], and [#15027] for more
details.
[#15130]: https://github.com/apache/datafusion/pull/15130
[#15123]: https://github.com/apache/datafusion/pull/15123
[#15027]: https://github.com/apache/datafusion/pull/15027
### `FileScanConfig` --> `FileScanConfigBuilder`
Previously, `FileScanConfig::build()` directly created ExecutionPlans. In
DataFusion 47.0.0 this has been changed to use `FileScanConfigBuilder`. See
[#15352] for details.
[#15352]: https://github.com/apache/datafusion/pull/15352
Pattern in DataFusion `46.0.0`:
```rust
# /* comment to avoid running
let plan = FileScanConfig::new(url, schema, Arc::new(file_source))
.with_statistics(stats)
...
.build()
# */
```
Pattern in DataFusion `47.0.0`:
```rust
# /* comment to avoid running
let config = FileScanConfigBuilder::new(url, schema, Arc::new(file_source))
.with_statistics(stats)
...
.build();
let scan = DataSourceExec::from_data_source(config);
# */
```
## DataFusion `46.0.0`
### Use `invoke_with_args` instead of `invoke()` and `invoke_batch()`
DataFusion is moving to a consistent API for invoking ScalarUDFs,
[`ScalarUDFImpl::invoke_with_args()`], and deprecating
[`ScalarUDFImpl::invoke()`], [`ScalarUDFImpl::invoke_batch()`], and [`ScalarUDFImpl::invoke_no_args()`].
If you see errors such as the following it means the older APIs are being used:
```text
This feature is not implemented: Function concat does not implement invoke but called
```
To fix this error, use [`ScalarUDFImpl::invoke_with_args()`] instead, as shown
below. See [PR 14876] for an example.
Given existing code like this:
```rust
# /* comment to avoid running
impl ScalarUDFImpl for SparkConcat {
...
fn invoke_batch(&self, args: &[ColumnarValue], number_rows: usize) -> Result<ColumnarValue> {
if args
.iter()
.any(|arg| matches!(arg.data_type(), DataType::List(_)))
{
ArrayConcat::new().invoke_batch(args, number_rows)
} else {
ConcatFunc::new().invoke_batch(args, number_rows)
}
}
}
# */
```
To
```rust
# /* comment to avoid running
impl ScalarUDFImpl for SparkConcat {
...
fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
if args
.args
.iter()
.any(|arg| matches!(arg.data_type(), DataType::List(_)))
{
ArrayConcat::new().invoke_with_args(args)
} else {
ConcatFunc::new().invoke_with_args(args)
}
}
}
# */
```
[`scalarudfimpl::invoke()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke
[`scalarudfimpl::invoke_batch()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_batch
[`scalarudfimpl::invoke_no_args()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_no_args
[`scalarudfimpl::invoke_with_args()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_with_args
[pr 14876]: https://github.com/apache/datafusion/pull/14876
### `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` deprecated
DataFusion 46 has a major change to how the built-in DataSources are organized.
Instead of individual `ExecutionPlan`s for the different file formats, they now
all use `DataSourceExec` and the format-specific information is embodied in the new
traits `DataSource` and `FileSource`.
Here is more information:
- [Design Ticket]
- Change PR [PR #14224]
- Example of an Upgrade [PR in delta-rs]
[design ticket]: https://github.com/apache/datafusion/issues/13838
[pr #14224]: https://github.com/apache/datafusion/pull/14224
[pr in delta-rs]: https://github.com/delta-io/delta-rs/pull/3261
### Cookbook: Changes to `ParquetExec`
Code that looks for `ParquetExec` like this will no longer work:
```rust
# /* comment to avoid running
if let Some(parquet_exec) = plan.as_any().downcast_ref::<ParquetExec>() {
// Do something with ParquetExec here
}
# */
```
Instead, with `DataSourceExec`, the same information is now on `FileScanConfig` and
`ParquetSource`. The equivalent code is
```rust
# /* comment to avoid running
if let Some(datasource_exec) = plan.as_any().downcast_ref::<DataSourceExec>() {
if let Some(scan_config) = datasource_exec.data_source().as_any().downcast_ref::<FileScanConfig>() {
// FileGroups, and other information is on the FileScanConfig
// parquet
if let Some(parquet_source) = scan_config.file_source.as_any().downcast_ref::<ParquetSource>()
{
// Information on PruningPredicates and parquet options are here
}
}
}
# */
```
### Cookbook: Changes to `ParquetExecBuilder`
Likewise code that builds `ParquetExec` using the `ParquetExecBuilder` such as
the following must be changed:
```rust
# /* comment to avoid running
let mut exec_plan_builder = ParquetExecBuilder::new(
FileScanConfig::new(self.log_store.object_store_url(), file_schema)
.with_projection(self.projection.cloned())
.with_limit(self.limit)
.with_table_partition_cols(table_partition_cols),
)
.with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {}))
.with_table_parquet_options(parquet_options);
// Add filter
if let Some(predicate) = logical_filter {
if config.enable_parquet_pushdown {
exec_plan_builder = exec_plan_builder.with_predicate(predicate);
}
};
# */
```
New code should use `FileScanConfig` to build the appropriate `DataSourceExec`:
```rust
# /* comment to avoid running
let mut file_source = ParquetSource::new(parquet_options)
.with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {}));
// Add filter
if let Some(predicate) = logical_filter {
if config.enable_parquet_pushdown {
file_source = file_source.with_predicate(predicate);
}
};
let file_scan_config = FileScanConfig::new(
self.log_store.object_store_url(),
file_schema,
Arc::new(file_source),
)
.with_statistics(stats)
.with_projection(self.projection.cloned())
.with_limit(self.limit)
.with_table_partition_cols(table_partition_cols);
// Build the actual scan like this
parquet_scan: file_scan_config.build(),
# */
```
### `datafusion-cli` no longer automatically unescapes strings
`datafusion-cli` previously would incorrectly unescape string literals (see [ticket] for more details).
To escape `'` in SQL literals, use `''`:
```sql
> select 'it''s escaped';
+----------------------+
| Utf8("it's escaped") |
+----------------------+
| it's escaped |
+----------------------+
1 row(s) fetched.
```
To include special characters (such as newlines via `\n`) you can use an `E` literal string. For example
```sql
> select 'foo\nbar';
+------------------+
| Utf8("foo\nbar") |
+------------------+
| foo\nbar |
+------------------+
1 row(s) fetched.
Elapsed 0.005 seconds.
```
### Changes to array scalar function signatures
DataFusion 46 has changed the way scalar array function signatures are
declared. Previously, functions needed to select from a list of predefined
signatures within the `ArrayFunctionSignature` enum. Now the signatures
can be defined via a `Vec` of pseudo-types, which each correspond to a
single argument. Those pseudo-types are the variants of the
`ArrayFunctionArgument` enum and are as follows:
- `Array`: An argument of type List/LargeList/FixedSizeList. All Array
arguments must be coercible to the same type.
- `Element`: An argument that is coercible to the inner type of the `Array`
arguments.
- `Index`: An `Int64` argument.
Each of the old variants can be converted to the new format as follows:
`TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElement)`:
```rust
# use datafusion::common::utils::ListCoercion;
# use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature};
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element],
array_coercion: Some(ListCoercion::FixedSizedListToList),
});
```
`TypeSignature::ArraySignature(ArrayFunctionSignature::ElementAndArray)`:
```rust
# use datafusion::common::utils::ListCoercion;
# use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature};
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![ArrayFunctionArgument::Element, ArrayFunctionArgument::Array],
array_coercion: Some(ListCoercion::FixedSizedListToList),
});
```
`TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndIndex)`:
```rust
# use datafusion::common::utils::ListCoercion;
# use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature};
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Index],
array_coercion: None,
});
```
`TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElementAndOptionalIndex)`:
```rust
# use datafusion::common::utils::ListCoercion;
# use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature};
TypeSignature::OneOf(vec![
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element],
array_coercion: None,
}),
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![
ArrayFunctionArgument::Array,
ArrayFunctionArgument::Element,
ArrayFunctionArgument::Index,
],
array_coercion: None,
}),
]);
```
`TypeSignature::ArraySignature(ArrayFunctionSignature::Array)`:
```rust
# use datafusion::common::utils::ListCoercion;
# use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature};
TypeSignature::ArraySignature(ArrayFunctionSignature::Array {
arguments: vec![ArrayFunctionArgument::Array],
array_coercion: None,
});
```
Alternatively, you can switch to using one of the following functions which
take care of constructing the `TypeSignature` for you:
- `Signature::array_and_element`
- `Signature::array_and_element_and_optional_index`
- `Signature::array_and_index`
- `Signature::array`
[ticket]: https://github.com/apache/datafusion/issues/13286