| <!--- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| |
| # Upgrade Guides |
| |
| ## DataFusion `51.0.0` |
| |
| **Note:** DataFusion `51.0.0` has not been released yet. The information provided in this section pertains to features and changes that have already been merged to the main branch and are awaiting release in this version. |
| |
| You can see the current [status of the `51.0.0` release here](https://github.com/apache/datafusion/issues/17558). |
| |
| ### Upgrade to arrow `57.0.0` and parquet `57.0.0` |
| |
| This version of DataFusion upgrades the underlying Apache Arrow implementation |
| to version `57.0.0`, including several dependent crates such as `prost`, |
| `tonic`, `pyo3`, and `substrait`. See the [release |
| notes](https://github.com/apache/arrow-rs/releases/tag/57.0.0) for more details. |
| |
| ### `MSRV` updated to 1.88.0 |
| |
| The Minimum Supported Rust Version (MSRV) has been updated to [`1.88.0`]. |
| |
| [`1.88.0`]: https://releases.rs/docs/1.88.0/ |
| |
| ### `FunctionRegistry` exposes two additional methods |
| |
| `FunctionRegistry` exposes two additional methods, `udafs` and `udwfs`, which return the set of registered user-defined aggregate and window function names, respectively. To upgrade, implement these methods so they return the names of the registered functions: |
| |
| ```diff |
| impl FunctionRegistry for FunctionRegistryImpl { |
| fn udfs(&self) -> HashSet<String> { |
| self.scalar_functions.keys().cloned().collect() |
| } |
| + fn udafs(&self) -> HashSet<String> { |
| + self.aggregate_functions.keys().cloned().collect() |
| + } |
| + |
| + fn udwfs(&self) -> HashSet<String> { |
| + self.window_functions.keys().cloned().collect() |
| + } |
| } |
| ``` |
| |
| ### `datafusion-proto` use `TaskContext` rather than `SessionContext` in physical plan serde methods |
| |
| There have been changes in the public API methods of `datafusion-proto` which handle physical plan serde. |
| |
| Methods such as `physical_plan_from_bytes` and `parse_physical_expr` now expect a `TaskContext` instead of a `SessionContext`: |
| |
| ```diff |
| - let plan2 = physical_plan_from_bytes(&bytes, &ctx)?; |
| + let plan2 = physical_plan_from_bytes(&bytes, &ctx.task_ctx())?; |
| ``` |
| |
| Because `TaskContext` contains the `RuntimeEnv`, methods such as `try_into_physical_plan` no longer take an explicit `RuntimeEnv` parameter: |
| |
| ```diff |
| let result_exec_plan: Arc<dyn ExecutionPlan> = proto |
| - .try_into_physical_plan(&ctx, runtime.deref(), &composed_codec) |
| + .try_into_physical_plan(&ctx.task_ctx(), &composed_codec) |
| ``` |
| |
| `PhysicalExtensionCodec::try_decode()` expects `TaskContext` instead of `FunctionRegistry`: |
| |
| ```diff |
| pub trait PhysicalExtensionCodec { |
| fn try_decode( |
| &self, |
| buf: &[u8], |
| inputs: &[Arc<dyn ExecutionPlan>], |
| - registry: &dyn FunctionRegistry, |
| + ctx: &TaskContext, |
| ) -> Result<Arc<dyn ExecutionPlan>>; |
| ``` |
| |
| See [issue #17601] for more details. |
| |
| [issue #17601]: https://github.com/apache/datafusion/issues/17601 |
| |
| ### `SessionState`'s `sql_to_statement` method takes `Dialect` rather than a `str` |
| |
| The `dialect` parameter of the `sql_to_statement` method defined in `datafusion::execution::session_state::SessionState` |
| has changed from `&str` to `&Dialect`. |
| `Dialect` is an enum defined in the `datafusion-common` |
| crate under the `config` module that provides type safety |
| and better validation for SQL dialect selection. |
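| |
| For example, code that previously passed the dialect as a string might change like this (a sketch; `Generic` is used for illustration, check the `Dialect` enum in `datafusion-common`'s `config` module for the exact variant names): |
| |
| ```diff |
| +use datafusion_common::config::Dialect; |
| |
| -let statement = state.sql_to_statement(sql, "generic")?; |
| +let statement = state.sql_to_statement(sql, &Dialect::Generic)?; |
| ``` |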
| |
| ### Reorganization of `ListingTable` into `datafusion-catalog-listing` crate |
| |
| There has been a long-standing request to move features such as `ListingTable` |
| out of the `datafusion` crate to support faster build times. The structs |
| `ListingOptions`, `ListingTable`, and `ListingTableConfig` are now available |
| in the `datafusion-catalog-listing` crate. They are re-exported from |
| the `datafusion` crate, so this change should have minimal impact on existing users. |
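| |
| If you prefer to depend on the new crate directly, the imports look like this (a sketch; the re-exports from `datafusion` continue to work): |
| |
| ```rust |
| # /* comment to avoid running |
| use datafusion_catalog_listing::{ListingOptions, ListingTable, ListingTableConfig}; |
| # */ |
| ``` |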
| |
| See [issue #14462] and [issue #17713] for more details. |
| |
| [issue #14462]: https://github.com/apache/datafusion/issues/14462 |
| [issue #17713]: https://github.com/apache/datafusion/issues/17713 |
| |
| ### Reorganization of `ArrowSource` into `datafusion-datasource-arrow` crate |
| |
| To support [issue #17713], the `ArrowSource` code has been moved out of |
| the `datafusion` core crate into its own crate, `datafusion-datasource-arrow`. |
| This follows the pattern used by the AVRO, CSV, JSON, and Parquet data sources. |
| Users may need to update their import paths to account for this change. |
| |
| See [issue #17713] for more details. |
| |
| ### `FileScanConfig::projection` renamed to `FileScanConfig::projection_exprs` |
| |
| The `projection` field in `FileScanConfig` has been renamed to `projection_exprs` and its type has changed from `Option<Vec<usize>>` to `Option<ProjectionExprs>`. This change enables more powerful projection pushdown capabilities by supporting arbitrary physical expressions rather than just column indices. |
| |
| **Impact on direct field access:** |
| |
| If you directly access the `projection` field: |
| |
| ```rust,ignore |
| let config: FileScanConfig = ...; |
| let projection = config.projection; |
| ``` |
| |
| You should update to: |
| |
| ```rust,ignore |
| let config: FileScanConfig = ...; |
| let projection_exprs = config.projection_exprs; |
| ``` |
| |
| **Impact on builders:** |
| |
| The `FileScanConfigBuilder::with_projection()` method has been deprecated in favor of `with_projection_indices()`: |
| |
| ```diff |
| let config = FileScanConfigBuilder::new(url, schema, file_source) |
| - .with_projection(Some(vec![0, 2, 3])) |
| + .with_projection_indices(Some(vec![0, 2, 3])) |
| .build(); |
| ``` |
| |
| Note: `with_projection()` still works but is deprecated and will be removed in a future release. |
| |
| **What is `ProjectionExprs`?** |
| |
| `ProjectionExprs` is a new type that represents a list of physical expressions for projection. While it can be constructed from column indices (which is what `with_projection_indices` does internally), it also supports arbitrary physical expressions, enabling advanced features like expression evaluation during scanning. |
| |
| You can access column indices from `ProjectionExprs` using its methods if needed: |
| |
| ```rust,ignore |
| let projection_exprs: ProjectionExprs = ...; |
| // Get the column indices if the projection only contains simple column references |
| let indices = projection_exprs.column_indices(); |
| ``` |
| |
| ### `DESCRIBE query` support |
| |
| `DESCRIBE query` was previously an alias for `EXPLAIN query`, which outputs the |
| _execution plan_ of the query. With this release, `DESCRIBE query` now outputs |
| the computed _schema_ of the query, consistent with the behavior of `DESCRIBE table_name`. |
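| |
| For example, the following now returns the schema of the query result (column name, data type, and nullability) rather than its execution plan: |
| |
| ```sql |
| DESCRIBE SELECT 1 AS id, 'foo' AS name; |
| ``` |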
| |
| ### `datafusion.execution.time_zone` default configuration changed |
| |
| The default value for `datafusion.execution.time_zone` was previously the string `+00:00` (GMT/Zulu time). |
| It has been changed to an `Option<String>` with a default of `None`. If you want to change the timezone back |
| to the previous value you can execute the following SQL: |
| |
| ```sql |
| SET |
| TIMEZONE = '+00:00'; |
| ``` |
| |
| This change was made to better support using the default timezone in scalar functions such as |
| `now`, `current_date`, `current_time`, and `to_timestamp` among others. |
| |
| ### Introduction of `TableSchema` and changes to `FileSource::with_schema()` method |
| |
| A new `TableSchema` struct has been introduced in the `datafusion-datasource` crate to better manage table schemas with partition columns. This struct helps distinguish between: |
| |
| - **File schema**: The schema of actual data files on disk |
| - **Partition columns**: Columns derived from directory structure (e.g., Hive-style partitioning) |
| - **Table schema**: The complete schema combining both file and partition columns |
| |
| As part of this change, the `FileSource::with_schema()` method signature has changed from accepting a `SchemaRef` to accepting a `TableSchema`. |
| |
| **Who is affected:** |
| |
| - Users who have implemented custom `FileSource` implementations will need to update their code |
| - Users who only use built-in file sources (Parquet, CSV, JSON, AVRO, Arrow) are not affected |
| |
| **Migration guide for custom `FileSource` implementations:** |
| |
| ```diff |
| use datafusion_datasource::file::FileSource; |
| -use arrow::datatypes::SchemaRef; |
| +use datafusion_datasource::TableSchema; |
| |
| impl FileSource for MyCustomSource { |
| - fn with_schema(&self, schema: SchemaRef) -> Arc<dyn FileSource> { |
| + fn with_schema(&self, schema: TableSchema) -> Arc<dyn FileSource> { |
| Arc::new(Self { |
| - schema: Some(schema), |
| + // Use schema.file_schema() to get the file schema without partition columns |
| + schema: Some(Arc::clone(schema.file_schema())), |
| ..self.clone() |
| }) |
| } |
| } |
| ``` |
| |
| For implementations that need access to partition columns: |
| |
| ```rust,ignore |
| fn with_schema(&self, schema: TableSchema) -> Arc<dyn FileSource> { |
| Arc::new(Self { |
| file_schema: Arc::clone(schema.file_schema()), |
| partition_cols: schema.table_partition_cols().clone(), |
| table_schema: Arc::clone(schema.table_schema()), |
| ..self.clone() |
| }) |
| } |
| ``` |
| |
| **Note**: Most `FileSource` implementations only need to store the file schema (without partition columns), as shown in the first example. The second pattern of storing all three schema components is typically only needed for advanced use cases where you need access to different schema representations for different operations (e.g., ParquetSource uses the file schema for building pruning predicates but needs the table schema for filter pushdown logic). |
| |
| **Using `TableSchema` directly:** |
| |
| If you're constructing a `FileScanConfig` or working with table schemas and partition columns, you can now use `TableSchema`: |
| |
| ```rust |
| use datafusion_datasource::TableSchema; |
| use arrow::datatypes::{Schema, Field, DataType}; |
| use std::sync::Arc; |
| |
| // Create a TableSchema with partition columns |
| let file_schema = Arc::new(Schema::new(vec![ |
| Field::new("user_id", DataType::Int64, false), |
| Field::new("amount", DataType::Float64, false), |
| ])); |
| |
| let partition_cols = vec![ |
| Arc::new(Field::new("date", DataType::Utf8, false)), |
| Arc::new(Field::new("region", DataType::Utf8, false)), |
| ]; |
| |
| let table_schema = TableSchema::new(file_schema, partition_cols); |
| |
| // Access different schema representations |
| let file_schema_ref = table_schema.file_schema(); // Schema without partition columns |
| let full_schema = table_schema.table_schema(); // Complete schema with partition columns |
| let partition_cols_ref = table_schema.table_partition_cols(); // Just the partition columns |
| ``` |
| |
| ### `AggregateUDFImpl::is_ordered_set_aggregate` has been renamed to `AggregateUDFImpl::supports_within_group_clause` |
| |
| This method has been renamed to better reflect the actual impact it has for aggregate UDF implementations. |
| The accompanying `AggregateUDF::is_ordered_set_aggregate` has also been renamed to `AggregateUDF::supports_within_group_clause`. |
| No functionality has been changed with regards to this method; it still refers only to permitting use of `WITHIN GROUP` |
| SQL syntax for the aggregate function. |
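| |
| To upgrade an implementation, rename the overridden method (a sketch using a hypothetical `MyPercentile` UDAF): |
| |
| ```diff |
| impl AggregateUDFImpl for MyPercentile { |
| -    fn is_ordered_set_aggregate(&self) -> bool { |
| +    fn supports_within_group_clause(&self) -> bool { |
|          true |
|      } |
| } |
| ``` |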
| |
| ## DataFusion `50.0.0` |
| |
| ### ListingTable automatically detects Hive Partitioned tables |
| |
| DataFusion 50.0.0 automatically infers Hive partitions when using the `ListingTableFactory` and `CREATE EXTERNAL TABLE`. Previously, |
| when creating a `ListingTable`, datasets that use Hive partitioning (e.g. |
| `/table_root/column1=value1/column2=value2/data.parquet`) would not have the Hive columns reflected in |
| the table's schema or data. The previous behavior can be |
| restored by setting the `datafusion.execution.listing_table_factory_infer_partitions` configuration option to `false`. |
| See [issue #17049] for more details. |
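| |
| For example, to restore the previous behavior for the current session: |
| |
| ```sql |
| SET datafusion.execution.listing_table_factory_infer_partitions = false; |
| ``` |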
| |
| [issue #17049]: https://github.com/apache/datafusion/issues/17049 |
| |
| ### `MSRV` updated to 1.86.0 |
| |
| The Minimum Supported Rust Version (MSRV) has been updated to [`1.86.0`]. |
| See [#17230] for details. |
| |
| [`1.86.0`]: https://releases.rs/docs/1.86.0/ |
| [#17230]: https://github.com/apache/datafusion/pull/17230 |
| |
| ### `ScalarUDFImpl`, `AggregateUDFImpl` and `WindowUDFImpl` traits now require `PartialEq`, `Eq`, and `Hash` traits |
| |
| To address the error-proneness of the `ScalarUDFImpl::equals`, `AggregateUDFImpl::equals`, and |
| `WindowUDFImpl::equals` methods and to make it easier to implement function equality correctly, |
| the `equals` and `hash_value` methods have been removed from the `ScalarUDFImpl`, `AggregateUDFImpl`, |
| and `WindowUDFImpl` traits. They have been replaced by the requirement to implement the `PartialEq`, `Eq`, |
| and `Hash` traits on any type implementing `ScalarUDFImpl`, `AggregateUDFImpl`, or `WindowUDFImpl`. |
| Please see [issue #16677] for more details. |
| |
| Most scalar functions are stateless and only have a `signature` field. These can be migrated |
| using regular expressions, as illustrated after the list below: |
| |
| - search for `\#\[derive\(Debug\)\](\n *(pub )?struct \w+ \{\n *signature\: Signature\,\n *\})`, |
| - replace with `#[derive(Debug, PartialEq, Eq, Hash)]$1`, |
| - review all the changes and make sure only function structs were changed. |
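| |
| For a typical stateless function the resulting change looks like this (a sketch with a hypothetical `MyScalarUdf` struct): |
| |
| ```diff |
| -#[derive(Debug)] |
| +#[derive(Debug, PartialEq, Eq, Hash)] |
|  pub struct MyScalarUdf { |
|      signature: Signature, |
|  } |
| ``` |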
| |
| [issue #16677]: https://github.com/apache/datafusion/issues/16677 |
| |
| ### `AsyncScalarUDFImpl::invoke_async_with_args` returns `ColumnarValue` |
| |
| In order to enable single value optimizations and be consistent with other |
| user defined function APIs, the `AsyncScalarUDFImpl::invoke_async_with_args` method now |
| returns a `ColumnarValue` instead of an `ArrayRef`. |
| |
| To upgrade, change the return type of your implementation |
| |
| ```rust |
| # /* comment to avoid running |
| impl AsyncScalarUDFImpl for AskLLM { |
| async fn invoke_async_with_args( |
| &self, |
| args: ScalarFunctionArgs, |
| _option: &ConfigOptions, |
| ) -> Result<ColumnarValue> { |
| .. |
| return array_ref; // old code |
| } |
| } |
| # */ |
| ``` |
| |
| To return a `ColumnarValue` |
| |
| ```rust |
| # /* comment to avoid running |
| impl AsyncScalarUDFImpl for AskLLM { |
| async fn invoke_async_with_args( |
| &self, |
| args: ScalarFunctionArgs, |
| _option: &ConfigOptions, |
| ) -> Result<ColumnarValue> { |
| .. |
| return ColumnarValue::from(array_ref); // new code |
| } |
| } |
| # */ |
| ``` |
| |
| See [#16896](https://github.com/apache/datafusion/issues/16896) for more details. |
| |
| ### `ProjectionExpr` changed from type alias to struct |
| |
| `ProjectionExpr` has been changed from a type alias to a struct with named fields to improve code clarity and maintainability. |
| |
| **Before:** |
| |
| ```rust,ignore |
| pub type ProjectionExpr = (Arc<dyn PhysicalExpr>, String); |
| ``` |
| |
| **After:** |
| |
| ```rust,ignore |
| #[derive(Debug, Clone)] |
| pub struct ProjectionExpr { |
| pub expr: Arc<dyn PhysicalExpr>, |
| pub alias: String, |
| } |
| ``` |
| |
| To upgrade your code: |
| |
| - Replace tuple construction `(expr, alias)` with `ProjectionExpr::new(expr, alias)` or `ProjectionExpr { expr, alias }` |
| - Replace tuple field access `.0` and `.1` with `.expr` and `.alias` |
| - Update pattern matching from `(expr, alias)` to `ProjectionExpr { expr, alias }` |
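| |
| For example (a sketch; assumes `expr` is an existing `Arc<dyn PhysicalExpr>`): |
| |
| ```diff |
| -let projection: ProjectionExpr = (expr, "name".to_string()); |
| -let output_name = projection.1; |
| +let projection = ProjectionExpr::new(expr, "name".to_string()); |
| +let output_name = &projection.alias; |
| ``` |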
| |
| This mainly impacts use of `ProjectionExec`. |
| |
| This change was done in [#17398] |
| |
| [#17398]: https://github.com/apache/datafusion/pull/17398 |
| |
| ### `SessionState`, `SessionConfig`, and `OptimizerConfig` returns `&Arc<ConfigOptions>` instead of `&ConfigOptions` |
| |
| To provide broader access to `ConfigOptions` and reduce required clones, some |
| APIs have been changed to return a `&Arc<ConfigOptions>` instead of a |
| `&ConfigOptions`. This allows sharing the same `ConfigOptions` across multiple |
| threads without needing to clone the entire `ConfigOptions` structure unless it |
| is modified. |
| |
| Most users will not be impacted by this change since the Rust compiler typically |
| dereferences the `Arc` automatically when needed. However, in some cases you may |
| have to change your code to explicitly call `as_ref()`, for example from |
| |
| ```rust |
| # /* comment to avoid running |
| let optimizer_config: &ConfigOptions = state.options(); |
| # */ |
| ``` |
| |
| To |
| |
| ```rust |
| # /* comment to avoid running |
| let optimizer_config: &ConfigOptions = state.options().as_ref(); |
| # */ |
| ``` |
| |
| See PR [#16970](https://github.com/apache/datafusion/pull/16970) |
| |
| ### API Change to `AsyncScalarUDFImpl::invoke_async_with_args` |
| |
| The `invoke_async_with_args` method of the `AsyncScalarUDFImpl` trait has been |
| updated to remove the `_option: &ConfigOptions` parameter to simplify the API |
| now that the `ConfigOptions` can be accessed through the `ScalarFunctionArgs` |
| parameter. |
| |
| You can change your code like this |
| |
| ```rust |
| # /* comment to avoid running |
| impl AsyncScalarUDFImpl for AskLLM { |
| async fn invoke_async_with_args( |
| &self, |
| args: ScalarFunctionArgs, |
| _option: &ConfigOptions, |
| ) -> Result<ArrayRef> { |
| .. |
| } |
| ... |
| } |
| # */ |
| ``` |
| |
| To this: |
| |
| ```rust |
| # /* comment to avoid running |
| |
| impl AsyncScalarUDFImpl for AskLLM { |
| async fn invoke_async_with_args( |
| &self, |
| args: ScalarFunctionArgs, |
| ) -> Result<ArrayRef> { |
| let options = &args.config_options; |
| .. |
| } |
| ... |
| } |
| # */ |
| ``` |
| |
| ### Schema Rewriter Module Moved to New Crate |
| |
| The `schema_rewriter` module and its associated symbols have been moved from `datafusion_physical_expr` to a new crate `datafusion_physical_expr_adapter`. This affects the following symbols: |
| |
| - `DefaultPhysicalExprAdapter` |
| - `DefaultPhysicalExprAdapterFactory` |
| - `PhysicalExprAdapter` |
| - `PhysicalExprAdapterFactory` |
| |
| To upgrade, change your imports to: |
| |
| ```rust |
| use datafusion_physical_expr_adapter::{ |
| DefaultPhysicalExprAdapter, DefaultPhysicalExprAdapterFactory, |
| PhysicalExprAdapter, PhysicalExprAdapterFactory |
| }; |
| ``` |
| |
| ### Upgrade to arrow `56.0.0` and parquet `56.0.0` |
| |
| This version of DataFusion upgrades the underlying Apache Arrow implementation |
| to version `56.0.0`. See the [release notes](https://github.com/apache/arrow-rs/releases/tag/56.0.0) |
| for more details. |
| |
| ### Added `ExecutionPlan::reset_state` |
| |
| In order to fix a bug in DataFusion `49.0.0` where dynamic filters (currently only generated in the presence of a query such as `ORDER BY ... LIMIT ...`) |
| produced incorrect results in recursive queries, a new method `reset_state` has been added to the `ExecutionPlan` trait. |
| |
| Any `ExecutionPlan` that needs to maintain internal state or references to other nodes in the execution plan tree should implement this method to reset that state. |
| See [#17028] for more details and an example implementation for `SortExec`. |
| |
| [#17028]: https://github.com/apache/datafusion/pull/17028 |
| |
| ### Nested Loop Join input sort order cannot be preserved |
| |
| The Nested Loop Join operator has been rewritten from scratch to improve performance and memory efficiency. Micro-benchmarks show up to a 5x speed-up, and in extreme cases the new implementation uses as little as 1% of the memory of the previous implementation. |
| |
| However, the new implementation cannot preserve input sort order like the old version could. This is a fundamental design trade-off that prioritizes performance and memory efficiency over sort order preservation. |
| |
| See [#16996] for details. |
| |
| [#16996]: https://github.com/apache/datafusion/pull/16996 |
| |
| ### Add `as_any()` method to `LazyBatchGenerator` |
| |
| To help with protobuf serialization, the `as_any()` method has been added to the `LazyBatchGenerator` trait. This means you will need to add `as_any()` to your implementation of `LazyBatchGenerator`: |
| |
| ```rust |
| # /* comment to avoid running |
| |
| impl LazyBatchGenerator for MyBatchGenerator { |
| fn as_any(&self) -> &dyn Any { |
| self |
| } |
| |
| ... |
| } |
| |
| # */ |
| ``` |
| |
| See [#17200](https://github.com/apache/datafusion/pull/17200) for details. |
| |
| ### Refactored `DataSource::try_swapping_with_projection` |
| |
| We refactored `DataSource::try_swapping_with_projection` to simplify the method and minimize leakage across the ExecutionPlan <-> DataSource abstraction layer. |
| Reimplementing it for a custom `DataSource` should be relatively straightforward; see [#17395] for more details. |
| |
| [#17395]: https://github.com/apache/datafusion/pull/17395/ |
| |
| ### `FileOpenFuture` now uses `DataFusionError` instead of `ArrowError` |
| |
| The `FileOpenFuture` type alias has been updated to use `DataFusionError` instead of `ArrowError` for its error type. This change affects the `FileOpener` trait and any implementations that work with file streaming operations. |
| |
| **Before:** |
| |
| ```rust,ignore |
| pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch, ArrowError>>>>; |
| ``` |
| |
| **After:** |
| |
| ```rust,ignore |
| pub type FileOpenFuture = BoxFuture<'static, Result<BoxStream<'static, Result<RecordBatch>>>>; |
| ``` |
| |
| If you have custom implementations of `FileOpener` or work directly with `FileOpenFuture`, you'll need to update your error handling to use `DataFusionError` instead of `ArrowError`. The `FileStreamState` enum's `Open` variant has also been updated accordingly. See [#17397] for more details. |
| |
| [#17397]: https://github.com/apache/datafusion/pull/17397 |
| |
| ### FFI user defined aggregate function signature change |
| |
| The Foreign Function Interface (FFI) signature for user defined aggregate functions |
| has been updated to call `return_field` instead of `return_type` on the underlying |
| aggregate function. This is to support metadata handling with these aggregate functions. |
| This change should be transparent to most users. If you have written unit tests to call |
| `return_type` directly, you may need to change them to calling `return_field` instead. |
| |
| This update is a breaking change to the FFI API. The current best practice when using the |
| FFI crate is to ensure that all libraries that are interacting are using the same |
| underlying Rust version. Issue [#17374] has been opened to discuss stabilization of |
| this interface so that these libraries can be used across different DataFusion versions. |
| |
| See [#17407] for details. |
| |
| [#17407]: https://github.com/apache/datafusion/pull/17407 |
| [#17374]: https://github.com/apache/datafusion/issues/17374 |
| |
| ### Added `PhysicalExpr::is_volatile_node` |
| |
| We added a method to `PhysicalExpr` to mark a `PhysicalExpr` as volatile: |
| |
| ```rust,ignore |
| impl PhysicalExpr for MyRandomExpr { |
| fn is_volatile_node(&self) -> bool { |
| true |
| } |
| } |
| ``` |
| |
| We've shipped this with a default implementation that returns `false` to minimize breakage, but we highly recommend that implementers of `PhysicalExpr` override this method explicitly, even if it simply returns `false`. |
| |
| You can see more discussion and example implementations in [#17351]. |
| |
| [#17351]: https://github.com/apache/datafusion/pull/17351 |
| |
| ## DataFusion `49.0.0` |
| |
| ### `MSRV` updated to 1.85.1 |
| |
| The Minimum Supported Rust Version (MSRV) has been updated to [`1.85.1`]. See |
| [#16728] for details. |
| |
| [`1.85.1`]: https://releases.rs/docs/1.85.1/ |
| [#16728]: https://github.com/apache/datafusion/pull/16728 |
| |
| ### `DataFusionError` variants are now `Box`ed |
| |
| To reduce the size of `DataFusionError`, several variants that were previously stored inline are now `Box`ed. This reduces the size of `Result<T, DataFusionError>` and thus stack usage and async state machine size. Please see [#16652] for more details. |
| |
| The following variants of `DataFusionError` are now boxed: |
| |
| - `ArrowError` |
| - `SQL` |
| - `SchemaError` |
| |
| This is a breaking change. Code that constructs or matches on these variants will need to be updated. |
| |
| For example, to create a `SchemaError`, instead of: |
| |
| ```rust |
| # /* comment to avoid running |
| use datafusion_common::{DataFusionError, SchemaError}; |
| DataFusionError::SchemaError( |
| SchemaError::DuplicateUnqualifiedField { name: "foo".to_string() }, |
| Box::new(None) |
| ) |
| # */ |
| ``` |
| |
| You now need to `Box` the inner error: |
| |
| ```rust |
| # /* comment to avoid running |
| use datafusion_common::{DataFusionError, SchemaError}; |
| DataFusionError::SchemaError( |
| Box::new(SchemaError::DuplicateUnqualifiedField { name: "foo".to_string() }), |
| Box::new(None) |
| ) |
| # */ |
| ``` |
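| |
| Similarly, code that matches on these variants now receives a boxed inner error (a sketch): |
| |
| ```rust |
| # /* comment to avoid running |
| match err { |
|     DataFusionError::SchemaError(schema_err, _backtrace) => { |
|         // schema_err is now a Box<SchemaError>; dereference it to inspect the variant |
|         match schema_err.as_ref() { |
|             SchemaError::DuplicateUnqualifiedField { name } => println!("duplicate field {name}"), |
|             _ => {} |
|         } |
|     } |
|     _ => {} |
| } |
| # */ |
| ``` |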
| |
| [#16652]: https://github.com/apache/datafusion/issues/16652 |
| |
| ### Metadata on Arrow Types is now represented by `FieldMetadata` |
| |
| Metadata from the Arrow `Field` is now stored using the `FieldMetadata` |
| structure. In prior versions it was stored as both a `HashMap<String, String>` |
| and a `BTreeMap<String, String>`. `FieldMetadata` is easier to work with and |
| is more efficient. |
| |
| To create `FieldMetadata` from a `Field`: |
| |
| ```rust |
| # /* comment to avoid running |
| let metadata = FieldMetadata::from(&field); |
| # */ |
| ``` |
| |
| To add metadata to a `Field`, use the `add_to_field` method: |
| |
| ```rust |
| # /* comment to avoid running |
| let updated_field = metadata.add_to_field(field); |
| # */ |
| ``` |
| |
| See [#16317] for details. |
| |
| [#16317]: https://github.com/apache/datafusion/pull/16317 |
| |
| ### New `datafusion.execution.spill_compression` configuration option |
| |
| DataFusion 49.0.0 adds support for compressing spill files when data is written to disk during spilling query execution. A new configuration option `datafusion.execution.spill_compression` controls the compression codec used. |
| |
| **Configuration:** |
| |
| - **Key**: `datafusion.execution.spill_compression` |
| - **Default**: `uncompressed` |
| - **Valid values**: `uncompressed`, `lz4_frame`, `zstd` |
| |
| **Usage:** |
| |
| ```rust |
| # /* comment to avoid running |
| use datafusion::prelude::*; |
| use datafusion_common::config::SpillCompression; |
| |
| let config = SessionConfig::default() |
| .with_spill_compression(SpillCompression::Zstd); |
| let ctx = SessionContext::new_with_config(config); |
| # */ |
| ``` |
| |
| Or via SQL: |
| |
| ```sql |
| SET datafusion.execution.spill_compression = 'zstd'; |
| ``` |
| |
| For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../user-guide/configs.md) documentation. |
| |
| ### Deprecated `map_varchar_to_utf8view` configuration option |
| |
| See [#16290](https://github.com/apache/datafusion/pull/16290) for more information. |
| The old configuration option |
| |
| ```text |
| datafusion.sql_parser.map_varchar_to_utf8view |
| ``` |
| |
| is now **deprecated** in favor of the unified option below.\ |
| If you previously used this to control only `VARCHAR`→`Utf8View` mapping, please migrate to `map_string_types_to_utf8view`. |
| |
| --- |
| |
| ### New `map_string_types_to_utf8view` configuration option |
| |
| To unify **all** SQL string types (`CHAR`, `VARCHAR`, `TEXT`, `STRING`) to Arrow’s zero‑copy `Utf8View`, DataFusion 49.0.0 introduces: |
| |
| - **Key**: `datafusion.sql_parser.map_string_types_to_utf8view` |
| - **Default**: `true` |
| |
| **Description:** |
| |
| - When **true** (default), **all** SQL string types are mapped to `Utf8View`, avoiding full‑copy UTF‑8 allocations and improving performance. |
| - When **false**, DataFusion falls back to the legacy `Utf8` mapping for **all** string types. |
| |
| #### Examples |
| |
| ```rust |
| # /* comment to avoid running |
| // Disable Utf8View mapping for all SQL string types |
| let opts = datafusion::sql::planner::ParserOptions::new() |
| .with_map_string_types_to_utf8view(false); |
| |
| // Verify the setting is applied |
| assert!(!opts.map_string_types_to_utf8view); |
| # */ |
| ``` |
| |
| --- |
| |
| ```sql |
| -- Disable Utf8View mapping globally |
| SET datafusion.sql_parser.map_string_types_to_utf8view = false; |
| |
| -- Now VARCHAR, CHAR, TEXT, STRING all use Utf8 rather than Utf8View |
| CREATE TABLE my_table (a VARCHAR, b TEXT, c STRING); |
| DESCRIBE my_table; |
| ``` |
| |
| ### Deprecating `SchemaAdapterFactory` and `SchemaAdapter` |
| |
| We are moving away from converting data (using `SchemaAdapter`) to converting the expressions themselves (which is more efficient and flexible). |
| |
| See [issue #16800](https://github.com/apache/datafusion/issues/16800) for more information. |
| The first place this change has taken effect is in predicate pushdown for Parquet. |
| By default if you do not use a custom `SchemaAdapterFactory` we will use expression conversion instead. |
| If you do set a custom `SchemaAdapterFactory` we will continue to use it but emit a warning about that code path being deprecated. |
| |
| To resolve this you need to implement a custom `PhysicalExprAdapterFactory` and use that instead of a `SchemaAdapterFactory`. |
| See the [default values](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/default_column_values.rs) for an example of how to do this. |
| Opting into the new APIs will set you up for future changes since we plan to expand use of `PhysicalExprAdapterFactory` to other areas of DataFusion. |
| |
| See [#16800] for details. |
| |
| [#16800]: https://github.com/apache/datafusion/issues/16800 |
| |
| ### `TableParquetOptions` Updated |
| |
| The `TableParquetOptions` struct has a new `crypto` field to specify encryption |
| options for Parquet files. The `ParquetEncryptionOptions` implements `Default` |
| so you can upgrade your existing code like this: |
| |
| ```rust |
| # /* comment to avoid running |
| TableParquetOptions { |
| global, |
| column_specific_options, |
| key_value_metadata, |
| } |
| # */ |
| ``` |
| |
| To this: |
| |
| ```rust |
| # /* comment to avoid running |
| TableParquetOptions { |
| global, |
| column_specific_options, |
| key_value_metadata, |
| crypto: Default::default(), // New crypto field |
| } |
| # */ |
| ``` |
| |
| ## DataFusion `48.0.1` |
| |
| ### `datafusion.execution.collect_statistics` now defaults to `true` |
| |
| The default value of the `datafusion.execution.collect_statistics` configuration |
| setting is now true. This change impacts users that use that value directly and relied |
| on its default value being `false`. |
| |
| This change also restores the previous default behavior of `ListingTable`. If you use `ListingTable` directly |
| and want to keep the `48.0.0` behavior of not collecting statistics, you can override the default value in your code: |
| |
| ```rust |
| # /* comment to avoid running |
| ListingOptions::new(Arc::new(ParquetFormat::default())) |
| .with_collect_stat(false) |
| // other options |
| # */ |
| ``` |
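| |
| Or, to disable statistics collection globally, set the configuration option via SQL: |
| |
| ```sql |
| SET datafusion.execution.collect_statistics = false; |
| ``` |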
| |
| ## DataFusion `48.0.0` |
| |
| ### `Expr::Literal` has optional metadata |
| |
| The [`Expr::Literal`] variant now includes optional metadata, which allows for |
| carrying through Arrow field metadata to support extension types and other uses. |
| |
| This means code such as |
| |
| ```rust |
| # /* comment to avoid running |
| match expr { |
| ... |
| Expr::Literal(scalar) => ... |
| ... |
| } |
| # */ |
| ``` |
| |
| Should be updated to: |
| |
| ```rust |
| # /* comment to avoid running |
| match expr { |
| ... |
| Expr::Literal(scalar, _metadata) => ... |
| ... |
| } |
| # */ |
| ``` |
| |
| Likewise, constructing an `Expr::Literal` directly now requires the metadata argument as well. The [`lit`] function |
| has not changed and returns an `Expr::Literal` with no metadata. |
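| |
| For example (a sketch): |
| |
| ```rust |
| # /* comment to avoid running |
| // Unchanged: `lit` creates an Expr::Literal with no metadata |
| let expr = lit(42i64); |
| |
| // Constructing the variant directly now takes the optional metadata as a second argument |
| let expr = Expr::Literal(ScalarValue::Int64(Some(42)), None); |
| # */ |
| ``` |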
| |
| [`expr::literal`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#variant.Literal |
| [`lit`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.lit.html |
| |
| ### `Expr::WindowFunction` is now `Box`ed |
| |
| `Expr::WindowFunction` is now a `Box<WindowFunction>` instead of a `WindowFunction` directly. |
| This change was made to reduce the size of `Expr` and improve performance when |
| planning queries (see [details on #16207]). |
| |
| This is a breaking change, so you will need to update your code if you match |
| on `Expr::WindowFunction` directly. For example, if you have code like this: |
| |
| ```rust |
| # /* comment to avoid running |
| match expr { |
| Expr::WindowFunction(WindowFunction { |
| params: |
| WindowFunctionParams { |
| partition_by, |
| order_by, |
| .. |
| } |
| }) => { |
| // Use partition_by and order_by as needed |
| } |
| _ => { |
| // other expr |
| } |
| } |
| # */ |
| ``` |
| |
| You will need to change it to: |
| |
| ```rust |
| # /* comment to avoid running |
| match expr { |
| Expr::WindowFunction(window_fun) => { |
| let WindowFunction { |
| fun, |
| params: WindowFunctionParams { |
| args, |
| partition_by, |
| order_by, |
| .. |
| }, |
| } = window_fun.as_ref(); |
| // Use partition_by and order_by as needed |
| } |
| _ => { |
| // other expr |
| } |
| } |
| # */ |
| ``` |
| |
| [details on #16207]: https://github.com/apache/datafusion/pull/16207#issuecomment-2922659103 |
| |
| ### The `VARCHAR` SQL type is now represented as `Utf8View` in Arrow |
| |
| The mapping of the SQL `VARCHAR` type has been changed from `Utf8` to `Utf8View` |
| which improves performance for many string operations. You can read more about |
| `Utf8View` in the [DataFusion blog post on German-style strings] |
| |
| [datafusion blog post on german-style strings]: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/ |
| |
| This means that when you create a table with a `VARCHAR` column, it will now use |
| `Utf8View` as the underlying data type. For example: |
| |
| ```sql |
| > CREATE TABLE my_table (my_column VARCHAR); |
| 0 row(s) fetched. |
| Elapsed 0.001 seconds. |
| |
| > DESCRIBE my_table; |
| +-------------+-----------+-------------+ |
| | column_name | data_type | is_nullable | |
| +-------------+-----------+-------------+ |
| | my_column | Utf8View | YES | |
| +-------------+-----------+-------------+ |
| 1 row(s) fetched. |
| Elapsed 0.000 seconds. |
| ``` |
| |
| You can restore the old behavior of using `Utf8` by changing the |
| `datafusion.sql_parser.map_varchar_to_utf8view` configuration setting. For |
| example |
| |
| ```sql |
| > set datafusion.sql_parser.map_varchar_to_utf8view = false; |
| 0 row(s) fetched. |
| Elapsed 0.001 seconds. |
| |
| > CREATE TABLE my_table (my_column VARCHAR); |
| 0 row(s) fetched. |
| Elapsed 0.014 seconds. |
| |
| > DESCRIBE my_table; |
| +-------------+-----------+-------------+ |
| | column_name | data_type | is_nullable | |
| +-------------+-----------+-------------+ |
| | my_column | Utf8 | YES | |
| +-------------+-----------+-------------+ |
| 1 row(s) fetched. |
| Elapsed 0.004 seconds. |
| ``` |
| |
| ### `ListingOptions` default for `collect_stat` changed from `true` to `false` |
| |
| This makes it agree with the default for `SessionConfig`. |
| Most users won't be impacted by this change but if you were using `ListingOptions` directly |
| and relied on the default value of `collect_stat` being `true`, you will need to |
| explicitly set it to `true` in your code. |
| |
| ```rust |
| # /* comment to avoid running |
| ListingOptions::new(Arc::new(ParquetFormat::default())) |
| .with_collect_stat(true) |
| // other options |
| # */ |
| ``` |
| |
| ### Processing `FieldRef` instead of `DataType` for user defined functions |
| |
| In order to support metadata handling and extension types, user defined functions are |
| now switching to traits which use `FieldRef` rather than a `DataType` and nullability. |
| This gives a single interface to both of these parameters and additionally allows |
| access to metadata fields, which can be used for extension types. |
| |
| To upgrade structs which implement `ScalarUDFImpl`, if you have implemented |
| `return_type_from_args` you need instead to implement `return_field_from_args`. |
| If your functions do not need to handle metadata, this should be straightforward |
| repackaging of the output data into a `FieldRef`. The name you specify on the |
| field is not important. It will be overwritten during planning. `ReturnInfo` |
| has been removed, so you will need to remove all references to it. |
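| |
| A minimal sketch of the new method for a function that always returns a nullable `Utf8` value (this assumes `ReturnFieldArgs` as the argument type; adapt the data type and nullability to your function): |
| |
| ```rust |
| # /* comment to avoid running |
| impl ScalarUDFImpl for MyUdf { |
|     // Previously: fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnInfo> |
|     fn return_field_from_args(&self, _args: ReturnFieldArgs) -> Result<FieldRef> { |
|         // The field name is ignored and overwritten during planning; only the |
|         // data type, nullability, and metadata matter here |
|         Ok(Arc::new(Field::new(self.name(), DataType::Utf8, true))) |
|     } |
|     ... |
| } |
| # */ |
| ``` |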
| |
| `ScalarFunctionArgs` now contains a field called `arg_fields`. You can use this |
| to access the metadata associated with the columnar values during invocation. |
| |
| To upgrade user defined aggregate functions, there is now a function |
| `return_field` that will allow you to specify both metadata and nullability of |
| your function. You are not required to implement this if you do not need to |
| handle metadata. |
| |
| The largest change to aggregate functions happens in the accumulator arguments. |
| Both the `AccumulatorArgs` and `StateFieldsArgs` now contain `FieldRef` rather |
| than `DataType`. |
| |
| To upgrade window functions, `ExpressionArgs` now contains input fields instead |
| of input data types. When setting these fields, the name of the field is |
| not important since this gets overwritten during the planning stage. All you |
| should need to do is wrap your existing data types in fields with nullability |
| set depending on your use case. |
| |
| ### Physical Expression return `Field` |
| |
| To support the changes to user defined functions processing metadata, the |
| `PhysicalExpr` trait now must specify a return `Field` based on the input |
| schema. To upgrade structs which implement `PhysicalExpr` you need to implement |
| the `return_field` function. There are numerous examples in the `physical-expr` |
| crate. |
| |
| ### `FileFormat::supports_filters_pushdown` replaced with `FileSource::try_pushdown_filters` |
| |
| To support more general filter pushdown, the `FileFormat::supports_filters_pushdown` was replaced with |
| `FileSource::try_pushdown_filters`. |
| If you implemented a custom `FileFormat` that uses a custom `FileSource` you will need to implement |
| `FileSource::try_pushdown_filters`. |
| See `ParquetSource::try_pushdown_filters` for an example of how to implement this. |
| |
| `FileFormat::supports_filters_pushdown` has been removed. |
| |
| ### `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` Removed |
| |
| `ParquetExec`, `AvroExec`, `CsvExec`, and `JsonExec` were deprecated in |
| DataFusion 46 and are removed in DataFusion 48. This is sooner than the normal |
| process described in the [API Deprecation Guidelines] because all the tests |
| cover the new `DataSourceExec` rather than the older structures. As we evolve |
| `DataSource`, the old structures began to show signs of "bit rotting" (not |
| working but no one knows due to lack of test coverage). |
| |
| [api deprecation guidelines]: https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines |
| |
| ### `PartitionedFile` added as an argument to the `FileOpener` trait |
| |
| This is necessary to properly fix filter pushdown for filters that combine partition |
| columns and file columns (e.g. `day = username['dob']`). |
| |
| If you implemented a custom `FileOpener` you will need to add the `PartitionedFile` argument |
| but are not required to use it in any way. |
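| |
| A sketch of the change for a custom opener (parameter names are illustrative; the new `PartitionedFile` argument can be ignored if you do not need it): |
| |
| ```diff |
| impl FileOpener for MyOpener { |
| -    fn open(&self, file_meta: FileMeta) -> Result<FileOpenFuture> { |
| +    fn open(&self, file_meta: FileMeta, _file: PartitionedFile) -> Result<FileOpenFuture> { |
|          ... |
|      } |
| } |
| ``` |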
| |
| ## DataFusion `47.0.0` |
| |
| This section calls out some of the major changes in the `47.0.0` release of DataFusion. |
| |
| Here are some example upgrade PRs that demonstrate changes required when upgrading from DataFusion 46.0.0: |
| |
| - [delta-rs Upgrade to `47.0.0`](https://github.com/delta-io/delta-rs/pull/3378) |
| - [DataFusion Comet Upgrade to `47.0.0`](https://github.com/apache/datafusion-comet/pull/1563) |
| - [Sail Upgrade to `47.0.0`](https://github.com/lakehq/sail/pull/434) |
| |
| ### Upgrades to `arrow-rs` and `arrow-parquet` 55.0.0 and `object_store` 0.12.0 |
| |
| Several APIs are changed in the underlying arrow and parquet libraries to use a |
| `u64` instead of `usize` to better support WASM (See [#7371] and [#6961]) |
| |
| Additionally `ObjectStore::list` and `ObjectStore::list_with_offset` have been changed to return `static` lifetimes (See [#6619]) |
| |
| [#6619]: https://github.com/apache/arrow-rs/pull/6619 |
| [#7371]: https://github.com/apache/arrow-rs/pull/7371 |
| [#6961]: https://github.com/apache/arrow-rs/issues/6961 |
| |
| This requires converting from `usize` to `u64` occasionally as well as changes to `ObjectStore` implementations such as |
| |
| ```rust |
| # /* comment to avoid running |
| impl ObjectStore for MyObjectStore { // hypothetical wrapper around an inner store |
| ... |
| // The range is now a u64 instead of usize |
| async fn get_range(&self, location: &Path, range: Range<u64>) -> ObjectStoreResult<Bytes> { |
| self.inner.get_range(location, range).await |
| } |
| ... |
| // the lifetime is now 'static instead of `_ (meaning the captured closure can't contain references) |
| // (this also applies to list_with_offset) |
| fn list(&self, prefix: Option<&Path>) -> BoxStream<'static, ObjectStoreResult<ObjectMeta>> { |
| self.inner.list(prefix) |
| } |
| } |
| # */ |
| ``` |
| |
| The `ParquetObjectReader` has been updated to no longer require the object size |
| (it can be fetched using a single suffix request). See [#7334] for details. |
| |
| [#7334]: https://github.com/apache/arrow-rs/pull/7334 |
| |
| Pattern in DataFusion `46.0.0`: |
| |
| ```rust |
| # /* comment to avoid running |
| let meta: ObjectMeta = ...; |
| let reader = ParquetObjectReader::new(store, meta); |
| # */ |
| ``` |
| |
| Pattern in DataFusion `47.0.0`: |
| |
| ```rust |
| # /* comment to avoid running |
| let meta: ObjectMeta = ...; |
| let reader = ParquetObjectReader::new(store, location) |
| .with_file_size(meta.size); |
| # */ |
| ``` |
| |
| ### `DisplayFormatType::TreeRender` |
| |
| DataFusion now supports [`tree` style explain plans]. Implementations of |
| `ExecutionPlan` must also provide a description in the |
| `DisplayFormatType::TreeRender` format. This can be the same as the existing |
| `DisplayFormatType::Default`. |
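| |
| For example, a `DisplayAs` implementation might handle the new variant like this (a sketch that reuses a short name for the tree format): |
| |
| ```rust |
| # /* comment to avoid running |
| impl DisplayAs for MyExec { |
|     fn fmt_as(&self, t: DisplayFormatType, f: &mut fmt::Formatter) -> fmt::Result { |
|         match t { |
|             DisplayFormatType::Default | DisplayFormatType::Verbose => { |
|                 write!(f, "MyExec: ...") |
|             } |
|             // New in 47.0.0: description used by `tree` style explain plans |
|             DisplayFormatType::TreeRender => write!(f, "MyExec"), |
|         } |
|     } |
| } |
| # */ |
| ``` |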
| |
| [`tree` style explain plans]: https://datafusion.apache.org/user-guide/sql/explain.html#tree-format-default |
| |
| ### Removed Deprecated APIs |
| |
| Several APIs have been removed in this release. These were either deprecated |
| previously or were hard to use correctly such as the multiple different |
| `ScalarUDFImpl::invoke*` APIs. See [#15130], [#15123], and [#15027] for more |
| details. |
| |
| [#15130]: https://github.com/apache/datafusion/pull/15130 |
| [#15123]: https://github.com/apache/datafusion/pull/15123 |
| [#15027]: https://github.com/apache/datafusion/pull/15027 |
| |
| ### `FileScanConfig` --> `FileScanConfigBuilder` |
| |
| Previously, `FileScanConfig::build()` directly created ExecutionPlans. In |
| DataFusion 47.0.0 this has been changed to use `FileScanConfigBuilder`. See |
| [#15352] for details. |
| |
| [#15352]: https://github.com/apache/datafusion/pull/15352 |
| |
| Pattern in DataFusion `46.0.0`: |
| |
| ```rust |
| # /* comment to avoid running |
| let plan = FileScanConfig::new(url, schema, Arc::new(file_source)) |
| .with_statistics(stats) |
| ... |
| .build() |
| # */ |
| ``` |
| |
| Pattern in DataFusion `47.0.0`: |
| |
| ```rust |
| # /* comment to avoid running |
| let config = FileScanConfigBuilder::new(url, schema, Arc::new(file_source)) |
| .with_statistics(stats) |
| ... |
| .build(); |
| let scan = DataSourceExec::from_data_source(config); |
| # */ |
| ``` |
| |
| ## DataFusion `46.0.0` |
| |
| ### Use `invoke_with_args` instead of `invoke()` and `invoke_batch()` |
| |
| DataFusion is moving to a consistent API for invoking ScalarUDFs, |
| [`ScalarUDFImpl::invoke_with_args()`], and deprecating |
| [`ScalarUDFImpl::invoke()`], [`ScalarUDFImpl::invoke_batch()`], and [`ScalarUDFImpl::invoke_no_args()`] |
| |
| If you see errors such as the following it means the older APIs are being used: |
| |
| ```text |
| This feature is not implemented: Function concat does not implement invoke but called |
| ``` |
| |
| To fix this error, use [`ScalarUDFImpl::invoke_with_args()`] instead, as shown |
| below. See [PR 14876] for an example. |
| |
| Given existing code like this: |
| |
| ```rust |
| # /* comment to avoid running |
| impl ScalarUDFImpl for SparkConcat { |
| ... |
| fn invoke_batch(&self, args: &[ColumnarValue], number_rows: usize) -> Result<ColumnarValue> { |
| if args |
| .iter() |
| .any(|arg| matches!(arg.data_type(), DataType::List(_))) |
| { |
| ArrayConcat::new().invoke_batch(args, number_rows) |
| } else { |
| ConcatFunc::new().invoke_batch(args, number_rows) |
| } |
| } |
| } |
| # */ |
| ``` |
| |
| To |
| |
| ```rust |
| # /* comment to avoid running |
| impl ScalarUDFImpl for SparkConcat { |
| ... |
| fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> { |
| if args |
| .args |
| .iter() |
| .any(|arg| matches!(arg.data_type(), DataType::List(_))) |
| { |
| ArrayConcat::new().invoke_with_args(args) |
| } else { |
| ConcatFunc::new().invoke_with_args(args) |
| } |
| } |
| } |
| # */ |
| ``` |
| |
| [`scalarudfimpl::invoke()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke |
| [`scalarudfimpl::invoke_batch()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_batch |
| [`scalarudfimpl::invoke_no_args()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_no_args |
| [`scalarudfimpl::invoke_with_args()`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html#method.invoke_with_args |
| [pr 14876]: https://github.com/apache/datafusion/pull/14876 |
| |
| ### `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` deprecated |
| |
| DataFusion 46 has a major change to how the built in DataSources are organized. |
| Instead of individual `ExecutionPlan`s for the different file formats they now |
| all use `DataSourceExec` and the format specific information is embodied in new |
| traits `DataSource` and `FileSource`. |
| |
| For more information, see: |
| |
| - [Design Ticket] |
| - Change PR [PR #14224] |
| - Example of an Upgrade [PR in delta-rs] |
| |
| [design ticket]: https://github.com/apache/datafusion/issues/13838 |
| [pr #14224]: https://github.com/apache/datafusion/pull/14224 |
| [pr in delta-rs]: https://github.com/delta-io/delta-rs/pull/3261 |
| |
| ### Cookbook: Changes to downcasting `ParquetExec` |
| |
| Code that looks for `ParquetExec` like this will no longer work: |
| |
| ```rust |
| # /* comment to avoid running |
| if let Some(parquet_exec) = plan.as_any().downcast_ref::<ParquetExec>() { |
| // Do something with ParquetExec here |
| } |
| # */ |
| ``` |
| |
| Instead, with `DataSourceExec`, the same information is now on `FileScanConfig` and |
| `ParquetSource`. The equivalent code is |
| |
| ```rust |
| # /* comment to avoid running |
| if let Some(datasource_exec) = plan.as_any().downcast_ref::<DataSourceExec>() { |
| if let Some(scan_config) = datasource_exec.data_source().as_any().downcast_ref::<FileScanConfig>() { |
| // FileGroups, and other information is on the FileScanConfig |
| // parquet |
| if let Some(parquet_source) = scan_config.file_source.as_any().downcast_ref::<ParquetSource>() |
| { |
| // Information on PruningPredicates and parquet options are here |
| } |
| } |
| } |
| # */ |
| ``` |
| |
| ### Cookbook: Changes to `ParquetExecBuilder` |
| |
| Likewise code that builds `ParquetExec` using the `ParquetExecBuilder` such as |
| the following must be changed: |
| |
| ```rust |
| # /* comment to avoid running |
| let mut exec_plan_builder = ParquetExecBuilder::new( |
| FileScanConfig::new(self.log_store.object_store_url(), file_schema) |
| .with_projection(self.projection.cloned()) |
| .with_limit(self.limit) |
| .with_table_partition_cols(table_partition_cols), |
| ) |
| .with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {})) |
| .with_table_parquet_options(parquet_options); |
| |
| // Add filter |
| if let Some(predicate) = logical_filter { |
| if config.enable_parquet_pushdown { |
| exec_plan_builder = exec_plan_builder.with_predicate(predicate); |
| } |
| }; |
| # */ |
| ``` |
| |
| New code should use `FileScanConfig` to build the appropriate `DataSourceExec`: |
| |
| ```rust |
| # /* comment to avoid running |
| let mut file_source = ParquetSource::new(parquet_options) |
| .with_schema_adapter_factory(Arc::new(DeltaSchemaAdapterFactory {})); |
| |
| // Add filter |
| if let Some(predicate) = logical_filter { |
| if config.enable_parquet_pushdown { |
| file_source = file_source.with_predicate(predicate); |
| } |
| }; |
| |
| let file_scan_config = FileScanConfig::new( |
| self.log_store.object_store_url(), |
| file_schema, |
| Arc::new(file_source), |
| ) |
| .with_statistics(stats) |
| .with_projection(self.projection.cloned()) |
| .with_limit(self.limit) |
| .with_table_partition_cols(table_partition_cols); |
| |
| // Build the actual scan like this |
| parquet_scan: file_scan_config.build(), |
| # */ |
| ``` |
| |
| ### `datafusion-cli` no longer automatically unescapes strings |
| |
| `datafusion-cli` previously would incorrectly unescape string literals (see [ticket] for more details). |
| |
| To escape `'` in SQL literals, use `''`: |
| |
| ```sql |
| > select 'it''s escaped'; |
| +----------------------+ |
| | Utf8("it's escaped") | |
| +----------------------+ |
| | it's escaped | |
| +----------------------+ |
| 1 row(s) fetched. |
| ``` |
| |
| Ordinary string literals are no longer unescaped, so escape sequences such as `\n` are kept literally: |
| |
| ```sql |
| > select 'foo\nbar'; |
| +------------------+ |
| | Utf8("foo\nbar") | |
| +------------------+ |
| | foo\nbar | |
| +------------------+ |
| 1 row(s) fetched. |
| Elapsed 0.005 seconds. |
| ``` |
| |
| To include special characters (such as an actual newline) use an `E` string literal instead, for example `E'foo\nbar'`. |
| |
| ### Changes to array scalar function signatures |
| |
| DataFusion 46 has changed the way scalar array function signatures are |
| declared. Previously, functions needed to select from a list of predefined |
| signatures within the `ArrayFunctionSignature` enum. Now the signatures |
| can be defined via a `Vec` of pseudo-types, which each correspond to a |
| single argument. Those pseudo-types are the variants of the |
| `ArrayFunctionArgument` enum and are as follows: |
| |
| - `Array`: An argument of type List/LargeList/FixedSizeList. All Array |
| arguments must be coercible to the same type. |
| - `Element`: An argument that is coercible to the inner type of the `Array` |
| arguments. |
| - `Index`: An `Int64` argument. |
| |
| Each of the old variants can be converted to the new format as follows: |
| |
| `TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElement)`: |
| |
| ```rust |
| # use datafusion::common::utils::ListCoercion; |
| # use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature}; |
| |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element], |
| array_coercion: Some(ListCoercion::FixedSizedListToList), |
| }); |
| ``` |
| |
| `TypeSignature::ArraySignature(ArrayFunctionSignature::ElementAndArray)`: |
| |
| ```rust |
| # use datafusion::common::utils::ListCoercion; |
| # use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature}; |
| |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ArrayFunctionArgument::Element, ArrayFunctionArgument::Array], |
| array_coercion: Some(ListCoercion::FixedSizedListToList), |
| }); |
| ``` |
| |
| `TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndIndex)`: |
| |
| ```rust |
| # use datafusion::common::utils::ListCoercion; |
| # use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature}; |
| |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Index], |
| array_coercion: None, |
| }); |
| ``` |
| |
| `TypeSignature::ArraySignature(ArrayFunctionSignature::ArrayAndElementAndOptionalIndex)`: |
| |
| ```rust |
| # use datafusion::common::utils::ListCoercion; |
| # use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature}; |
| |
| TypeSignature::OneOf(vec![ |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ArrayFunctionArgument::Array, ArrayFunctionArgument::Element], |
| array_coercion: None, |
| }), |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ |
| ArrayFunctionArgument::Array, |
| ArrayFunctionArgument::Element, |
| ArrayFunctionArgument::Index, |
| ], |
| array_coercion: None, |
| }), |
| ]); |
| ``` |
| |
| `TypeSignature::ArraySignature(ArrayFunctionSignature::Array)`: |
| |
| ```rust |
| # use datafusion::common::utils::ListCoercion; |
| # use datafusion_expr_common::signature::{ArrayFunctionArgument, ArrayFunctionSignature, TypeSignature}; |
| |
| TypeSignature::ArraySignature(ArrayFunctionSignature::Array { |
| arguments: vec![ArrayFunctionArgument::Array], |
| array_coercion: None, |
| }); |
| ``` |
| |
| Alternatively, you can switch to using one of the following functions which |
| take care of constructing the `TypeSignature` for you: |
| |
| - `Signature::array_and_element` |
| - `Signature::array_and_element_and_optional_index` |
| - `Signature::array_and_index` |
| - `Signature::array` |
| |
| [ticket]: https://github.com/apache/datafusion/issues/13286 |