Apache DataFusion 48.0.0 Changelog

This release consists of 267 commits from 89 contributors. See credits at the end of this changelog for more information.

Breaking changes:

  • Attach Diagnostic to syntax errors #15680 (logan-keede)
  • Change flatten so it flattens only one level, not recursively #15160 (delamarch3)
  • Improve simplify_expressions rule #15735 (xudong963)
  • Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511 (Garamda)
  • Add Extension Type / Metadata support for Scalar UDFs #15646 (timsaucer)
  • chore: fix clippy::large_enum_variant for DataFusionError #15861 (rroelke)
  • Feat: introduce ExecutionPlan::partition_statistics API #15852 (xudong963)
  • refactor: remove deprecated ParquetExec #15973 (miroim)
  • refactor: remove deprecated ArrowExec #16006 (miroim)
  • refactor: remove deprecated MemoryExec #16007 (miroim)
  • refactor: remove deprecated JsonExec #16005 (miroim)
  • feat: metadata handling for aggregates and window functions #15911 (timsaucer)
  • Remove Filter::having field #16154 (findepi)
  • Shift from Field to FieldRef for all user defined functions #16122 (timsaucer)
  • Change default SQL mapping for VARCHAR from Utf8 to Utf8View #16142 (zhuqi-lucas)
  • Minor: remove unused IPCWriter #16215 (alamb)
  • Reduce size of Expr struct #16207 (hendrikmakait)
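
To illustrate the flatten change (#15160) listed above, here is a minimal Rust sketch of the difference between one-level and recursive flattening. It assumes a toy `Nested` enum and two hypothetical helpers, `flatten_one_level` and `flatten_recursive`. None of these are DataFusion APIs; the sketch only models the semantics described by the PR title.

```rust
// Toy model of a nested list value. This is an illustrative assumption,
// NOT DataFusion's actual array or scalar types.
#[derive(Debug, Clone, PartialEq)]
enum Nested {
    Int(i64),
    List(Vec<Nested>),
}

/// Removes exactly one level of nesting (the behavior the PR title describes).
/// Elements that are not lists are kept unchanged.
fn flatten_one_level(items: &[Nested]) -> Vec<Nested> {
    let mut out = Vec::new();
    for item in items {
        match item {
            Nested::List(inner) => out.extend(inner.iter().cloned()),
            other => out.push(other.clone()),
        }
    }
    out
}

/// Removes every level of nesting (the earlier, recursive behavior).
fn flatten_recursive(items: &[Nested]) -> Vec<Nested> {
    let mut out = Vec::new();
    for item in items {
        match item {
            Nested::List(inner) => out.extend(flatten_recursive(inner)),
            other => out.push(other.clone()),
        }
    }
    out
}

fn main() {
    // Models the nested list [[[1, 2]], [[3]]].
    let input = vec![
        Nested::List(vec![Nested::List(vec![Nested::Int(1), Nested::Int(2)])]),
        Nested::List(vec![Nested::List(vec![Nested::Int(3)])]),
    ];
    // One level removed: [[1, 2], [3]]
    println!("{:?}", flatten_one_level(&input));
    // All levels removed: [1, 2, 3]
    println!("{:?}", flatten_recursive(&input));
}
```

Queries that relied on the recursive result would, under this model, need to apply flatten once per level of nesting.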

Performance related:

  • Apply pre-selection and computation skipping to short-circuit optimization #15694 (acking-you)
  • Add a fast path for optimize_projection #15746 (xudong963)
  • Speed up optimize_projection by improving is_projection_unnecessary #15761 (xudong963)
  • Speed up optimize_projection #15787 (xudong963)
  • Support GroupsAccumulator for Avg duration #15748 (shruti2522)
  • Optimize performance of string::ascii function #16087 (tlm365)

Implemented enhancements:

  • Set DataFusion runtime configurations through SQL interface #15594 (kumarlokesh)
  • feat: Add option to adjust writer buffer size for query output #15747 (m09526)
  • feat: Add datafusion-spark crate #15168 (shehabgamin)
  • feat: create helpers to set the max_temp_directory_size #15919 (jdrouet)
  • feat: ORDER BY ALL #15772 (PokIsemaine)
  • feat: support min/max for struct #15667 (chenkovsky)
  • feat(proto): udf decoding fallback #15997 (leoyvens)
  • feat: make error handling in indent explain consistent with that in tree #16097 (chenkovsky)
  • feat: coerce to/from fixed size binary to binary view #16110 (chenkovsky)
  • feat: array_length for fixed size list #16167 (chenkovsky)
  • feat: ADD sha2 spark function #16168 (getChan)
  • feat: create builder for disk manager #16191 (jdrouet)
  • feat: Add Aggregate UDF to FFI crate #14775 (timsaucer)
  • feat(small): Add BaselineMetrics to generate_series() table function #16255 (2010YOUY01)
  • feat: Add Window UDFs to FFI Crate #16261 (timsaucer)

Fixed bugs:

  • fix: serialize listing table without partition column #15737 (chenkovsky)
  • fix: describe Parquet schema with coerce_int96 #15750 (chenkovsky)
  • fix: clickbench type err #15773 (chenkovsky)
  • Fix: fetch is missing in replace_order_preserving_variants method during EnforceDistribution optimizer #15808 (xudong963)
  • Fix: fetch is missing in EnforceSorting optimizer (two places) #15822 (xudong963)
  • fix: Avoid mistaken ILike to string equality optimization #15836 (srh)
  • Map file-level column statistics to the table-level #15865 (xudong963)
  • fix(avro): Respect projection order in Avro reader #15840 (nantunes)
  • fix: correctly specify the nullability of map_values return type #15901 (rluvaton)
  • Fix CI in main #15917 (blaginin)
  • fix: sqllogictest on Windows #15932 (nuno-faria)
  • fix: fold cast null to substrait typed null #15854 (discord9)
  • Fix: build_predicate_expression method doesn't process false expr correctly #15995 (xudong963)
  • fix: add an “expr_planners” method to SessionState #15119 (niebayes)
  • fix: overcounting of memory in first/last. #15924 (ashdnazg)
  • fix: track timing for coalescers in execution time #16048 (waynexia)
  • fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031 (fmonjalet)
  • fix: coerce int96 resolution inside of list, struct, and map types #16058 (mbutrovich)
  • fix: Add coercion rules for Float16 types #15816 (etseidl)
  • fix: describe escaped quoted identifiers #16082 (jfahne)
  • fix: Remove trailing whitespace in Display for LogicalPlan::Projection #16164 (atahanyorganci)
  • fix: metadata of join schema #16221 (chenkovsky)
  • fix: add missing row count limits to TPC-H queries #16230 (0ax1)
  • fix: NaN semantics in GROUP BY #16256 (chenkovsky)

Documentation updates:

  • Add DataFusion 47.0.0 Upgrade Guide #15749 (alamb)
  • Improve documentation for format OPTIONS clause #15708 (marvelshan)
  • doc: Adding Feldera as known user #15799 (comphead)
  • docs: add ArkFlow #15826 (chenquan)
  • Fix from_unixtime function documentation #15844 (Viicos)
  • Upgrade-guide: Downgrade “FileScanConfig --> FileScanConfigBuilder” headline #15883 (simonvandel)
  • doc: Update known users docs #15895 (comphead)
  • Add union_tag scalar function #14687 (gstvg)
  • Fix typo in introduction.md #15910 (tom-mont)
  • Add FormatOptions to Config #15793 (blaginin)
  • docs: Label bloom_filter_on_read as a reading config #15933 (nuno-faria)
  • Implement Parquet filter pushdown via new filter pushdown APIs #15769 (adriangb)
  • Enable repartitioning on MemTable. #15409 (wiedld)
  • Updated extending operators documentation #15612 (the0ninjas)
  • chore: Replace MSRV link on main page with Github badge #16020 (comphead)
  • Add note to upgrade guide for removal of ParquetExec, AvroExec, CsvExec, JsonExec #16034 (alamb)
  • docs: Clarify that it is only the name of the field that is ignored #16052 (alamb)
  • [Docs]: Added SQL example for all window functions #16074 (Adez017)
  • Fix CI on main: Add window function examples in code #16102 (alamb)
  • chore: Remove SMJ experimental status in docs #16072 (comphead)
  • doc: fix indent format explain #16085 (chenkovsky)
  • Update documentation for datafusion.execution.collect_statistics #16100 (alamb)
  • Make SessionContext::register_parquet obey collect_statistics config #16080 (adriangb)
  • Improve the DML / DDL Documentation #16115 (alamb)
  • docs: Fix typos and minor grammatical issues in Architecture docs #16119 (patrickcsullivan)
  • Set TrackConsumersPool as default in datafusion-cli #16081 (ding-young)
  • Minor: Fix links in substrait readme #16156 (alamb)
  • Add macro for creating DataFrame (#16090) #16104 (cj-zhukov)
  • doc: Move dataframe! example into dedicated example #16197 (comphead)
  • doc: add diagram to describe how DataSource, FileSource, and DataSourceExec are related #16181 (onlyjackfrost)
  • Clarify documentation about gathering statistics for parquet files #16157 (alamb)
  • Add change to VARCHAR in the upgrade guide #16216 (alamb)
  • Add iceberg-rust to user list #16246 (jonathanc-n)
  • Prepare for 48.0.0 release: Version and Changelog #16238 (xudong963)

Other:

  • Enable setting default values for target_partitions and planning_concurrency #15712 (nuno-faria)
  • minor: fix doc comment #15733 (niebayes)
  • chore(deps-dev): bump http-proxy-middleware from 2.0.6 to 2.0.9 in /datafusion/wasmtest/datafusion-wasm-app #15738 (dependabot[bot])
  • Avoid computing unnecessary statistics #15729 (xudong963)
  • chore(deps): bump libc from 0.2.171 to 0.2.172 #15745 (dependabot[bot])
  • Final release note touchups #15741 (alamb)
  • Refactor regexp slt tests #15709 (kumarlokesh)
  • ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them #15566 (adriangb)
  • Coerce and simplify FixedSizeBinary equality to literal binary #15726 (leoyvens)
  • Minor: simplify code in datafusion-proto #15752 (alamb)
  • chore(deps): bump clap from 4.5.35 to 4.5.36 #15759 (dependabot[bot])
  • Support Accumulator for avg duration #15468 (shruti2522)
  • Show current SQL recursion limit in RecursionLimitExceeded error message #15644 (kumarlokesh)
  • Minor: fix flaky test in aggregate.slt #15786 (xudong963)
  • Minor: remove unused logic for limit pushdown #15730 (zhuqi-lucas)
  • chore(deps): bump sqllogictest from 0.28.0 to 0.28.1 #15788 (dependabot[bot])
  • Add try_new for LogicalPlan::Join #15757 (kumarlokesh)
  • Minor: eliminate unnecessary struct creation in session state build #15800 (Rachelint)
  • chore(deps): bump half from 2.5.0 to 2.6.0 #15806 (dependabot[bot])
  • Add or_fun_call and unnecessary_lazy_evaluations lints on core #15807 (Rachelint)
  • chore(deps): bump env_logger from 0.11.7 to 0.11.8 #15823 (dependabot[bot])
  • Support unparsing UNION for distinct results #15814 (phillipleblanc)
  • Add MemoryPool::memory_limit to expose setting memory usage limit #15828 (Rachelint)
  • Preserve projection for inline scan #15825 (jayzhan211)
  • Minor: cleanup hash table after emit all #15834 (jayzhan211)
  • chore(deps): bump pyo3 from 0.24.1 to 0.24.2 #15838 (dependabot[bot])
  • Minor: fix potential flaky test in aggregate.slt #15829 (bikbov)
  • Fix ILIKE expression support in SQL unparser #15820 (ewgenius)
  • Make Diagnostic easy/convenient to attach by using macro and avoiding map_err #15796 (logan-keede)
  • Feature/benchmark config from env #15782 (ctsk)
  • predicate pruning: support cast and try_cast for more types #15764 (adriangb)
  • Fix: fetch is missing in plan_with_order_breaking_variants method #15842 (xudong963)
  • Fix CoalescePartitionsExec proto serialization #15824 (lewiszlw)
  • Fix build failure caused by new CoalescePartitionsExec::with_fetch method #15849 (lewiszlw)
  • Fix ScalarValue::List comparison when the compared lists have different lengths #15856 (gabotechs)
  • chore: More details to No UDF registered error #15843 (comphead)
  • chore(deps): bump clap from 4.5.36 to 4.5.37 #15853 (dependabot[bot])
  • Remove usage of dbg! #15858 (phillipleblanc)
  • Minor: Interval singleton #15859 (jayzhan211)
  • Make aggr fuzzer query builder more configurable #15851 (Rachelint)
  • chore(deps): bump aws-config from 1.6.1 to 1.6.2 #15874 (dependabot[bot])
  • Add slt tests for datafusion.execution.parquet.coerce_int96 setting #15723 (alamb)
  • Improve ListingTable / ListingTableOptions docs #15767 (alamb)
  • Migrate Optimizer tests to insta, part2 #15884 (qstommyshu)
  • Improve documentation for FileSource, DataSource and DataSourceExec #15766 (alamb)
  • Implement min max for dictionary types #15827 (XiangpengHao)
  • chore(deps): bump blake3 from 1.8.1 to 1.8.2 #15890 (dependabot[bot])
  • Respect ignore_nulls in array_agg #15544 (joroKr21)
  • Set HashJoin seed #15783 (ctsk)
  • Saner handling of nulls inside arrays #15149 (joroKr21)
  • Keeping pull request in sync with the base branch #15894 (xudong963)
  • Fix flatten scalar function when inner list is FixedSizeList #15898 (gstvg)
  • support OR operator in binary evaluate_bounds #15716 (davidhewitt)
  • infer placeholder datatype for IN lists #15864 (kczimm)
  • Fix allow_update_branch #15904 (xudong963)
  • chore(deps): bump tokio from 1.44.1 to 1.44.2 #15900 (dependabot[bot])
  • chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 #15909 (dependabot[bot])
  • Factor out Substrait consumers into separate files #15794 (gabotechs)
  • Unparse UNNEST projection with the table column alias #15879 (goldmedal)
  • Migrate Optimizer tests to insta, part3 #15893 (qstommyshu)
  • Minor: cleanup datafusion-spark scalar functions #15921 (alamb)
  • Fix ClickBench extended queries after update to APPROX_PERCENTILE_CONT #15929 (alamb)
  • Add extended query for checking improvement for blocked groups optimization #15936 (Rachelint)
  • Speedup character_length #15931 (Dandandan)
  • chore(deps): bump tokio-util from 0.7.14 to 0.7.15 #15918 (dependabot[bot])
  • Migrate Optimizer tests to insta, part4 #15937 (qstommyshu)
  • fix query results for predicates referencing partition columns and data columns #15935 (adriangb)
  • chore(deps): bump substrait from 0.55.0 to 0.55.1 #15941 (dependabot[bot])
  • Fix main CI by adding rowsort to slt test #15942 (xudong963)
  • Improve sqllogictest error reporting #15905 (gabotechs)
  • refactor filter pushdown apis #15801 (adriangb)
  • Add additional tests for filter pushdown apis #15955 (adriangb)
  • Improve filter pushdown optimizer rule performance #15959 (adriangb)
  • Reduce rehashing cost for primitive grouping by also reusing hash value #15962 (Rachelint)
  • chore(deps): bump chrono from 0.4.40 to 0.4.41 #15956 (dependabot[bot])
  • refactor: replace unwrap_or with unwrap_or_else for improved lazy… #15841 (NevroHelios)
  • add benchmark code for Reuse rows in row cursor stream #15913 (acking-you)
  • [Update] : Removal of duplicate CI jobs #15966 (Adez017)
  • Segfault in ByteGroupValueBuilder #15968 (thinkharderdev)
  • make can_expr_be_pushed_down_with_schemas public again #15971 (adriangb)
  • re-export can_expr_be_pushed_down_with_schemas to be public #15974 (adriangb)
  • Migrate Optimizer tests to insta, part5 #15945 (qstommyshu)
  • Show LogicalType name for INFORMATION_SCHEMA #15965 (goldmedal)
  • chore(deps): bump sha2 from 0.10.8 to 0.10.9 #15970 (dependabot[bot])
  • chore(deps): bump insta from 1.42.2 to 1.43.1 #15988 (dependabot[bot])
  • [datafusion-spark] Add Spark-compatible hex function #15947 (andygrove)
  • refactor: remove deprecated AvroExec #15987 (miroim)
  • Substrait: Handle inner map fields in schema renaming #15869 (cht42)
  • refactor: remove deprecated CsvExec #15991 (miroim)
  • Migrate Optimizer tests to insta, part6 #15984 (qstommyshu)
  • chore(deps): bump nix from 0.29.0 to 0.30.1 #16002 (dependabot[bot])
  • Implement RightSemi join for SortMergeJoin #15972 (irenjj)
  • Migrate Optimizer tests to insta, part7 #16010 (qstommyshu)
  • chore(deps): bump sysinfo from 0.34.2 to 0.35.1 #16027 (dependabot[bot])
  • refactor: move should_enable_page_index from mod.rs to opener.rs #16026 (miroim)
  • chore(deps): bump sqllogictest from 0.28.1 to 0.28.2 #16037 (dependabot[bot])
  • chores: Add lint rule to enforce string formatting style #16024 (Lordworms)
  • Use human-readable byte sizes in EXPLAIN #16043 (tlm365)
  • Docs: Add example of creating a field in return_field_from_args #16039 (alamb)
  • Support MIN and MAX for DataType::List #16025 (gabotechs)
  • Improve docs for Exprs and scalar functions #16036 (alamb)
  • Add h2o window benchmark #16003 (2010YOUY01)
  • Fix Infer prepare statement type tests #15743 (brayanjuls)
  • style: simplify some strings for readability #15999 (hamirmahal)
  • support simple/cross lateral joins #16015 (jayzhan211)
  • Improve error message on Out of Memory #16050 (ding-young)
  • chore(deps): bump the arrow-parquet group with 7 updates #16047 (dependabot[bot])
  • chore(deps): bump petgraph from 0.7.1 to 0.8.1 #15669 (dependabot[bot])
  • [datafusion-spark] Add Spark-compatible char expression #15994 (andygrove)
  • chore(deps): bump substrait from 0.55.1 to 0.56.0 #16091 (dependabot[bot])
  • Add test that demonstrates behavior for collect_statistics #16098 (alamb)
  • Refactor substrait producer into multiple files #16089 (gabotechs)
  • Fix temp dir leak in tests #16094 (findepi)
  • Label Spark functions PRs with spark label #16095 (findepi)
  • Added SLT tests for IMDB benchmark queries #16067 (kumarlokesh)
  • chore(CI) Upgrade toolchain to Rust-1.87 #16068 (kadai0308)
  • minor: Add benchmark query and corresponding documentation for Average Duration #16105 (logan-keede)
  • Use qualified names on DELETE selections #16033 (nuno-faria)
  • chore(deps): bump testcontainers from 0.23.3 to 0.24.0 #15989 (dependabot[bot])
  • Clean up ExternalSorter and use upstream kernel #16109 (alamb)
  • Test Duration in aggregation fuzz tests #16111 (alamb)
  • Move PruningStatistics into datafusion::common #16069 (adriangb)
  • Revert use file schema in parquet pruning #16086 (adriangb)
  • Minor: Add ScalarFunctionArgs::return_type method #16113 (alamb)
  • Fix contains function expression #16046 (liamzwbao)
  • chore: Use materialized data for filter pushdown tests #16123 (comphead)
  • chore: Upgrade rand crate and some other minor crates #16062 (comphead)
  • Include data types in logical plans of inferred prepare statements #16019 (brayanjuls)
  • CI: Fix extended test failure #16144 (2010YOUY01)
  • Fix: handle column name collisions when combining UNION logical inputs & nested Column expressions in maybe_fix_physical_column_name #16064 (LiaCastaneda)
  • adding support for Min/Max over LargeList and FixedSizeList #16071 (logan-keede)
  • Move prepare/parameter handling tests into params.rs #16141 (liamzwbao)
  • Minor: Add Accumulator::return_type and StateFieldsArgs::return_type to help with upgrade to 48 #16112 (alamb)
  • Support filtering specific sqllogictests identified by line number #16029 (gabotechs)
  • Enrich GroupedHashAggregateStream name to ease debugging Resources exhausted errors #16152 (ahmed-mez)
  • chore(deps): bump uuid from 1.16.0 to 1.17.0 #16162 (dependabot[bot])
  • Clarify docs and names in parquet predicate pushdown tests #16155 (alamb)
  • Minor: Fix name() for FilterPushdown physical optimizer rule #16175 (adriangb)
  • migrate tests in pool.rs to use insta #16145 (lifan-ake)
  • refactor(optimizer): Add support for dynamically adding test tables #16138 (atahanyorganci)
  • [Minor] Speedup TPC-H benchmark run with memtable option #16159 (Dandandan)
  • Fast path for joins with distinct values in build side #16153 (Dandandan)
  • chore: Reduce repetition in the parameter type inference tests #16079 (jsai28)
  • chore(deps): bump tokio from 1.45.0 to 1.45.1 #16190 (dependabot[bot])
  • Improve unproject_sort_expr to handle arbitrary expressions #16127 (phillipleblanc)
  • chore(deps): bump rustyline from 15.0.0 to 16.0.0 #16194 (dependabot[bot])
  • migrate logical_plan tests to insta #16184 (lifan-ake)
  • chore(deps): bump clap from 4.5.38 to 4.5.39 #16204 (dependabot[bot])
  • implement AggregateExec.partition_statistics #15954 (UBarney)
  • Propagate .execute() calls immediately in RepartitionExec #16093 (gabotechs)
  • Set aggregation hash seed #16165 (ctsk)
  • Fix ScalarStructBuilder::build() for an empty struct #16205 (Blizzara)
  • Return an error on overflow in do_append_val_inner #16201 (liamzwbao)
  • chore(deps): bump testcontainers-modules from 0.12.0 to 0.12.1 #16212 (dependabot[bot])
  • Substrait: handle identical grouping expressions #16189 (cht42)
  • Add new stats pruning helpers to allow combining partition values in file level stats #16139 (adriangb)
  • Implement schema adapter support for FileSource and add integration tests #16148 (kosiew)
  • Minor: update documentation for PrunableStatistics #16213 (alamb)
  • Remove use of deprecated dict_ordered in datafusion-proto (#16218) #16220 (cj-zhukov)
  • Minor: Print cargo command in bench script #16236 (2010YOUY01)
  • Simplify FileSource / SchemaAdapterFactory API #16214 (alamb)
  • Add dicts to aggregation fuzz testing #16232 (blaginin)
  • chore(deps): bump sysinfo from 0.35.1 to 0.35.2 #16247 (dependabot[bot])
  • Improve performance of constant aggregate window expression #16234 (suibianwanwank)
  • Support compound identifier when parsing tuples #16225 (hozan23)
  • Schema adapter helper #16108 (kosiew)
  • Update tpch, clickbench, sort_tpch to mark failed queries #16182 (ding-young)
  • Adjust slttest to pass without RUST_BACKTRACE enabled #16251 (alamb)
  • Handle dicts for distinct count #15871 (blaginin)
  • Add --substrait-round-trip option in sqllogictests #16183 (gabotechs)
  • Minor: fix upgrade papercut pub use PruningStatistics #16264 (alamb)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    30	dependabot[bot]
    29	Andrew Lamb
    16	xudong.w
    14	Adrian Garcia Badaracco
    10	Chen Chongchen
     8	Gabriel
     8	Oleks V
     7	miro
     6	Tommy shu
     6	kamille
     5	Lokesh
     5	Tim Saucer
     4	Dmitrii Blaginin
     4	Jay Zhan
     4	Nuno Faria
     4	Yongting You
     4	logan-keede
     3	Christian
     3	Daniël Heres
     3	Liam Bao
     3	Phillip LeBlanc
     3	Piotr Findeisen
     3	ding-young
     2	Andy Grove
     2	Atahan Yorgancı
     2	Brayan Jules
     2	Georgi Krastev
     2	Jax Liu
     2	Jérémie Drouet
     2	LB7666
     2	Leonardo Yvens
     2	Qi Zhu
     2	Sergey Zhukov
     2	Shruti Sharma
     2	Tai Le Manh
     2	aditya singh rathore
     2	ake
     2	cht42
     2	gstvg
     2	kosiew
     2	niebayes
     2	张林伟
     1	Ahmed Mezghani
     1	Alexander Droste
     1	Andy Yen
     1	Arka Dash
     1	Arttu
     1	Dan Harris
     1	David Hewitt
     1	Davy
     1	Ed Seidl
     1	Eshed Schacham
     1	Evgenii Khramkov
     1	Florent Monjalet
     1	Galim Bikbov
     1	Garam Choi
     1	Hamir Mahal
     1	Hendrik Makait
     1	Jonathan Chen
     1	Joseph Fahnestock
     1	Kevin Zimmerman
     1	Lordworms
     1	Lía Adriana
     1	Matt Butrovich
     1	Namgung Chan
     1	Nelson Antunes
     1	Patrick Sullivan
     1	Raz Luvaton
     1	Ruihang Xia
     1	Ryan Roelke
     1	Sam Hughes
     1	Shehab Amin
     1	Sile Zhou
     1	Simon Vandel Sillesen
     1	Tom Montgomery
     1	UBarney
     1	Victorien
     1	Xiangpeng Hao
     1	Zaki
     1	chen quan
     1	delamarch3
     1	discord9
     1	hozan23
     1	irenjj
     1	jsai28
     1	m09526
     1	suibianwanwan
     1	the0ninjas
     1	wiedld

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.