Apache DataFusion 48.0.0 Changelog
This release consists of 267 commits from 89 contributors. See credits at the end of this changelog for more information.
Breaking changes:
- Attach Diagnostic to syntax errors #15680 (logan-keede)
- Change
flatten so it does only a level, not recursively #15160 (delamarch3) - Improve
simplify_expressions rule #15735 (xudong963) - Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511 (Garamda)
- Add Extension Type / Metadata support for Scalar UDFs #15646 (timsaucer)
- chore: fix clippy::large_enum_variant for DataFusionError #15861 (rroelke)
- Feat: introduce
ExecutionPlan::partition_statistics API #15852 (xudong963) - refactor: remove deprecated
ParquetExec #15973 (miroim) - refactor: remove deprecated
ArrowExec #16006 (miroim) - refactor: remove deprecated
MemoryExec #16007 (miroim) - refactor: remove deprecated
JsonExec #16005 (miroim) - feat: metadata handling for aggregates and window functions #15911 (timsaucer)
- Remove
Filter::having field #16154 (findepi) - Shift from Field to FieldRef for all user defined functions #16122 (timsaucer)
- Change default SQL mapping for
VARCAHR from Utf8 to Utf8View #16142 (zhuqi-lucas) - Minor: remove unused IPCWriter #16215 (alamb)
- Reduce size of
Expr struct #16207 (hendrikmakait)
Performance related:
- Apply pre-selection and computation skipping to short-circuit optimization #15694 (acking-you)
- Add a fast path for
optimize_projection #15746 (xudong963) - Speed up
optimize_projection by improving is_projection_unnecessary #15761 (xudong963) - Speed up
optimize_projection #15787 (xudong963) - Support
GroupsAccumulator for Avg duration #15748 (shruti2522) - Optimize performance of
string::ascii function #16087 (tlm365)
Implemented enhancements:
- Set DataFusion runtime configurations through SQL interface #15594 (kumarlokesh)
- feat: Add option to adjust writer buffer size for query output #15747 (m09526)
- feat: Add
datafusion-spark crate #15168 (shehabgamin) - feat: create helpers to set the max_temp_directory_size #15919 (jdrouet)
- feat: ORDER BY ALL #15772 (PokIsemaine)
- feat: support min/max for struct #15667 (chenkovsky)
- feat(proto): udf decoding fallback #15997 (leoyvens)
- feat: make error handling in indent explain consistent with that in tree #16097 (chenkovsky)
- feat: coerce to/from fixed size binary to binary view #16110 (chenkovsky)
- feat: array_length for fixed size list #16167 (chenkovsky)
- feat: ADD sha2 spark function #16168 (getChan)
- feat: create builder for disk manager #16191 (jdrouet)
- feat: Add Aggregate UDF to FFI crate #14775 (timsaucer)
- feat(small): Add
BaselineMetrics to generate_series() table function #16255 (2010YOUY01) - feat: Add Window UDFs to FFI Crate #16261 (timsaucer)
Fixed bugs:
- fix: serialize listing table without partition column #15737 (chenkovsky)
- fix: describe Parquet schema with coerce_int96 #15750 (chenkovsky)
- fix: clickbench type err #15773 (chenkovsky)
- Fix: fetch is missing in
replace_order_preserving_variants method during EnforceDistribution optimizer #15808 (xudong963) - Fix: fetch is missing in
EnforceSorting optimizer (two places) #15822 (xudong963) - fix: Avoid mistaken ILike to string equality optimization #15836 (srh)
- Map file-level column statistics to the table-level #15865 (xudong963)
- fix(avro): Respect projection order in Avro reader #15840 (nantunes)
- fix: correctly specify the nullability of
map_values return type #15901 (rluvaton) - Fix CI in main #15917 (blaginin)
- fix: sqllogictest on Windows #15932 (nuno-faria)
- fix: fold cast null to substrait typed null #15854 (discord9)
- Fix:
build_predicate_expression method doesn't process false expr correctly #15995 (xudong963) - fix: add an “expr_planners” method to SessionState #15119 (niebayes)
- fix: overcounting of memory in first/last. #15924 (ashdnazg)
- fix: track timing for coalescer's in execution time #16048 (waynexia)
- fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators #16031 (fmonjalet)
- fix: coerce int96 resolution inside of list, struct, and map types #16058 (mbutrovich)
- fix: Add coercion rules for Float16 types #15816 (etseidl)
- fix: describe escaped quoted identifiers #16082 (jfahne)
- fix: Remove trailing whitespace in
Display for LogicalPlan::Projection #16164 (atahanyorganci) - fix: metadata of join schema #16221 (chenkovsky)
- fix: add missing row count limits to TPC-H queries #16230 (0ax1)
- fix: NaN semantics in GROUP BY #16256 (chenkovsky)
Documentation updates:
- Add DataFusion 47.0.0 Upgrade Guide #15749 (alamb)
- Improve documentation for format
OPTIONS clause #15708 (marvelshan) - doc: Adding Feldera as known user #15799 (comphead)
- docs: add ArkFlow #15826 (chenquan)
- Fix
from_unixtime function documentation #15844 (Viicos) - Upgrade-guide: Downgrade “FileScanConfig –> FileScanConfigBuilder” headline #15883 (simonvandel)
- doc: Update known users docs #15895 (comphead)
- Add
union_tag scalar function #14687 (gstvg) - Fix typo in introduction.md #15910 (tom-mont)
- Add
FormatOptions to Config #15793 (blaginin) - docs: Label
bloom_filter_on_read as a reading config #15933 (nuno-faria) - Implement Parquet filter pushdown via new filter pushdown APIs #15769 (adriangb)
- Enable repartitioning on MemTable. #15409 (wiedld)
- Updated extending operators documentation #15612 (the0ninjas)
- chore: Replace MSRV link on main page with Github badge #16020 (comphead)
- Add note to upgrade guide for removal of
ParquetExec, AvroExec, CsvExec, JsonExec #16034 (alamb) - docs: Clarify that it is only the name of the field that is ignored #16052 (alamb)
- [Docs]: Added SQL example for all window functions #16074 (Adez017)
- Fix CI on main: Add window function examples in code #16102 (alamb)
- chore: Remove SMJ experimental status in docs #16072 (comphead)
- doc: fix indent format explain #16085 (chenkovsky)
- Update documentation for
datafusion.execution.collect_statistics #16100 (alamb) - Make
SessionContext::register_parquet obey collect_statistics config #16080 (adriangb) - Improve the DML / DDL Documentation #16115 (alamb)
- docs: Fix typos and minor grammatical issues in Architecture docs #16119 (patrickcsullivan)
- Set
TrackConsumersPool as default in datafusion-cli #16081 (ding-young) - Minor: Fix links in substrait readme #16156 (alamb)
- Add macro for creating DataFrame (#16090) #16104 (cj-zhukov)
- doc: Move
dataframe! example into dedicated example #16197 (comphead) - doc: add diagram to describe how DataSource, FileSource, and DataSourceExec are related #16181 (onlyjackfrost)
- Clarify documentation about gathering statistics for parquet files #16157 (alamb)
- Add change to VARCHAR in the upgrade guide #16216 (alamb)
- Add iceberg-rust to user list #16246 (jonathanc-n)
- Prepare for 48.0.0 release: Version and Changelog #16238 (xudong963)
Other:
- Enable setting default values for target_partitions and planning_concurrency #15712 (nuno-faria)
- minor: fix doc comment #15733 (niebayes)
- chore(deps-dev): bump http-proxy-middleware from 2.0.6 to 2.0.9 in /datafusion/wasmtest/datafusion-wasm-app #15738 (dependabot[bot])
- Avoid computing unnecessary statstics #15729 (xudong963)
- chore(deps): bump libc from 0.2.171 to 0.2.172 #15745 (dependabot[bot])
- Final release note touchups #15741 (alamb)
- Refactor regexp slt tests #15709 (kumarlokesh)
- ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them #15566 (adriangb)
- Coerce and simplify FixedSizeBinary equality to literal binary #15726 (leoyvens)
- Minor: simplify code in datafusion-proto #15752 (alamb)
- chore(deps): bump clap from 4.5.35 to 4.5.36 #15759 (dependabot[bot])
- Support
Accumulator for avg duration #15468 (shruti2522) - Show current SQL recursion limit in RecursionLimitExceeded error message #15644 (kumarlokesh)
- Minor: fix flaky test in
aggregate.slt #15786 (xudong963) - Minor: remove unused logic for limit pushdown #15730 (zhuqi-lucas)
- chore(deps): bump sqllogictest from 0.28.0 to 0.28.1 #15788 (dependabot[bot])
- Add try_new for LogicalPlan::Join #15757 (kumarlokesh)
- Minor: eliminate unnecessary struct creation in session state build #15800 (Rachelint)
- chore(deps): bump half from 2.5.0 to 2.6.0 #15806 (dependabot[bot])
- Add
or_fun_call and unnecessary_lazy_evaluations lints on core #15807 (Rachelint) - chore(deps): bump env_logger from 0.11.7 to 0.11.8 #15823 (dependabot[bot])
- Support unparsing
UNION for distinct results #15814 (phillipleblanc) - Add
MemoryPool::memory_limit to expose setting memory usage limit #15828 (Rachelint) - Preserve projection for inline scan #15825 (jayzhan211)
- Minor: cleanup hash table after emit all #15834 (jayzhan211)
- chore(deps): bump pyo3 from 0.24.1 to 0.24.2 #15838 (dependabot[bot])
- Minor: fix potential flaky test in aggregate.slt #15829 (bikbov)
- Fix
ILIKE expression support in SQL unparser #15820 (ewgenius) - Make
Diagnostic easy/convinient to attach by using macro and avoiding map_err #15796 (logan-keede) - Feature/benchmark config from env #15782 (ctsk)
- predicate pruning: support cast and try_cast for more types #15764 (adriangb)
- Fix: fetch is missing in
plan_with_order_breaking_variants method #15842 (xudong963) - Fix
CoalescePartitionsExec proto serialization #15824 (lewiszlw) - Fix build failure caused by new
CoalescePartitionsExec::with_fetch method #15849 (lewiszlw) - Fix ScalarValue::List comparison when the compared lists have different lengths #15856 (gabotechs)
- chore: More details to
No UDF registered error #15843 (comphead) - chore(deps): bump clap from 4.5.36 to 4.5.37 #15853 (dependabot[bot])
- Remove usage of
dbg! #15858 (phillipleblanc) - Minor: Interval singleton #15859 (jayzhan211)
- Make aggr fuzzer query builder more configurable #15851 (Rachelint)
- chore(deps): bump aws-config from 1.6.1 to 1.6.2 #15874 (dependabot[bot])
- Add slt tests for
datafusion.execution.parquet.coerce_int96 setting #15723 (alamb) - Improve
ListingTable / ListingTableOptions docs #15767 (alamb) - Migrate Optimizer tests to insta, part2 #15884 (qstommyshu)
- Improve documentation for
FileSource, DataSource and DataSourceExec #15766 (alamb) - Implement min max for dictionary types #15827 (XiangpengHao)
- chore(deps): bump blake3 from 1.8.1 to 1.8.2 #15890 (dependabot[bot])
- Respect ignore_nulls in array_agg #15544 (joroKr21)
- Set HashJoin seed #15783 (ctsk)
- Saner handling of nulls inside arrays #15149 (joroKr21)
- Keeping pull request in sync with the base branch #15894 (xudong963)
- Fix
flatten scalar function when inner list is FixedSizeList #15898 (gstvg) - support OR operator in binary
evaluate_bounds #15716 (davidhewitt) - infer placeholder datatype for IN lists #15864 (kczimm)
- Fix allow_update_branch #15904 (xudong963)
- chore(deps): bump tokio from 1.44.1 to 1.44.2 #15900 (dependabot[bot])
- chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 #15909 (dependabot[bot])
- Factor out Substrait consumers into separate files #15794 (gabotechs)
- Unparse
UNNEST projection with the table column alias #15879 (goldmedal) - Migrate Optimizer tests to insta, part3 #15893 (qstommyshu)
- Minor: cleanup datafusion-spark scalar functions #15921 (alamb)
- Fix ClickBench extended queries after update to APPROX_PERCENTILE_CONT #15929 (alamb)
- Add extended query for checking improvement for blocked groups optimization #15936 (Rachelint)
- Speedup
character_length #15931 (Dandandan) - chore(deps): bump tokio-util from 0.7.14 to 0.7.15 #15918 (dependabot[bot])
- Migrate Optimizer tests to insta, part4 #15937 (qstommyshu)
- fix query results for predicates referencing partition columns and data columns #15935 (adriangb)
- chore(deps): bump substrait from 0.55.0 to 0.55.1 #15941 (dependabot[bot])
- Fix main CI by adding
rowsort to slt test #15942 (xudong963) - Improve sqllogictest error reporting #15905 (gabotechs)
- refactor filter pushdown apis #15801 (adriangb)
- Add additional tests for filter pushdown apis #15955 (adriangb)
- Improve filter pushdown optimizer rule performance #15959 (adriangb)
- Reduce rehashing cost for primitive grouping by also reusing hash value #15962 (Rachelint)
- chore(deps): bump chrono from 0.4.40 to 0.4.41 #15956 (dependabot[bot])
- refactor: replace
unwrap_or with unwrap_or_else for improved lazy… #15841 (NevroHelios) - add benchmark code for
Reuse rows in row cursor stream #15913 (acking-you) - [Update] : Removal of duplicate CI jobs #15966 (Adez017)
- Segfault in ByteGroupValueBuilder #15968 (thinkharderdev)
- make can_expr_be_pushed_down_with_schemas public again #15971 (adriangb)
- re-export can_expr_be_pushed_down_with_schemas to be public #15974 (adriangb)
- Migrate Optimizer tests to insta, part5 #15945 (qstommyshu)
- Show LogicalType name for
INFORMATION_SCHEMA #15965 (goldmedal) - chore(deps): bump sha2 from 0.10.8 to 0.10.9 #15970 (dependabot[bot])
- chore(deps): bump insta from 1.42.2 to 1.43.1 #15988 (dependabot[bot])
- [datafusion-spark] Add Spark-compatible hex function #15947 (andygrove)
- refactor: remove deprecated
AvroExec #15987 (miroim) - Substrait: Handle inner map fields in schema renaming #15869 (cht42)
- refactor: remove deprecated
CsvExec #15991 (miroim) - Migrate Optimizer tests to insta, part6 #15984 (qstommyshu)
- chore(deps): bump nix from 0.29.0 to 0.30.1 #16002 (dependabot[bot])
- Implement RightSemi join for SortMergeJoin #15972 (irenjj)
- Migrate Optimizer tests to insta, part7 #16010 (qstommyshu)
- chore(deps): bump sysinfo from 0.34.2 to 0.35.1 #16027 (dependabot[bot])
- refactor: move
should_enable_page_index from mod.rs to opener.rs #16026 (miroim) - chore(deps): bump sqllogictest from 0.28.1 to 0.28.2 #16037 (dependabot[bot])
- chores: Add lint rule to enforce string formatting style #16024 (Lordworms)
- Use human-readable byte sizes in
EXPLAIN #16043 (tlm365) - Docs: Add example of creating a field in
return_field_from_args #16039 (alamb) - Support
MIN and MAX for DataType::List #16025 (gabotechs) - Improve docs for Exprs and scalar functions #16036 (alamb)
- Add h2o window benchmark #16003 (2010YOUY01)
- Fix Infer prepare statement type tests #15743 (brayanjuls)
- style: simplify some strings for readability #15999 (hamirmahal)
- support simple/cross lateral joins #16015 (jayzhan211)
- Improve error message on Out of Memory #16050 (ding-young)
- chore(deps): bump the arrow-parquet group with 7 updates #16047 (dependabot[bot])
- chore(deps): bump petgraph from 0.7.1 to 0.8.1 #15669 (dependabot[bot])
- [datafusion-spark] Add Spark-compatible
char expression #15994 (andygrove) - chore(deps): bump substrait from 0.55.1 to 0.56.0 #16091 (dependabot[bot])
- Add test that demonstrate behavior for
collect_statistics #16098 (alamb) - Refactor substrait producer into multiple files #16089 (gabotechs)
- Fix temp dir leak in tests #16094 (findepi)
- Label Spark functions PRs with spark label #16095 (findepi)
- Added SLT tests for IMDB benchmark queries #16067 (kumarlokesh)
- chore(CI) Upgrade toolchain to Rust-1.87 #16068 (kadai0308)
- minor: Add benchmark query and corresponding documentation for Average Duration #16105 (logan-keede)
- Use qualified names on DELETE selections #16033 (nuno-faria)
- chore(deps): bump testcontainers from 0.23.3 to 0.24.0 #15989 (dependabot[bot])
- Clean up ExternalSorter and use upstream kernel #16109 (alamb)
- Test Duration in aggregation
fuzz tests #16111 (alamb) - Move PruningStatistics into datafusion::common #16069 (adriangb)
- Revert use file schema in parquet pruning #16086 (adriangb)
- Minor: Add
ScalarFunctionArgs::return_type method #16113 (alamb) - Fix
contains function expression #16046 (liamzwbao) - chore: Use materialized data for filter pushdown tests #16123 (comphead)
- chore: Upgrade rand crate and some other minor crates #16062 (comphead)
- Include data types in logical plans of inferred prepare statements #16019 (brayanjuls)
- CI: Fix extended test failure #16144 (2010YOUY01)
- Fix: handle column name collisions when combining UNION logical inputs & nested Column expressions in maybe_fix_physical_column_name #16064 (LiaCastaneda)
- adding support for Min/Max over LargeList and FixedSizeList #16071 (logan-keede)
- Move prepare/parameter handling tests into
params.rs #16141 (liamzwbao) - Minor: Add
Accumulator::return_type and StateFieldsArgs::return_type to help with upgrade to 48 #16112 (alamb) - Support filtering specific sqllogictests identified by line number #16029 (gabotechs)
- Enrich GroupedHashAggregateStream name to ease debugging Resources exhausted errors #16152 (ahmed-mez)
- chore(deps): bump uuid from 1.16.0 to 1.17.0 #16162 (dependabot[bot])
- Clarify docs and names in parquet predicate pushdown tests #16155 (alamb)
- Minor: Fix name() for FilterPushdown physical optimizer rule #16175 (adriangb)
- migrate tests in
pool.rs to use insta #16145 (lifan-ake) - refactor(optimizer): Add support for dynamically adding test tables #16138 (atahanyorganci)
- [Minor] Speedup TPC-H benchmark run with memtable option #16159 (Dandandan)
- Fast path for joins with distinct values in build side #16153 (Dandandan)
- chore: Reduce repetition in the parameter type inference tests #16079 (jsai28)
- chore(deps): bump tokio from 1.45.0 to 1.45.1 #16190 (dependabot[bot])
- Improve
unproject_sort_expr to handle arbitrary expressions #16127 (phillipleblanc) - chore(deps): bump rustyline from 15.0.0 to 16.0.0 #16194 (dependabot[bot])
- migrate
logical_plan tests to insta #16184 (lifan-ake) - chore(deps): bump clap from 4.5.38 to 4.5.39 #16204 (dependabot[bot])
- implement
AggregateExec.partition_statistics #15954 (UBarney) - Propagate .execute() calls immediately in
RepartitionExec #16093 (gabotechs) - Set aggregation hash seed #16165 (ctsk)
- Fix ScalarStructBuilder::build() for an empty struct #16205 (Blizzara)
- Return an error on overflow in
do_append_val_inner #16201 (liamzwbao) - chore(deps): bump testcontainers-modules from 0.12.0 to 0.12.1 #16212 (dependabot[bot])
- Substrait: handle identical grouping expressions #16189 (cht42)
- Add new stats pruning helpers to allow combining partition values in file level stats #16139 (adriangb)
- Implement schema adapter support for FileSource and add integration tests #16148 (kosiew)
- Minor: update documentation for PrunableStatistics #16213 (alamb)
- Remove use of deprecated dict_ordered in datafusion-proto (#16218) #16220 (cj-zhukov)
- Minor: Print cargo command in bench script #16236 (2010YOUY01)
- Simplify FileSource / SchemaAdapterFactory API #16214 (alamb)
- Add dicts to aggregation fuzz testing #16232 (blaginin)
- chore(deps): bump sysinfo from 0.35.1 to 0.35.2 #16247 (dependabot[bot])
- Improve performance of constant aggregate window expression #16234 (suibianwanwank)
- Support compound identifier when parsing tuples #16225 (hozan23)
- Schema adapter helper #16108 (kosiew)
- Update tpch, clickbench, sort_tpch to mark failed queries #16182 (ding-young)
- Adjust slttest to pass without RUST_BACKTRACE enabled #16251 (alamb)
- Handle dicts for distinct count #15871 (blaginin)
- Add
--substrait-round-trip option in sqllogictests #16183 (gabotechs) - Minor: fix upgrade papercut
pub use PruningStatistics #16264 (alamb)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
30 dependabot[bot]
29 Andrew Lamb
16 xudong.w
14 Adrian Garcia Badaracco
10 Chen Chongchen
8 Gabriel
8 Oleks V
7 miro
6 Tommy shu
6 kamille
5 Lokesh
5 Tim Saucer
4 Dmitrii Blaginin
4 Jay Zhan
4 Nuno Faria
4 Yongting You
4 logan-keede
3 Christian
3 Daniël Heres
3 Liam Bao
3 Phillip LeBlanc
3 Piotr Findeisen
3 ding-young
2 Andy Grove
2 Atahan Yorgancı
2 Brayan Jules
2 Georgi Krastev
2 Jax Liu
2 Jérémie Drouet
2 LB7666
2 Leonardo Yvens
2 Qi Zhu
2 Sergey Zhukov
2 Shruti Sharma
2 Tai Le Manh
2 aditya singh rathore
2 ake
2 cht42
2 gstvg
2 kosiew
2 niebayes
2 张林伟
1 Ahmed Mezghani
1 Alexander Droste
1 Andy Yen
1 Arka Dash
1 Arttu
1 Dan Harris
1 David Hewitt
1 Davy
1 Ed Seidl
1 Eshed Schacham
1 Evgenii Khramkov
1 Florent Monjalet
1 Galim Bikbov
1 Garam Choi
1 Hamir Mahal
1 Hendrik Makait
1 Jonathan Chen
1 Joseph Fahnestock
1 Kevin Zimmerman
1 Lordworms
1 Lía Adriana
1 Matt Butrovich
1 Namgung Chan
1 Nelson Antunes
1 Patrick Sullivan
1 Raz Luvaton
1 Ruihang Xia
1 Ryan Roelke
1 Sam Hughes
1 Shehab Amin
1 Sile Zhou
1 Simon Vandel Sillesen
1 Tom Montgomery
1 UBarney
1 Victorien
1 Xiangpeng Hao
1 Zaki
1 chen quan
1 delamarch3
1 discord9
1 hozan23
1 irenjj
1 jsai28
1 m09526
1 suibianwanwan
1 the0ninjas
1 wiedld
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.