14.0.0 (2022-11-04)
Full Changelog
Breaking changes:
Implemented enhancements:
- Automatically register tables if ObjectStore root is configured #4094
- Simplify small InList expressions #4089
- Support SET command #4067
- add uuid() function to generate unique uuid per row #4045
- Publish benchmark crate so that it can be used as a library in Ballista #4016
- Add statistics methods to TableProvider trait for use in cost-based optimizations in the logical plan #3983
- Implement current_time Function #3982
- Implement current_date Function #3981
- Put common code used for testing code into datafusion/test_utils.rs #3960
- Print the configurations of ConfigOptions in an ordered way so that we can directly compare the equality of two ConfigOptions by their debug strings #3952
- Don't make dependants install protoc #3947
- Implement right anti join and support it in HashBuildProbeOrder #3946
- Implement right semi join and support it in HashBuildProbeOrder #3945
- Refactor simplify_expressions and expr_simplifier #3934
- Implement serialization for ScalarValue::FixedSizeBinary #3928
- Support inlining view / dataframes logical plan #3913
- Plans with tables from TableProviderFactorys can't be serialized #3906
- Simplify a AND a and a OR a. #3895
- Allow configuring statistics on TPC-H benchmarks #3888
- CI checks stuck in queued mode #3883
- Multiple optimizer passes #3879
- datafusion-proto does not support view table scan #3874
- TableProviderFactories need to be async and return a Result to be useful #3866
- Factorize common AND factors out of OR predicates to support filterPushDown as possible #3858
- Replace concat_ws with concat when the delimiter is empty string #3857
- Concatenate contiguous literal arguments of concat_ws when doing the expression simplification #3856
- Partition and Sort Enforcement #3854
- Enable mimalloc by default in benchmarks #3851
- Add collect statistics configuration #3847
- [SQL] - Support cache/uncache table syntax #3842
- Filter pushdown doesn't seem to apply for filter on TPC-H Q17 #3839
- Support pushdown multi-columns in PageIndex pruning. #3834
- Consolidate Expr manipulation code so it is more discoverable and make it easier to use #3808
- Leverage input array's null buffer for regex replace to optimize sparse arrays #3803
- Improve join cardinality estimation when there is no overlap in the min/max values #3802
- datafusion-cli up to date check is failing on master #3798
- Optimize benchmark q2 subquery filter #3789
- Benchmark should infer schema when running against Parquet #3776
- Allow specialized physical functions to provide hints for the array adapter #3762
- [User Guide] Add EXPLAIN to SQL reference #3755
- move type coercion for agg/agg udf #3752
- Prevent Cargo.lock for datafusion-cli being out-of-date #3744
- Add example of expr apis including simplification and coercion #3740
- support type coercion for ScalarFunction expr in the logical phase #3731
- Add support for DISTINCT projections in decorrelate_where_exists #3724
- Add type coercion rule for CONCAT and CONCAT_WS #3720
- Expose and document a simpler public API for simplify expressions #3709
- Expose + document the type coercion API publicly #3708
- Concatenate contiguous literal arguments of CONCAT during the expression simplification. #3683
- DataFusion 13.0.0 Release #3671
- Add division by 0 rules in the expression simplification #3663
- Compressed CSV/JSON Read #3641
- remove type coercion for agg #3623
- extract or clause as predicate for join rels #3577
- Improve performance of regex_replace #3518
- Add benchmarks for parquet queries with filter pushdown enabled #3457
- Make type coercion rule more robust #3390
- ViewTable::scan ignores filters and limits #3249
- Add CREATE VIEW documentation to user guide #3211
- Push additional parquet filtering into the parquet scan [EPIC] #3147
- Remove core/logical_plan module #2683
- Datafusion Optimizer Enhancement #2255
- [Optimizer] Eliminate self compare self #2252
- Break datafusion crate into smaller crates #1750
- Benchmark constellation-rs/amadeus's parquet implementation #1341
- Use parquet2 async reader in physical_plan/parquet #1058
- Table Scan Enhancement Plan #944
- Implement parquet page-level skipping with column index, using min/max stats #847
- Support min/max statistics in ParquetTable and ParquetExec #537
Fixed bugs:
- Clippy failing on master #4100
- Panic when the number of partitions of the pipeline that throws the exception is inconsistent with the number of partitions output by the query #4096
- FieldNotFound when field is available #4083
- SingleDistinctToGroupBy being applied too broadly #4082
- single_distinct_to_groupby strips qualifiers from group-by expressions #4049
- Another Internal error when parquet predicate pushdown is enabled "Error evaluating filter predicate: #4046
- Decimal multiplied by Float produces incorrect results #4035
- Cannot query external table - TableScan replaced with EmptyExec #4027
- benchmark q17 produces incorrect result #4026
- benchmark q14 produces incorrect result #4025
- benchmark q11 producing incorrect results #4023
- Internal error when parquet predicate pushdown is enabled “Error evaluating filter predicate:” #4006
- Incorrect results with parquet filtering pushdown enabled #4005
- Wrong results when parquet page index filtering is enabled #4002
- Output schema of semi join has invalid projection added after HashBuildProbeOrder #4001
- async deserialization functions are unintuitive and possibly insecure #3977
- Expr::to_bytes can produce output that hits Expr::from_bytes recursion limit #3968
- Bug on propagating arrow field metadata #3964
- Predicate still has cast when comparing Timestamp(Nano, None) to a timestamp literal, so can't be pushed down or used for pruning #3938
- Error using IN list on dictionary encoded data: InList does not support datatype Dictionary(Int32, Utf8). #3936
- Internal error in CAST from Timestamp[us] #3922
- ScalarValue not implemented for FixedSizeBinary types #3910
- [DOC] - There are unsupported DDL in the official documentation #3904
- datafusion-proto deserialize with Substring(str [from int] [for int]) fails #3901
- count(Literal) gives wrong column name #3891
- projection_push_down adds duplicate projections with multiple passes #3881
- Default physical planner generates empty relation for DROP TABLE, CREATE MEMORY TABLE, etc #3873
- Binary expression canonical names are incorrect in some cases #3865
- Using the window function lag causes panic. #3830
- chrono crate : specify 0.4.22 as the minimum version due to spurious build failures #3827
- datafusion-proto deserialize with q16 sql fails #3820
- Filter predicates should not be aliased #3795
- Write csv not save all lines of dataframe #3783
- Regression in simplifying expressions in subqueries #3760
- DataFusionError(Internal(“The size of the sorted batch is larger than the size of the input batch: 2120 > 2312”)) #3747
- “labeler” PR check is broken #3743
- DataFrame::select_columns doesn't work with names containing “.” #3733
- TPC-H Query 1 has regressed #3729
- [RUST][Datafusion] What causes “Error: Execution(“file size of 4 is less than footer”)” error? #3800
- Field names containing periods such as f.c cannot work #3682
- TableProvider implementation for DataFrame does not support filter pushdown #3681
- using Decimal(0) make system panicked #3665
- Cannot query some parquet files in S3, but they work locally #3633
- col / col returns 1 when col = 0 #3615
- register_csv allow space in table_path #3589
- Hardcoded u64 for WindowFrameBound fields #3571
- docs.rs cannot build datafusion-proto crate #3538
- Row Hash loads whole aggregation state to memory before sending #3460
- approx_percentile_cont return wrong result when scan multi parquet files. #3140
- User guide is incorrect regarding using CLI to register CSV files using schema inference #3001
- Exception: Internal error, Exception: Schema error #2938
- Version 0.6.0 Panic error during SQL execution #2738
- wrong result when operation parquet #2044
- Local object store accepts file:/// as base path, but LocalStore returns meta without the prefix. #1923
- Reading nested parquet files results in index out of bounds #1383
- (negation) with NULL literals does not work: can’t be evaluated because the expression’s type is Utf8, not signed #1192
- Inconsistent cast behavior #957
- single_distinct_to_groupby no longer drops qualifiers #4050 [sql] (andygrove)
Documentation updates:
- Clarify in docs that Identifiers are made lower-case in SQL query #2374
- Fix broken links in contributor guide #3956 (Jefffrey)
- add create view explanation #3925 (retikulum)
- Update datafusion-examples README #3814 (alamb)
- Add Seafowl to list of projects using DataFusion #3792 (mildbyte)
Closed issues:
- [QUESTION] How many times should be the function create_name called when executing a query? #3900
- Improve the Expr string format #3878
- Simplify division by zero (division by one / multiplication by zero / multiplication by one) for Decimal types as well #3643
- InList: merge check branch #2833
- Optimization InList: compare the float data type using OrderedFloat<T> #2831
- Outdated section of the add function of the contribution guide #2560
- Optimize InList implementation with native types rather than ScalarValue #2165
- Improve testing of optimizers using EXPLAIN #1118
- Crash on parsing sql query with Cyrillic letters #184
- [EPIC] Support all TPC-H queries in benchmark #158
- Implement optional second argument to ltrim and rtrim functions #144
- Benchmark crate does not have a SIMD feature #124
- ColumnarValue::into_array should not require batch #113
- [Rust] Parquet data source does not support complex types #83
Merged pull requests:
- Appease new clippy #4101 (alamb)
- minor: Split parquet reader up into smaller modules #4099 (alamb)
- [MINOR] Update SET in cli.md #4098 (waitingkuo)
- fix: Scheduler panic routing errors #4097 (yukkit)
- Automatically register tables if ObjectStore root is configured #4095 (avantgardnerio)
- minor: Use Operator::swap #4092 (alamb)
- Simplify small InListExpr #4090 (Dandandan)
- Minor: Add arrow-rs ticket reference and turn some comments into docstrings #4088 (alamb)
- Support Dictionary in InListExpr #4070 (tustvold)
- support SET variable #4069 [sql] (waitingkuo)
- Add in list bench #4068 (tustvold)
- Improve Error Handling and Readibility for downcasting StructArray #4061 (retikulum)
- Build tests separately from running #4060 (alamb)
- Simplify InListExpr ~20-70% Faster #4057 (tustvold)
- MINOR: Print unoptimized logical plan in execute_query of tpch benchmark #4056 (viirya)
- Minor: clean the code in eliminate_filter #4055 (HaoYang670)
- Implement current_time scalar function #4054 (naosense)
- Cleanup hash_utils adding support for decimal256 and f16 #4053 (tustvold)
- Fix multicolumn parquet predicate pushdown (#4046) #4048 (tustvold)
- Add CI checks that we can serde all benchmark queries #4047 (andygrove)
- Enable more benchmark verification tests #4044 (andygrove)
- Extract common parquet testing code to parquet-test-util crate #4042 (alamb)
- add uuid() function #4041 (Jimexist)
- Update to arrow 26, change timezones #4039 [sql] (tustvold)
- Fix Decimal and Floating type coerce rule #4038 (viirya)
- Reserve the literal expression of Count function #4031 [sql] (HaoYang670)
- Implement current_date scalar function #4022 (comphead)
- Fix predicate pushdown bugs: project columns within DatafusionArrowPredicate (#4005) (#4006) #4021 (tustvold)
- minor: remove redundant code/TODO #4019 (jackwener)
- Add CI check to verify that benchmark queries return the expected results #4015 (andygrove)
- Minor: Add TODO and tracking ticket reference #4012 (alamb)
- Add right anti join support and support it in HashBuildProbeOrder #4011 (Dandandan)
- MINOR: Generate expected benchmark query results #4010 (andygrove)
- Minor: remove unecessary clippy allow #4008 (alamb)
- Minor: Do what clippy says and clean up some code #4007 (alamb)
- Improve Error Handling and Readibility for downcasting Date32Array #4004 (retikulum)
- Don't add projection for semi joins in HashBuildProbeOrder #4000 (Dandandan)
- Minor: use DataType::is_nested #3995 (alamb)
- [minor] bump prettier version #3992 (Jimexist)
- Add parquet predicate pushdown metrics #3989 (alamb)
- Pin datafusion-proto build dependencies #3987 (tustvold)
- Add TableProvider.statistics method #3986 (andygrove)
- Add Pull Request guidelines to contributor guide #3985 (alamb)
- Update protos #3979 (tustvold)
- Revert async changes but keep deltalake working #3978 (avantgardnerio)
- Correctness integration test for parquet filter pushdown #3976 (alamb)
- MINOR: Stop pretty printing batches in benchmark when there are no results #3974 (andygrove)
- MINOR: Re-export Cast struct #3971 (andygrove)
- fix: check recursion limit in Expr::to_bytes #3970 (crepererum)
- [Part1] Partition and Sort Enforcement, PhysicalExpr enhancement #3969 (mingmwang)
- Support pushdown multi-columns in PageIndex pruning. #3967 (Ted-Jiang)
- Fix benchmarks README formatting #3966 (Jefffrey)
- Bug fix on DFField to Field conversion: preserve metadata #3965 (metesynnada)
- Informative Error Message for LAG and LEAD functions #3963 (mustafasrepo)
- Minor: Add some docstrings to FileScanConfig and RuntimeEnv #3962 (alamb)
- Move common code used for testing code into datafusion/test_utils #3961 (alamb)
- Update minimum chrono dependency to 0.4.22 #3959 (alamb)
- Implement right semi join and support in HashBuildProbeorder #3958 (Dandandan)
- Print the configurations of ConfigOptions in an ordered way so that we can directly compare the equality of two ConfigOptions by their debug strings #3953 (yahoNanJing)
- Vendor Generated Protobuf Code (#3947) #3950 (tustvold)
- Implement serialization for ScalarValue::FixedSizeBinary #3943 (retikulum)
- Consolidate physical join code into datafusion/core/src/physical_plan/joins #3942 (alamb)
- Add optimizer test for simplifying predicates on timestamps #3939 (alamb)
- Add test for querying predicate on dictionary #3937 (alamb)
- fix: return error for unsupported SQL #3933 (Kikkon)
- doc: fix doc about CREATE TABLE IF NOT EXISTS #3932 (jackwener)
- Refactor Expr::Cast to use a struct. #3931 [sql] (jackwener)
- minor: fix some typo. #3930 (jackwener)
- chore: update cranelift-related dependencies #3926 (xudong963)
- Change cast error from Internal to NotImplemented #3924 (alamb)
- Support inlining view / dataframes logical plan #3923 (Dandandan)
- Add test for Simplify redundant predicates #3915 (src255)
- Implement ScalarValue for FixedSizeBinary #3911 (maxburke)
- Add serde for plans with tables from TableProviderFactorys #3907 (avantgardnerio)
- Support filter/limit pushdown for views/dataframes #3905 (Dandandan)
- Factorize common AND factors out of OR predicates to support filterPu… #3903 (Ted-Jiang)
- Add Substring(str [from int] [for int]) support in datafusion-proto #3902 (r4ntix)
- Revert “Factorize common AND factors out of OR predicates to supportfilter Pu… (#3859)” #3897 (alamb)
- MINOR: Add notes on Apache Reporter #3893 (andygrove)
- Allow configuring collection of statistics during TPC-H benchmarks #3889 (isidentical)
- Improve formatting of binary expressions #3884 [sql] (andygrove)
- Multiple optimizer passes #3880 (andygrove)
- [MINOR] Update docs with newly added configuration values #3877 (alamb)
- [MINOR] Add a hint about how to resolve the Cargo.lock CI check #3876 (alamb)
- Add LogicalPlan::ViewTable support in datafusion-proto #3875 (r4ntix)
- Optimize the concat_ws function #3869 (HaoYang670)
- Implement foundational filter selectivity analysis #3868 (isidentical)
- Update TableProviderFactory trait to support real-world use-cases #3867 (avantgardnerio)
- put subquery's equal clause into join on clauses instead of filter cl… #3862 (AssHero)
- Factorize common AND factors out of OR predicates to support filterPu… #3859 (Ted-Jiang)
- Enable mimalloc by default in benchmark #3853 (Dandandan)
- Refactor Expr::Between to use a struct #3850 [sql] (b41sh)
- Handle cardinality estimation for disjoint inner and outer joins #3848 (isidentical)
- Add setting for statistics collection #3846 (Dandandan)
- Update to arrow 25.0.0 #3844 [sql] (tustvold)
- Tweak list of optimization rules #3841 (Dandandan)
- Refactor Expr::GetIndexedField to use a struct #3838 [sql] (ygf11)
- Infer the count of maximum distinct values from min/max #3837 (isidentical)
- Refactor Expr::Like, Expr::ILike, Expr::SimilarTo to use a struct #3836 [sql] (b41sh)
- Refactor Expr::BinaryExpr to use a struct #3835 [sql] (zhoudongyan)
- update postgres version to 15 in integration test #3831 (Jimexist)
- Fix the panic when lpad/rpad parameter is negative #3829 (ZuoTiJia)
- MINOR: Document SHOW ALL in the users guide #3826 (alamb)
- MINOR: Add datafusion-cli documentation on showing configuration #3825 (alamb)
- Add/Remove Division Rules #3824 (retikulum)
- Minor: Sort the output of SHOW ALL by config name #3823 [sql] (alamb)
- Add precision != 0 check when making decimal type #3818 [sql] (HaoYang670)
- Infer schema when running benchmarks against parquet #3817 (andygrove)
- Finish removing deprecated datafusion::logical_plan module #3816 (andygrove)
- Clarify initial example with respect to capitalization #3815 (alamb)
- Improve expression simplification by running it twice #3811 (alamb)
- Make expression manipulation consistent and easier to use: combine/split filter conjunction, etc #3810 (alamb)
- Consolidate expression manipulation functions into datafusion_optimizer #3809 (alamb)
- Optimize regexp_replace when the input is a sparse array #3804 (isidentical)
- Stop ignoring errors when writing DataFrame to csv, parquet, json #3801 (andygrove)
- Update datafusion-cli Cargo.lock to fix CI check on master #3799 (alamb)
- MINOR: Benchmark regression tests #3790 (andygrove)
- MINOR: Optimizer example and docs, deprecate Expr::name #3788 (andygrove)
- Join cardinality computation for cost-based nested join optimizations #3787 (isidentical)
- Optimizer now simplifies multiplication, division, module arg is a literal Decimal zero or one #3782 (drrtuy)
- Implement parquet page-level skipping with column index, using min/ma… #3780 (Ted-Jiang)
- Bump actions/labeler from 4.0.1 to 4.0.2 #3779 (dependabot[bot])
- MINOR: correct ListingOptions.try_new docs to include the enabled stat collection #3775 (isidentical)
- Teach a negative NULL expression to return NULL instead of an error #3771 (drrtuy)
- Add benchmarks for testing row filtering #3769 (thinkharderdev)
- move type coercion of agg and agg_udaf to logical phase #3768 (liukun4515)
- User Guide: Add EXPLAIN to SQL reference #3767 (unvalley)
- Allow specialized implementations to produce hints for the array adapter #3765 (isidentical)
- Fix optimizer regression with simplifying expressions in subquery filters #3764 (andygrove)
- Run all datafusion-examples in CI tests #3761 (alamb)
- MINOR: Remove deprecated module datafusion::logical_plan::plan #3759 (andygrove)
- Refactor Expr::Case to use a struct #3757 [sql] (andygrove)
- Do not run labeler CI check if it would fail due to permissions #3756 (alamb)
- MINOR: Improvements to scalar_subquery_to_join error handling #3754 (andygrove)
- Always track the final size of the in-mem sorted arrays #3753 (isidentical)
- Fix DataFrame::select_columns to handle column names with a period #3751 (zhoudongyan)
- Fix ListingTableUrl to decode percent #3750 (unvalley)
- remove type coercion for physical ScalarFunction #3749 (liukun4515)
- CI: Add a new run to check whether datafusion-cli lock file is up-to-date #3745 (isidentical)
- Add datafusion example of expression apis #3741 (alamb)
- fix subquery where exists distinct #3732 (b41sh)
- Remove some uneeded code in CommonSubexprEliminate #3730 (alamb)
- Consolidate and better tests for expression re-rewriting / aliasing #3727 (alamb)
- Fix output schema generated by CommonSubExprEliminate #3726 (alex-natzka)
- Add type coercion rule for concat and concat_ws #3721 (HaoYang670)
- Expose and document a simpler public API for simplify expressions #3719 (ygf11)
- Remove dead code in UnwrapCastExprRewriter that may mask errors #3703 (alamb)
- Fix DataFrame::with_column to handle creating column names with a period #3700 (alamb)
- Add simplification rules for the CONCAT function #3684 (HaoYang670)
- Compressed CSV/JSON support #3642 [sql] (Licht-T)
- Simplify serialization by removing redundant PrimitiveScalarValue #3612 (alamb)
- Pushdown single column predicates from ON join clauses #3578 (AssHero)
- Simplify the serialization of ScalarValue::List #3547 (alamb)
- Generate hash aggregation output in smaller record batches #3461 (milenkovicm)
- Improve doc on lowercase treatment of columns on SQL #3385 (nanicpc)