16.0.0 (2023-01-12)

Full Changelog

Breaking changes:

  • Remove unused ExecutionPlan::relies_input_order (has been replaced with required_input_ordering) #4856 (alamb)
  • Add DataFrame::into_view instead of implementing TableProvider (#2659) #4778 (tustvold)

Implemented enhancements:

  • Support custom window frame with AVG aggregate function #4845
  • Add sqllogictest for TPC-H and remove some duplicated tests #4801
  • Catalog Snapshot Isolation #4697
  • Support select .. FROM 'parquet.file' in datafusion-cli #4580

Fixed bugs:

  • Regression: write_csv result has incorrect formatting #4876
  • Incorrect results for join condition against current master branch #4844
  • Match Postgres for stddev and variance on less than 3 values #4843
  • JOIN ... USING (columns) works incorrectly with multiple columns (joined-over columns are missing in the output) #4674
  • ROW_NUMBER window function inconsistent across partitions in multi-threaded runtime #4673
  • SELECT ... FROM (tbl1 UNION tbl2) wrongly works like SELECT DISTINCT ... FROM (tbl1 UNION tbl2) #4667
  • DataFrame TableProvider Circular Reference #2659

Closed issues:

  • Remove tests from sql_integration that were ported to sqllogictest #4498
  • How to register a http url to the object_store #4491
  • optimizer: support unsigned <-> decimal for unwrap_cast_in_comparison rule #4287
  • Add SQL support for NATURAL JOIN #117
  • [Datafusion] Datafusion queries involving a column name that begins with a number produce unexpected results #108

Merged pull requests:

  • docs: improve Column::normalize_with_schemas docs #4871 (crepererum)
  • Skip EliminateCrossJoin rule when it meets a non-empty join filter #4869 (ygf11)
  • Support for SQL Natural Join #4863 [sql] (Jefffrey)
  • Minor: Move test data into datafusion/core/tests/data #4855 (alamb)
  • Covariance single row input & null skipping #4852 (korowa)
  • Document ability to select directly from files in datafusion-cli #4851 (alamb)
  • Fix push_down_projection through a distinct #4849 (Jefffrey)
  • Support using var/var_pop/stddev/stddev_pop in window expressions with custom frames #4848 (jonmmease)
  • Update variance/stddev to work with single values #4847 (jonmmease)
  • Implement retract_batch for AvgAccumulator #4846 (jonmmease)
  • Support wildcard select on multiple columns using joins #4840 [sql] (Jefffrey)
  • Orthogonalize distribution and sort enforcement rules into EnforceDistribution and EnforceSorting #4839 (mustafasrepo)
  • support select .. FROM 'parquet.file' in datafusion-cli #4838 (unconsolable)
  • Remove tests from sql_integration that were ported to sqllogictest #4836 (matthewwillian)
  • Add TPC-H sqllogictest and remove some duplicated tests #4802 (jackwener)

16.0.0-rc1 (2023-01-07)

Full Changelog

Implemented enhancements:

  • Move the ExtractEquijoinPredicate behind the SubqueryFilterToJoin #4759
  • Remove the config datafusion.execution.coalesce_target_batch_size #4756
  • SimplifyExpressions will fail when rebuilding an equijoin with an alias #4754
  • Provide a constructor for the ConfigOptions with HashMap<String, String> #4752
  • Non-deprecated support for planning SQL without DDL #4720
  • Add regression tests for planning TPC-DS queries #4718
  • Move the extracting join keys logic to optimizer #4710
  • Support compression in IPCWriter #4708
  • Support prepared statement parameter type inference #4700
  • PruningPredicate Use Physical not Logical Predicate #4695
  • Support for executing infinite files #4692
  • Add a sort rule to remove unnecessary SortExecs from physical plan #4686
  • Install protoc automatically when building datafusion/proto crate #4684
  • Make DfSchema wrap SchemaRef #4680
  • Reorder the physical plan optimizer rules #4678
  • Inconsistent behavior with PostgreSQL to decide Window Expressions ordering #4641
  • Returns error too late when parsing invalid file compression type. #4636
  • Make OptimizerConfig a Trait #4631
  • Move Optimize onto DataFrame #4626
  • Make LogicalPlanBuilder Consuming #4622
  • Make DataFrame Consuming #4621
  • Rules don't need to recurse inside themselves #4613
  • [window function] Support min/max with a self-defined sliding window #4603
  • Add try_optimize for all_rules #4598
  • Refine the physical plan serialization and deserialization #4597
  • Normalize datafusion configuration names #4595
  • Add need_data_exchange in the ExecutionPlan to indicate whether a physical operator needs data exchange #4585
  • Bump Datafusion sql-parser dependency to 0.28 #4573
  • Duplicated TPC-H tests exist #4563
  • user-defined aggregate function as window function #4552
  • Convert a Prepare Logical Plan into a Logical Plan with all parameters replaced with values #4550
  • FileStream requires fake ObjectStore when ParquetFileReaderFactory is used #4533
  • Avoid reading the entire file in ChunkedStore #4524
  • Enrich filter statistics predictions with estimated column boundaries #4518
  • Show window frame info in physical plan #4509
  • Add sqllogictest auto labeler #4507
  • Optimize is_distinct_from / is_not_distinct_from #4482
  • Add the ability to convert window-function-related logical plans to proto #4480
  • Make window-function-related structs public #4479
  • Improve partition file explain plan display to show groupings #4466
  • Add support for non-column key for equijoin when eliminating cross join to inner join #4442
  • Remove the schema checking from CrossJoinExec::try_new #4431
  • Initial support for prepared statement #4426
  • Add support for NTILE built-in Window Function #4403
  • Add Support for MIN, MAX Aggregate Functions when run with custom window frames #4402
  • Support INSERT INTO statement #4397
  • Enhancement: split the SQL planner into smaller modules #4392
  • Proposal: Improve the join keys of logical plan #4389
  • Add MergeSubqueryAlias rule #4383
  • Optimizer rule support subqueryAlias #4381
  • Rewrite simple regex expressions #4370
  • Revisit get_statistics_with_limit() method in datasource mod #4323
  • Support for type coercion for a (Timestamp, Utf8) pair #4311
  • Replace decimal operations with the arrow-rs kernels #4289
  • change date_part return types to f64 #3997
  • Better api for setting ConfigOptions from SessionContext #3908
  • Make ConfigOptions easier to work with #3886
  • An asynchronous version of CatalogList/CatalogProvider/SchemaProvider #3777
  • Allow configs to be set with string values #3500
  • support scientific notation for SQL literals #3448
  • Adopt physical plan serde from arrow-ballista #3257
  • Improve codebase readability and error messages by consistently handling downcasting #3152
  • Re-enable where_clauses_object_safety #3081
  • optimize/simplify the literal data type and remove unnecessary cast, try_cast #3031
  • Move datafusion-substrait crate into arrow-datafusion repo #2646
  • [enhancement] Rules don't need to recurse inside themselves #2620
  • Add support for GROUPING SETS syntax in SQL planner #2469
  • Optimize EXISTS subquery expressions by rewriting as semi-join #2351
  • Add Delta Lake TableProvider #525
  • Support window functions with window frame #361

Fixed bugs:

  • Bug in PushdownFilter rule causes filters to change incorrectly #4822
  • Unlimited memory consumption in RepartitionExec #4816
  • Physical Optimizer Config Mutation Doesn't Take Effect #4806
  • cargo test failed error: linking with cc failed: exit status: 1 #4790
  • Parquet files generated by DataFusion cannot be read by Apache Spark #4782
  • datafusion-physical-expr doesn't compile when blake3/traits-preview is enabled #4781
  • Multiple ways to express like / ilike / not like / not ilike #4765
  • SessionState::optimize and SessionState::create_physical_plan Don't Update Query Start Time #4747
  • Page Filtering Incorrectly Handles Pages with Different Row Counts #4744
  • cargo test failing on master due to tpcds_logical_q41 stackoverflow #4728
  • PruningPredicate Different Evaluation Context from Query #4693
  • Skipping optimizer rule due to create_name not supporting wildcard #4681
  • Create physical plan bug: got Arrow schema with 1 and DataFusion schema with 0 #4677
  • Timestamp <-> Date32 compare doesn't work #4672
  • Incorrect use of the clamp function #4654
  • Fix the clippy errors #4653
  • Filter Null Keys Update Not Taking Effect #4638
  • Should not generate duplicate sort keys from Window expr's partition by keys #4635
  • Bug in common_sub_expression_eliminate #4575
  • Confusing “Bare” in doesn't exist messages #4571
  • having shouldn't include alias in projection #4556
  • wrong comment about having #4554
  • drop view t1, t2, ... and drop table t1, t2, ... silently ignores arguments past the first #4531
  • Extract from timestamp doesn't support nanosecond #4528
  • prepare_select_exprs doesn't need outer_query_schema #4526
  • Table names with periods are not handled correctly #4513
  • Push_down_projection pushes a redundant column #4486
  • Planner doesn't generate SubqueryAlias #4483
  • Planner generates replicated Projection | SubqueryAlias #4481
  • apply_table_alias will ignore alias_name when columns is empty. #4454
  • Fix output_ordering of WindowAggExec #4438
  • Incorrect error for plus/minus operations over timestamps and dates #4420
  • Optimization rule filter_push_down causes FieldNotFound error #4401
  • Should not convert a normal non-inner join to Cross Join when there are non-equal Join conditions #4363
  • MemoryConsumer::try_grow Underflow #4328
  • Potential MemoryManager Deadlock #4325
  • create external table should fail to parse if syntax is incorrect #4262
  • Nullif func states support for Boolean type, but fails if this is attempted #4205
  • ProjectionPushDown rule doesn't consider the alias in projection #4174
  • Stack overflow planning complex query #4065
  • Cannot use extract <part> on the value of now() #3980
  • Bug with intervals and logical and/or #3944
  • CoalesceBatches doesn't provide correct elapsed_compute info in metrics #3894
  • Panic in to_timestamp_micros function when the timestamp is too large #3832
  • Optimizer casts decimals to different values on different platforms #3791
  • CSV inference reads in the whole file to memory, regardless of row limit #3658
  • after type coercion CommonSubexprEliminate will produce invalid projection #3635
  • panic at attempt to multiply with overflow when doing math on Decimal128 columns #3437
  • Precedence bug with date comparison to date plus interval #3408
  • Median aggregation using DataFrame panics: “AggregateState is not a scalar aggregate” #3105
  • date_part doesn't work for now() #3096
  • hash_join panics when join keys have different data types #2877
  • Memory manager triggers unnecessary spills #2829
  • Address performance/execution plan of TPCH query 9 #77

Documentation updates:

  • Add a new open source project that uses DataFusion as its query engine #4768 (francis-du)

Closed issues:

  • move the tests in planner #4798
  • Make it easier to update sqllogictest test expected output (“test script completion mode”) #4570
  • Make ConfigOption names into an Enum #4517
  • Implement null / empty string handling for sqllogictest #4500
  • Write a blog about parquet predicate pushdown #3464
  • Ensure column names are equivalent with or without optimization #1123

Merged pull requests: